GATK, the Genome Analysis Toolkit, is a software suite for identifying genetic variants in DNA sequencing data. Its variant calling workflow involves four main steps: preprocessing the raw sequencing data, calling potential variants, performing joint genotyping across multiple samples, and filtering out false positives to ensure high-quality variant calls.
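The first step in GATK variant calling is preprocessing the raw sequencing data. Reads are aligned to a reference genome with a tool such as BWA, and the resulting alignments are sorted by coordinate and indexed as BAM files. Duplicate reads, which typically arise from PCR amplification or optical artifacts, are marked so that they are not counted as independent evidence. Base Quality Score Recalibration (BQSR) then uses known variant sites to separate true variation from sequencing error and adjusts the base quality scores of the reads so that they better reflect empirical error rates.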
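The preprocessing steps above can be sketched as a short Python helper that only assembles the command lines, without running anything. This is a minimal sketch under stated assumptions: the function name, the file naming scheme, the read group values, and the use of samtools for sorting and indexing are illustrative choices, not part of the original description. It also assumes a GATK4-style `gatk` wrapper and a reference that has already been indexed for BWA and GATK.

```python
def preprocessing_commands(sample, ref, fq1, fq2, known_sites, threads=4):
    """Build the preprocessing steps as argv lists; nothing is executed here."""
    # Read group tag for bwa; "\\t" stays a literal backslash-t, which bwa expands.
    # The ID/SM/PL values are placeholders (assumption).
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    # Hypothetical file naming scheme, chosen only for this sketch.
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    recal_table = f"{sample}.recal.table"
    final_bam = f"{sample}.recal.bam"

    # BaseRecalibrator takes one --known-sites argument per resource file.
    known_args = []
    for vcf in known_sites:
        known_args += ["--known-sites", vcf]

    return [
        # 1. Align reads to the reference.
        ["bwa", "mem", "-t", str(threads), "-R", read_group, "-o", sam, ref, fq1, fq2],
        # 2. Coordinate-sort the alignments into a BAM file.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # 3. Mark duplicate reads, then index the result.
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam,
         "-M", f"{sample}.dup_metrics.txt"],
        ["samtools", "index", dedup_bam],
        # 4. BQSR: build the recalibration table, then apply it.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup_bam,
         *known_args, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup_bam,
         "--bqsr-recal-file", recal_table, "-O", final_bam],
    ]


if __name__ == "__main__":
    for cmd in preprocessing_commands("sampleA", "ref.fa", "A_1.fq.gz", "A_2.fq.gz",
                                      ["known_sites.vcf.gz"]):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps the sketch easy to inspect; each list could be passed to `subprocess.run(cmd, check=True)` once the tools and input files are in place.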
HaplotypeCaller is GATK's primary variant caller for germline SNPs and indels. It first identifies active regions, stretches of the genome where the reads show evidence of variation. Within each active region, it performs local de novo assembly of the reads to construct candidate haplotypes representing different alleles. The haplotypes are realigned to the reference to identify candidate variant sites, and each read is aligned to each haplotype with a pair hidden Markov model (PairHMM) to compute likelihoods from the observed read evidence. Finally, the most probable genotype is determined for each sample at each candidate site. When run in GVCF mode (-ERC GVCF), HaplotypeCaller writes a per-sample gVCF that is used as input for joint genotyping.
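The final genotyping step can be illustrated with a toy calculation. The sketch below is a simplified model, not HaplotypeCaller's actual implementation: it assumes a diploid sample, treats each haplotype as an allele, takes the per-read likelihoods as given (in practice they come from the read-to-haplotype alignment), and ignores priors. For a genotype made of haplotypes H1 and H2, each read contributes P(read | H1)/2 + P(read | H2)/2, and the contributions are multiplied across reads. All function names and numbers are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def genotype_log_likelihoods(read_hap_likelihoods):
    """Return log10 likelihoods for every diploid genotype.

    read_hap_likelihoods: one dict per read, mapping haplotype name to
    P(read | haplotype). All dicts must share the same keys.
    """
    haplotypes = sorted(read_hap_likelihoods[0])
    scores = {}
    for h1, h2 in combinations_with_replacement(haplotypes, 2):
        log_l = 0.0
        for per_read in read_hap_likelihoods:
            # Each read is equally likely to come from either chromosome copy.
            log_l += math.log10(0.5 * per_read[h1] + 0.5 * per_read[h2])
        scores[(h1, h2)] = log_l
    return scores


def most_likely_genotype(read_hap_likelihoods):
    """Pick the genotype with the highest likelihood (no prior applied)."""
    scores = genotype_log_likelihoods(read_hap_likelihoods)
    return max(scores, key=scores.get)


# Toy data: three reads fit the reference haplotype, three fit the alternate.
reads = [{"ref": 0.9, "alt": 0.001}] * 3 + [{"ref": 0.001, "alt": 0.9}] * 3

if __name__ == "__main__":
    print(most_likely_genotype(reads))  # ('alt', 'ref'), i.e. heterozygous
```

With an even split of read support, the heterozygous genotype wins because every read is explained reasonably well by one of its two haplotypes, whereas either homozygous genotype has to explain half of the reads as errors.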
Joint genotyping is a crucial step when analyzing multiple samples together. The per-sample gVCF files are first consolidated, either with GenomicsDBImport into a GenomicsDB workspace or with CombineGVCFs into a single multi-sample gVCF. GenotypeGVCFs then performs joint analysis across all samples simultaneously, considering evidence from the entire cohort. This approach improves sensitivity for rare variants, ensures consistent genotype calls across samples, and enables population-level allele frequency estimation.
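The two-stage joint genotyping workflow can be sketched in the same command-building style. This example takes the GenomicsDBImport route, which consolidates the per-sample gVCFs into a GenomicsDB workspace that GenotypeGVCFs reads through a gendb:// path. The function name, file names, and interval are hypothetical placeholders, and the sketch assumes a GATK4-style `gatk` wrapper.

```python
def joint_genotyping_commands(gvcfs, ref, workspace, interval, out_vcf):
    """Build the consolidation and joint genotyping steps as argv lists."""
    # Stage 1: import every per-sample gVCF into one GenomicsDB workspace.
    # GenomicsDBImport needs at least one interval (-L) to operate on.
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", workspace,
                  "-L", interval]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]

    # Stage 2: genotype the whole cohort from the workspace.
    genotype_cmd = ["gatk", "GenotypeGVCFs",
                    "-R", ref,
                    "-V", f"gendb://{workspace}",
                    "-O", out_vcf]
    return [import_cmd, genotype_cmd]


if __name__ == "__main__":
    for cmd in joint_genotyping_commands(
            ["sampleA.g.vcf.gz", "sampleB.g.vcf.gz"],
            "ref.fa", "cohort_db", "chr20", "cohort.vcf.gz"):
        print(" ".join(cmd))
```

Because the workspace is built per interval, a larger cohort would typically repeat both stages for each chromosome or interval list and merge the resulting VCFs afterwards.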
In summary, GATK variant calling is a multi-step process for accurately identifying genetic variants. The pipeline begins with preprocessing of the sequencing data, followed by variant detection using HaplotypeCaller's local assembly approach. Joint genotyping across multiple samples enhances sensitivity, particularly for rare variants, and filtering methods such as Variant Quality Score Recalibration (VQSR) help distinguish true variants from sequencing artifacts, yielding high-quality variant calls suitable for downstream genomic analysis.