Computing per-locus TR statistics

Tools used: mergeSTR, statSTR

This vignette shows how to use mergeSTR to merge two VCF files and statSTR to compute statistics across different sample groups for an example TR locus. It uses the example VCF files ceu_ex.vcf.gz and yri_ex.vcf.gz available at https://github.com/gymrek-lab/TRTools/tree/master/example-files. These VCFs were generated by GangSTR on samples sequenced by the 1000 Genomes Project. They have already been sorted and indexed.

After downloading the VCF files, we can use mergeSTR to merge them into a single VCF:

mergeSTR --vcfs ceu_ex.vcf.gz,yri_ex.vcf.gz --out merged

This will create the output merged.vcf. We should zip and index the file:

bgzip merged.vcf
tabix -p vcf merged.vcf.gz

Now, we can use statSTR to compute various TR-level statistics across all samples or different groups of samples. Below, we’ll use statSTR to compute mean allele length, heterozygosity, and allele counts, separately for each population at each locus:

# Get the CEU and YRI sample lists
bcftools query -l yri_ex.vcf.gz > yri_samples.txt
bcftools query -l ceu_ex.vcf.gz > ceu_samples.txt

# Run statSTR on region chr21:35348646-35348646 (hg38)
statSTR \
--vcf merged.vcf.gz \
--samples yri_samples.txt,ceu_samples.txt \
--sample-prefixes YRI,CEU \
--out stdout \
--mean --het --acount \
--use-length \
--region chr21:34351482-34363028

Let’s go through what each option did:

  • --vcf gives the required input VCF file. Since we’ll be using the --region option, the file needs to be sorted, bgzipped, and indexed.

  • --out is the required output prefix. Use stdout as in the above example to output to the terminal screen.

  • --samples gives a comma-separated list of files containing sample groups to plot separately. Here we want to compute YRI and CEU stats separately and have provided two files with the teo sample lists.

  • --sample-prefixes tells the script what to name the outputs for the different sample groups. These labels get used in the column headers of the output file.

  • --region: tells the script which region to process. Here we’re just processing a couple loci for the sake of example.

  • --mean: outputs a column with the mean repeat length (in terms of number of repeat units)

  • --het: outputs a column with heterozygosity of the locus

  • --acount: outputs a column with allele counts at each locus

  • --use-length: Means to consider only allele lengths, rather than sequences, when computing statistics.

Here, since we have two labeled sample lists, columns will be named: <stat>-YRI and <stat>-CEU, where stat is either mean, het, or acount.

The above command outputs the following:

chrom   start   end     acount-YRI      acount-CEU      het-YRI het-CEU mean-YRI        mean-CEU
chr21   34351482        34351499        9.0:190,10.0:24 9.0:192,10.0:4  0.19914403004629233     0.03998334027488548     9.11214953271028        9.020408163265305
chr21   34358397        34358411        14.0:19,15.0:180,16.0:15        15.0:196        0.279718752729496       0.0     14.981308411214952      15.0
chr21   34361277        34361288        3.0:214 3.0:196 0.0     0.0     3.0     3.0
chr21   34362436        34362447        3.0:214 3.0:196 0.0     0.0     3.0     3.0
chr21   34362864        34362881        3.0:212 3.0:196 0.0     0.0     3.0     3.0
chr21   34363028        34363039        4.0:65,5.0:149  4.0:142,5.0:54  0.42296270416630277     0.3992086630570595      4.696261682242991       4.275510204081632

So we can see, for example, that at the TR at chr21:34351482-34351499, there were 109 YRI alleles with 9 repeats and 24 with 10 repeats. Similarly, the mean length was 9.1 in YRI and 9.0 in CEU.