Computing per-locus TR statistics¶
Tools used: mergeSTR, statSTR
This vignette shows how to use
mergeSTR to merge two VCF files and
statSTR to compute statistics across different sample groups for an example TR locus. It uses the example VCF files
yri_ex.vcf.gz available at https://github.com/gymrek-lab/TRTools/tree/master/example-files. These VCFs were generated by GangSTR on samples sequenced by the 1000 Genomes Project. They have already been sorted and indexed.
After downloading the VCF files, we can use
mergeSTR to merge them into a single VCF:
mergeSTR --vcfs ceu_ex.vcf.gz,yri_ex.vcf.gz --out merged
This will create the output
merged.vcf. We should zip and index the file:
bgzip merged.vcf tabix -p vcf merged.vcf.gz
Now, we can use
statSTR to compute various TR-level statistics across all samples or different groups of samples. Below, we’ll use statSTR to compute mean allele length, heterozygosity, and allele counts, separately for each population at each locus:
# Get the CEU and YRI sample lists bcftools query -l yri_ex.vcf.gz > yri_samples.txt bcftools query -l ceu_ex.vcf.gz > ceu_samples.txt # Run statSTR on region chr21:35348646-35348646 (hg38) statSTR \ --vcf merged.vcf.gz \ --samples yri_samples.txt,ceu_samples.txt \ --sample-prefixes YRI,CEU \ --out stdout \ --mean --het --acount \ --use-length \ --region chr21:34351482-34363028
Let’s go through what each option did:
--vcfgives the required input VCF file. Since we’ll be using the
--regionoption, the file needs to be sorted, bgzipped, and indexed.
--outis the required output prefix. Use
stdoutas in the above example to output to the terminal screen.
--samplesgives a comma-separated list of files containing sample groups to plot separately. Here we want to compute YRI and CEU stats separately and have provided two files with the teo sample lists.
--sample-prefixestells the script what to name the outputs for the different sample groups. These labels get used in the column headers of the output file.
--region: tells the script which region to process. Here we’re just processing a couple loci for the sake of example.
--mean: outputs a column with the mean repeat length (in terms of number of repeat units)
--het: outputs a column with heterozygosity of the locus
--acount: outputs a column with allele counts at each locus
--use-length: Means to consider only allele lengths, rather than sequences, when computing statistics.
Here, since we have two labeled sample lists, columns will be named:
<stat>-CEU, where stat is either mean, het, or acount.
The above command outputs the following:
chrom start end acount-YRI acount-CEU het-YRI het-CEU mean-YRI mean-CEU chr21 34351482 34351499 9.0:190,10.0:24 9.0:192,10.0:4 0.19914403004629233 0.03998334027488548 9.11214953271028 9.020408163265305 chr21 34358397 34358411 14.0:19,15.0:180,16.0:15 15.0:196 0.279718752729496 0.0 14.981308411214952 15.0 chr21 34361277 34361288 3.0:214 3.0:196 0.0 0.0 3.0 3.0 chr21 34362436 34362447 3.0:214 3.0:196 0.0 0.0 3.0 3.0 chr21 34362864 34362881 3.0:212 3.0:196 0.0 0.0 3.0 3.0 chr21 34363028 34363039 4.0:65,5.0:149 4.0:142,5.0:54 0.42296270416630277 0.3992086630570595 4.696261682242991 4.275510204081632
So we can see, for example, that at the TR at chr21:34351482-34351499, there were 109 YRI alleles with 9 repeats and 24 with 10 repeats. Similarly, the mean length was 9.1 in YRI and 9.0 in CEU.