Computing per-locus TR statistics
Tools used: mergeSTR, statSTR
This vignette shows how to use mergeSTR
to merge two VCF files and statSTR
to compute statistics across different sample groups for an example TR locus. It uses the example VCF files ceu_ex.vcf.gz
and yri_ex.vcf.gz
available at https://github.com/gymrek-lab/TRTools/tree/master/example-files. These VCFs were generated by GangSTR on samples sequenced by the 1000 Genomes Project. They have already been sorted and indexed.
After downloading the VCF files, we can use mergeSTR
to merge them into a single VCF:
mergeSTR --vcfs ceu_ex.vcf.gz,yri_ex.vcf.gz --out merged
This will create the output merged.vcf
. We should zip and index the file:
bgzip merged.vcf
tabix -p vcf merged.vcf.gz
Now, we can use statSTR
to compute various TR-level statistics across all samples or different groups of samples. Below, we’ll use statSTR to compute mean allele length, heterozygosity, and allele counts, separately for each population at each locus:
# Get the CEU and YRI sample lists
bcftools query -l yri_ex.vcf.gz > yri_samples.txt
bcftools query -l ceu_ex.vcf.gz > ceu_samples.txt
# Run statSTR on region chr21:35348646-35348646 (hg38)
statSTR \
--vcf merged.vcf.gz \
--samples yri_samples.txt,ceu_samples.txt \
--sample-prefixes YRI,CEU \
--out stdout \
--mean --het --acount \
--use-length \
--region chr21:34351482-34363028
Let’s go through what each option did:
--vcf
gives the required input VCF file. Since we’ll be using the--region
option, the file needs to be sorted, bgzipped, and indexed.--out
is the required output prefix. Usestdout
as in the above example to output to the terminal screen.--samples
gives a comma-separated list of files containing sample groups to plot separately. Here we want to compute YRI and CEU stats separately and have provided two files with the teo sample lists.--sample-prefixes
tells the script what to name the outputs for the different sample groups. These labels get used in the column headers of the output file.--region
: tells the script which region to process. Here we’re just processing a couple loci for the sake of example.--mean
: outputs a column with the mean repeat length (in terms of number of repeat units)--het
: outputs a column with heterozygosity of the locus--acount
: outputs a column with allele counts at each locus--use-length
: Means to consider only allele lengths, rather than sequences, when computing statistics.
Here, since we have two labeled sample lists, columns will be named: <stat>-YRI
and <stat>-CEU
, where stat is either mean, het, or acount.
The above command outputs the following:
chrom start end acount-YRI acount-CEU het-YRI het-CEU mean-YRI mean-CEU
chr21 34351482 34351499 9.0:190,10.0:24 9.0:192,10.0:4 0.19914403004629233 0.03998334027488548 9.11214953271028 9.020408163265305
chr21 34358397 34358411 14.0:19,15.0:180,16.0:15 15.0:196 0.279718752729496 0.0 14.981308411214952 15.0
chr21 34361277 34361288 3.0:214 3.0:196 0.0 0.0 3.0 3.0
chr21 34362436 34362447 3.0:214 3.0:196 0.0 0.0 3.0 3.0
chr21 34362864 34362881 3.0:212 3.0:196 0.0 0.0 3.0 3.0
chr21 34363028 34363039 4.0:65,5.0:149 4.0:142,5.0:54 0.42296270416630277 0.3992086630570595 4.696261682242991 4.275510204081632
So we can see, for example, that at the TR at chr21:34351482-34351499, there were 109 YRI alleles with 9 repeats and 24 with 10 repeats. Similarly, the mean length was 9.1 in YRI and 9.0 in CEU.