Computing per-locus TR statistics ================================= Tools used: mergeSTR, statSTR This vignette shows how to use :code:`mergeSTR` to merge two VCF files and :code:`statSTR` to compute statistics across different sample groups for an example TR locus. It uses the example VCF files :code:`ceu_ex.vcf.gz` and :code:`yri_ex.vcf.gz` available at https://github.com/gymrek-lab/TRTools/tree/master/example-files. These VCFs were generated by GangSTR on samples sequenced by the 1000 Genomes Project. They have already been sorted and indexed. After downloading the VCF files, we can use :code:`mergeSTR` to merge them into a single VCF:: mergeSTR --vcfs ceu_ex.vcf.gz,yri_ex.vcf.gz --out merged This will create the output :code:`merged.vcf`. We should zip and index the file:: bgzip merged.vcf tabix -p vcf merged.vcf.gz Now, we can use :code:`statSTR` to compute various TR-level statistics across all samples or different groups of samples. Below, we'll use statSTR to compute mean allele length, heterozygosity, and allele counts, separately for each population at each locus:: # Get the CEU and YRI sample lists bcftools query -l yri_ex.vcf.gz > yri_samples.txt bcftools query -l ceu_ex.vcf.gz > ceu_samples.txt # Run statSTR on region chr21:35348646-35348646 (hg38) statSTR \ --vcf merged.vcf.gz \ --samples yri_samples.txt,ceu_samples.txt \ --sample-prefixes YRI,CEU \ --out stdout \ --mean --het --acount \ --use-length \ --region chr21:34351482-34363028 Let's go through what each option did: * :code:`--vcf` gives the required input VCF file. Since we'll be using the :code:`--region` option, the file needs to be sorted, bgzipped, and indexed. * :code:`--out` is the required output prefix. Use :code:`stdout` as in the above example to output to the terminal screen. * :code:`--samples` gives a comma-separated list of files containing sample groups to plot separately. Here we want to compute YRI and CEU stats separately and have provided two files with the teo sample lists. * :code:`--sample-prefixes` tells the script what to name the outputs for the different sample groups. These labels get used in the column headers of the output file. * :code:`--region`: tells the script which region to process. Here we're just processing a couple loci for the sake of example. * :code:`--mean`: outputs a column with the mean repeat length (in terms of number of repeat units) * :code:`--het`: outputs a column with heterozygosity of the locus * :code:`--acount`: outputs a column with allele counts at each locus * :code:`--use-length`: Means to consider only allele lengths, rather than sequences, when computing statistics. Here, since we have two labeled sample lists, columns will be named: :code:`-YRI` and :code:`-CEU`, where stat is either mean, het, or acount. The above command outputs the following:: chrom start end acount-YRI acount-CEU het-YRI het-CEU mean-YRI mean-CEU chr21 34351482 34351499 9.0:190,10.0:24 9.0:192,10.0:4 0.19914403004629233 0.03998334027488548 9.11214953271028 9.020408163265305 chr21 34358397 34358411 14.0:19,15.0:180,16.0:15 15.0:196 0.279718752729496 0.0 14.981308411214952 15.0 chr21 34361277 34361288 3.0:214 3.0:196 0.0 0.0 3.0 3.0 chr21 34362436 34362447 3.0:214 3.0:196 0.0 0.0 3.0 3.0 chr21 34362864 34362881 3.0:212 3.0:196 0.0 0.0 3.0 3.0 chr21 34363028 34363039 4.0:65,5.0:149 4.0:142,5.0:54 0.42296270416630277 0.3992086630570595 4.696261682242991 4.275510204081632 So we can see, for example, that at the TR at chr21:34351482-34351499, there were 109 YRI alleles with 9 repeats and 24 with 10 repeats. Similarly, the mean length was 9.1 in YRI and 9.0 in CEU.