Computing per-locus TR statistics
=================================

Tools used: mergeSTR, statSTR

This vignette shows how to use :code:`mergeSTR` to merge two VCF files and :code:`statSTR` to compute statistics across different sample groups for an example TR locus. It uses the example VCF files :code:`ceu_ex.vcf.gz` and :code:`yri_ex.vcf.gz` available at https://github.com/gymrek-lab/TRTools/tree/master/example-files. These VCFs were generated by GangSTR on samples sequenced by the 1000 Genomes Project. They have already been sorted and indexed.

After downloading the VCF files, we can use :code:`mergeSTR` to merge them into a single VCF::

	mergeSTR --vcfs ceu_ex.vcf.gz,yri_ex.vcf.gz --out merged

This will create the output :code:`merged.vcf`. We should zip and index the file::

	bgzip merged.vcf
	tabix -p vcf merged.vcf.gz

Now, we can use :code:`statSTR` to compute various TR-level statistics across all samples or different groups of samples. Below, we'll use statSTR to compute mean allele length, heterozygosity, and allele counts, separately for each population at each locus::

	# Get the CEU and YRI sample lists
	bcftools query -l yri_ex.vcf.gz > yri_samples.txt
	bcftools query -l ceu_ex.vcf.gz > ceu_samples.txt

	# Run statSTR on region chr21:35348646-35348646 (hg38)
	statSTR \
        --vcf merged.vcf.gz \
    	--samples yri_samples.txt,ceu_samples.txt \
    	--sample-prefixes YRI,CEU \
    	--out stdout \
    	--mean --het --acount \
    	--use-length \
    	--region chr21:34351482-34363028

Let's go through what each option did:

* :code:`--vcf` gives the required input VCF file. Since we'll be using the :code:`--region` option, the file needs to be sorted, bgzipped, and indexed.
* :code:`--out` is the required output prefix. Use :code:`stdout` as in the above example to output to the terminal screen.
* :code:`--samples` gives a comma-separated list of files containing sample groups to plot separately. Here we want to compute YRI and CEU stats separately and have provided two files with the teo sample lists.
* :code:`--sample-prefixes` tells the script what to name the outputs for the different sample groups. These labels get used in the column headers of the output file.
* :code:`--region`: tells the script which region to process. Here we're just processing a couple loci for the sake of example.
* :code:`--mean`: outputs a column with the mean repeat length (in terms of number of repeat units)
* :code:`--het`: outputs a column with heterozygosity of the locus
* :code:`--acount`: outputs a column with allele counts at each locus
* :code:`--use-length`: Means to consider only allele lengths, rather than sequences, when computing statistics. 

Here, since we have two labeled sample lists, columns will be named: :code:`<stat>-YRI` and :code:`<stat>-CEU`, where stat is either mean, het, or acount.

The above command outputs the following::

	chrom	start	end	acount-YRI	acount-CEU	het-YRI	het-CEU	mean-YRI	mean-CEU
	chr21	34351482	34351499	9.0:190,10.0:24	9.0:192,10.0:4	0.19914403004629233	0.03998334027488548	9.11214953271028	9.020408163265305
	chr21	34358397	34358411	14.0:19,15.0:180,16.0:15	15.0:196	0.279718752729496	0.0	14.981308411214952	15.0
	chr21	34361277	34361288	3.0:214	3.0:196	0.0	0.0	3.0	3.0
	chr21	34362436	34362447	3.0:214	3.0:196	0.0	0.0	3.0	3.0
	chr21	34362864	34362881	3.0:212	3.0:196	0.0	0.0	3.0	3.0
	chr21	34363028	34363039	4.0:65,5.0:149	4.0:142,5.0:54	0.42296270416630277	0.3992086630570595	4.696261682242991	4.275510204081632

So we can see, for example, that at the TR at chr21:34351482-34351499, there were 109 YRI alleles with 9 repeats and 24 with 10 repeats. Similarly, the mean length was 9.1 in YRI and 9.0 in CEU.