Comparing TR calls across different parameter sets

Tools used: compareSTR

This vignette shows how to use compareSTR to compare two VCF files generated using the same set of reference TRs. In this example, we use VCF files c57_ex1.vcf.gz and c57_ex2.vcf.gz available at https://github.com/gymrek-lab/TRTools/tree/master/example-files. These VCF files were generated by GangSTR on a mouse dataset using two different sets of stutter parameters.

To run compareSTR:

compareSTR \
--vcf1 c57_ex1.vcf.gz \
--vcf2 c57_ex2.vcf.gz \
--vcftype1 gangstr \
--vcftype2 gangstr \
--out c57-compare \
--stratify-fields DP \
--stratify-binsizes 0:50:10 \
--bubble-min -10 --bubble-max 10

Let’s go through what each option did:

  • --vcf1 and --vcf2 give the name of the two VCF files being compared.

  • --vcftype1 and --vcftype2 give the types of the two VCF files being compared

  • --out is the required output prefix.

  • --stratify-fields DP and --stratify-binsizes 0:50:10 tell the script to compute overall concordance metrics stratifying by the DP field (coverage) in bins of 10 ranging from 0 to 50. Since we didn’t specify which file to apply the stratification to with --stratify-file, it gets applied to both.

  • --bubble-min -10 --bubble-max 10 give the axis range of the bubble plot to show (see below).

This will output a number of files. We’ll peek at a couple of them.

  • c57-compare-overall.tab will give overall concordance info:

    period  DP      concordance-seq concordance-len r2      numcalls
    ALL     NA      0.9760415527610716      0.9760415527610716      0.9909303005952199      91450
    ALL     0.0-10.0        1.0     1.0     1.0     266
    ALL     10.0-20.0       0.9426900584795321      0.9426900584795321      0.9995028621738471      855
    ALL     20.0-30.0       0.9285714285714286      0.9285714285714286      0.9950024662641009      1218
    ALL     30.0-40.0       0.9548577036310107      0.9548577036310107      0.9780821396586636      3057
    ALL     40.0-50.0       0.9739065606361829      0.9739065606361829      0.9880478071563313      8048
    

Here, the first line gives the overall concordance. The lines below are stratified by DP value. For each set, the concordance (percent of calls matching), r2 (Pearson correlation between allele calls), and number of calls in each group is shown.

  • c57-compare-callcompare.tab gives a call by call comparison, which is useful for looking at exactly which loci/sample calls were discordant.

  • c57-compare-locuscompare.tab and c57-compare-samplecompare.tab give locus and sample level concordance. In this case these are not very interesting since we have only one sample being compared. But in other settings, these files can be used to identify poorly performing samples or loci.

  • c57-compare-bubble-periodALL.pdf plots the genotypes in file 1 vs. file 2. Genotypes are given in terms of the sum across both alleles of the repeat units different from the reference. Bubble sizes give the number of calls represented by each point. This plot is shown below:

_images/c57-compare-bubble-periodALL.jpg

Since we used --bubble-min -10 --bubble-max 10, the plot goes only from -10 to +10. This plot can be useful in evaluating the effects of varying different parameters on calling. For example, we can see from this plot that there are quite a few calls that are called longer in file 2 vs. file 1.