Comparing TR calls across different genotypers

Tools used: mergeSTR, compareSTR

This vignette shows how to use mergeSTR to merge VCFs from multiple samples into a single VCF, and compareSTR to compare VCF files generated by different genotypers (HipSTR and ExpansionHunter) using the same set of reference TRs. In this example, we use VCF files available at https://github.com/gymrek-lab/TRTools/tree/master/example-files:

  • NA12878_chr21_eh.sorted.vcf.gz, NA12891_chr21_eh.sorted.vcf.gz, and NA12892_chr21_eh.sorted.vcf.gz generated using ExpansionHunter on three separate samples

  • trio_chr21_hipstr.sorted.vcf.gz generated using HipSTR run jointly on all three samples.

Both VCFs were generated using the same reference set of TRs to make them comparable. We’ll also need the file hg19.fa.fai for adding the appropriate header lines to the ExpansionHunter VCF files. (You can generate this file from a reference genome fasta file using samtools faidx).

First, we’ll want to merge the three samples called separately by ExpansionHunter into a single VCF file using mergeSTR. We’ll need to do one VCF cleanup using bcftools reheader to make sure the appropriate contig lines are present in the VCF file:

# Add contig lines to the headers
bcftools reheader -f hg19.fa.fai -o NA12878_chr21_eh.reheader.vcf.gz  NA12878_chr21_eh.sorted.vcf.gz
bcftools reheader -f hg19.fa.fai -o NA12891_chr21_eh.reheader.vcf.gz  NA12891_chr21_eh.sorted.vcf.gz
bcftools reheader -f hg19.fa.fai -o NA12892_chr21_eh.reheader.vcf.gz  NA12892_chr21_eh.sorted.vcf.gz
tabix -p vcf NA12878_chr21_eh.reheader.vcf.gz
tabix -p vcf NA12891_chr21_eh.reheader.vcf.gz
tabix -p vcf NA12892_chr21_eh.reheader.vcf.gz

# Merge all the EH samples to one VCF
mergeSTR --vcfs NA12878_chr21_eh.reheader.vcf.gz,NA12891_chr21_eh.reheader.vcf.gz,NA12892_chr21_eh.reheader.vcf.gz \
    --out trio_chr21_eh

# Bgzip and index the output VCF to get ready for compareSTR
bgzip trio_chr21_eh.vcf
tabix -p vcf trio_chr21_eh.vcf.gz

Now, we have a file from each tool (trio_chr21_hipstr.sorted.vcf.gz and trio_chr21_eh.vcf.gz) containing genotypes for all three samples. We can use compareSTR to compare these:

compareSTR --vcf1 trio_chr21_hipstr.sorted.vcf.gz --vcf2 trio_chr21_eh.vcf.gz \
   --vcftype1 hipstr --vcftype2 eh --out hipstr_vs_eh

This command tells compareSTR to compare these two VCFs. By specifying --vcftype1 and --vcftype2 we are telling it what format the two VCFs are in. Note, TRTools will try to automatically infer this anyway, and make sure what we infer matches what was specified.

This should output several text files:

  • hipstr_vs_eh-overall.tab will give overall concordance info:

    period    concordance-seq    concordance-len    r2                numcalls
    ALL       0.9533784988411481 0.9741486896059903 0.734005622224775 11218
    

Since we did not specify the --period option, we are considering “ALL” repeat units. This file shows the overall percent of matching calls based on allele sequences (concordance-seq) and allele lengths (concordance-len), the squared Pearson between genotypes (sum of allele lengths) (r2), and the number of calls compared (numcalls).

  • hipstr_vs_eh-callcompare.tab gives a call by call comparison.

  • hipstr_vs_eh-locuscompare.tab gives per-locus comparison statistics.

  • hipstr_vs_eh-samplecompare.tab gives per-sample comparison statistics::

    sample metric-conc-seq metric-conc-len numcalls NA12891 0.9540475554368154 0.9746192893401016 3743 NA12878 0.9526864474739375 0.9740711039828923 3741 NA12892 0.953401178361007 0.9737546866630958 3734

mergeSTR will also output several plots to visualize the comparison results:

  • hipstr_vs_eh-samplecompare.pdf

_images/hipstr_vs_eh-samplecompare.jpg
  • hipstr_vs_eh-locuscompare.pdf

_images/hipstr_vs_eh-locuscompare.jpg
  • hipstr_vs_eh-bubble-periodALL.pdf

_images/hipstr_vs_eh-bubble-periodALL.jpg