Supported TR Genotypers

TRTools currently supports 5 tandem repeat genotypers. It also supports the Beagle imputation software (see below). We summarize them in the first table and provide some basic parameters of their functionality in the second. For more information on a genotyper, please see its website linked below.

| Genotyper (version tested) | Use case notes |
|---|---|
| AdVNTR (v1.3.3) | Infers allele lengths. May alternatively identify putative frameshift mutations within VNTRs (6+bp repeat units). Designed for targeted genotyping of VNTRs. May be run on large panels of TRs but is compute-intensive. |
| ExpansionHunter (v3.2.2) | Handles repeats with structures such as interruptions or nearby repeats. Designed for targeted genotyping of expansions at known pathogenic TRs but may be run genome-wide on short and expanded TRs using a custom TR panel. |
| GangSTR (v2.4.4) | Designed for genome-wide genotyping of short or expanded TRs. |
| HipSTR (v0.6.2) | Designed for genome-wide genotyping of STR (1-6bp repeat units) alleles shorter than the read length. Can phase repeats with SNPs. |
| PopSTR (v2.0) | Designed for genome-wide genotyping of short or expanded TRs. |


| Genotyper (version tested) | Repeat unit lengths | Alleles longer than reads? | Allele type inferred | # TRs in reference | Sequencing technology | # Samples at a time |
|---|---|---|---|---|---|---|
| AdVNTR (v1.3.3) | 6-100bp | No | Length, frameshifts | 158,522 (genic hg19) | Illumina, PacBio | Single |
| ExpansionHunter (v3.2.2) | 1-6bp. Can handle complex repeat structures specified by regular expressions | Yes | Length | 25 (hg19) | PCR-free Illumina | Single |
| GangSTR (v2.4.4) | 1-20bp | Yes | Length | 829,233 (hg19) | Paired-end Illumina | Many |
| HipSTR (v0.6.2) | 1-9bp | No | Length, sequence | 1,620,030 (hg19) | Illumina | Many |
| PopSTR (v2.0) | 1-6bp | Yes | Length | 540,1401 (hg38) | Illumina | Many |
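
As a rough illustration of how the second table can be used to shortlist a genotyper, the sketch below encodes its "Repeat unit lengths" and "Alleles longer than reads?" columns as a lookup. The dictionary layout and the function name candidate_genotypers are hypothetical and not part of TRTools; the values are simply transcribed from the table.

```python
# Values transcribed from the second table above (not a TRTools API).
# min_unit/max_unit: supported repeat unit length in bp.
# long_alleles: whether alleles longer than the read length can be genotyped.
GENOTYPERS = {
    "AdVNTR": {"min_unit": 6, "max_unit": 100, "long_alleles": False},
    "ExpansionHunter": {"min_unit": 1, "max_unit": 6, "long_alleles": True},
    "GangSTR": {"min_unit": 1, "max_unit": 20, "long_alleles": True},
    "HipSTR": {"min_unit": 1, "max_unit": 9, "long_alleles": False},
    "PopSTR": {"min_unit": 1, "max_unit": 6, "long_alleles": True},
}


def candidate_genotypers(unit_len, longer_than_reads=False):
    """Return genotypers whose table entry covers the given repeat unit length.

    If longer_than_reads is True, only genotypers marked as handling alleles
    longer than the read length are returned.
    """
    return sorted(
        name
        for name, props in GENOTYPERS.items()
        if props["min_unit"] <= unit_len <= props["max_unit"]
        and (props["long_alleles"] or not longer_than_reads)
    )


print(candidate_genotypers(3))
print(candidate_genotypers(8, longer_than_reads=True))
```

This only reflects the two columns it encodes; sequencing technology, reference panel and use case notes should still be checked against the tables.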

Since each of these tools takes as input a list of TRs to genotype, they can also be used on custom panels of TR loci. Tool information and reference panel numbers shown above are based on downloads from the GitHub repository of each tool as of July 2, 2020.

TRTools can be extended to support other genotypers that generate VCF files. We welcome community contributions to help support them. If that interests you, please see Contributing for more information.

Beagle

The Beagle software can take genotypes called by a TR genotyper in a set of reference samples and impute them into other samples that do not have directly genotyped TRs. TRTools supports TR genotypes produced by any of the above genotypers, except PopSTR, and then imputed into other samples with Beagle. Unless its docs specifically say otherwise, each tool in this suite can be used on Beagle VCFs as if those VCFs were produced directly by the underlying TR genotyper, with no additional flags or arguments needed, as long as the steps below were followed to make sure the Beagle VCF is properly formatted.

Caveats:

  • Beagle provides phased best-guess genotypes for each imputed sample at each TR locus. When run with the ap or gp flags, Beagle will also output probabilities for each possible haplotype or genotype, respectively. These probabilities are also called dosages. While dosages are often more informative for downstream analyses (for instance, association testing) than the best-guess genotypes located in the GT format field, TRTools currently does not support dosage-based analyses and will only look at the GT field. Feel free to submit PRs with features that handle dosages (see the Contributing docs).

  • At each locus Beagle returns the most probable phased genotype. This will often but not always correspond to the most probable unphased genotype. For instance, it is possible that P(A|A) > P(A|B) and P(A|A) > P(B|A), but P(A/A) = P(A|A) < P(A|B) + P(B|A) = P(A/B). Similarly, it is possible that P(A|B) > P(C|D) and P(A|B) > P(D|C), but P(A/B) = P(A|B) + P(B|A) < P(C|D) + P(D|C) = P(C/D). TRTools currently does not take this into account and just uses the phased genotypes returned by Beagle. If you deem this to be an issue, feel free to submit PRs to help TRTools take this into account (see the Contributing docs).

  • For callers that return sequences, not just lengths (e.g. HipSTR), if there are loci with multiple plausible sequences of the same length, then it is possible that the most probable genotype returned by Beagle does not have the most probable length. For example, the following could be true of a single haplotype: Len(S_1) = L_1, Len(S_2) = L_1, Len(S_3) = L_2 and P(S_1) < P(S_3), P(S_2) < P(S_3) but P(S_3) < P(S_1) + P(S_2).
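
The last two caveats can be made concrete with a small numeric sketch. The probabilities below are made up for illustration, and the helper functions are hypothetical (they are not part of TRTools or Beagle); they only show how the most probable phased genotype or sequence can differ from the most probable unphased genotype or length.

```python
from collections import defaultdict


def best_phased(phased_probs):
    """Most probable phased genotype: the kind of call found in the GT field."""
    return max(phased_probs, key=phased_probs.get)


def best_unphased(phased_probs):
    """Sum probabilities over both phasings of each allele pair, then take the max."""
    unphased = defaultdict(float)
    for (a, b), p in phased_probs.items():
        unphased[tuple(sorted((a, b)))] += p
    return max(unphased, key=unphased.get)


def best_length(seq_probs):
    """Sum probabilities of sequences sharing a length, then take the max."""
    by_len = defaultdict(float)
    for seq, p in seq_probs.items():
        by_len[len(seq)] += p
    return max(by_len, key=by_len.get)


# Made-up values: P(A|A) = 0.4, P(A|B) = 0.3, P(B|A) = 0.3
phased = {("A", "A"): 0.4, ("A", "B"): 0.3, ("B", "A"): 0.3}
print(best_phased(phased))    # ('A', 'A')
print(best_unphased(phased))  # ('A', 'B'), since 0.3 + 0.3 > 0.4

# Made-up values for one haplotype: two 6bp sequences and one 8bp sequence
seqs = {"ACACAC": 0.3, "ACATAC": 0.3, "ACACACAC": 0.4}
print(max(seqs, key=seqs.get))  # ACACACAC (length 8)
print(best_length(seqs))        # 6, since 0.3 + 0.3 > 0.4
```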

An overview of steps to perform before Beagle imputation:

  • The samples being imputed into must have directly genotyped loci that are also genotyped in the reference samples. This allows those samples to be ‘matched’ with samples in the reference.

  • The genotypes of both the reference samples and samples of interest must be phased. That can be done by statistically phasing the genotypes prior to running Beagle imputation.

  • The reference samples must also not contain any missing genotypes. Possible methods for dealing with that include removing loci with missing genotypes or using imputation to fill in the missing genotypes prior to imputing the TRs.
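
One way to sanity-check the last two requirements before running Beagle is to count missing and unphased genotypes in the reference VCF. The sketch below is a minimal example that parses uncompressed VCF text directly; the function name check_reference_genotypes is hypothetical and not part of TRTools, and it assumes every record has a GT key in its FORMAT column.

```python
def check_reference_genotypes(vcf_lines):
    """Return (n_missing, n_unphased) across all sample genotypes.

    vcf_lines: iterable of text lines from an uncompressed VCF.
    A genotype counts as missing if it contains '.', and as unphased if it
    uses the '/' separator instead of '|'.
    """
    n_missing = 0
    n_unphased = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        gt_idx = fields[8].split(":").index("GT")  # position of GT in FORMAT
        for sample in fields[9:]:
            gt = sample.split(":")[gt_idx]
            if "." in gt:
                n_missing += 1
            elif "/" in gt:
                n_unphased += 1
    return n_missing, n_unphased


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t100\t.\tA\tG\t.\t.\t.\tGT\t0|1\t1|1",
    "chr1\t200\t.\tC\tT\t.\t.\t.\tGT:DP\t./.:0\t0/1:12",
]
print(check_reference_genotypes(example))  # (1, 1)
```

Both counts should be zero for the reference panel; the samples being imputed only need the phasing check.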

The VCFs that Beagle outputs need to be preprocessed before use by TRTools. We provide a tool, trtools_prep_beagle_vcf.sh, to run on those VCFs. After running this script, the files should be usable by any of the tools in TRTools.

In case of error, it may be useful to know what steps the script attempts to perform:

  • It copies over source and command meta header lines from the reference panel to the imputed VCF so that it is clear which genotyper’s syntax is being used to represent the STRs in the VCF.

  • It copies over contig and ALT lines, which are required by downstream tools including mergeSTR and are good practice to include in the VCF header.

  • It annotates each STR with the necessary INFO fields from the reference panel that Beagle dropped from the imputed VCF.

  • The imputed VCF contains both TR loci and the shared loci (commonly SNPs) that were used for the imputation. This script removes the non-STR loci (identified as those loci not having STR-specific INFO fields).
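
To illustrate the last step, the sketch below keeps only records that carry at least one TR-specific INFO key. It is not the script's actual implementation, only a minimal Python rendering of the idea. The function name keep_tr_records is hypothetical, and the choice of PERIOD as a TR-specific key is an assumption for this example; the relevant keys depend on the genotyper that produced the reference panel.

```python
def keep_tr_records(vcf_lines, tr_info_keys):
    """Yield header lines plus records whose INFO column has a TR-specific key.

    vcf_lines: iterable of text lines from an uncompressed VCF.
    tr_info_keys: INFO keys that mark a record as a TR (assumed by the caller).
    """
    wanted = set(tr_info_keys)
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # always keep header lines
            continue
        info = line.rstrip("\n").split("\t")[7]
        keys = {entry.split("=", 1)[0] for entry in info.split(";")}
        if keys & wanted:
            yield line


records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\tsnp1\tA\tG\t.\t.\tDR2=0.9;AF=0.1\tGT\t0|1",
    "chr1\t200\tstr1\tACAC\tACACAC\t.\t.\tPERIOD=2;END=203\tGT\t0|1",
]
# PERIOD is used here as an illustrative TR-specific key (an assumption).
kept = list(keep_tr_records(records, {"PERIOD"}))
print(len(kept))  # 2: the header line and the TR record
```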