annotaTR

AnnotaTR takes in a TR genotype VCF file and outputs a new file (VCF or PGEN) with additional INFO/FORMAT fields

Current use cases include:

Annotating a TR VCF file with Beagle-imputed genotypes to add back TR metadata from a reference panel
Computing dosages for each genotype, which can be used for downstream association testing.

AnnotaTR can be used to do one or both of these tasks in a single command.

Usage

The following is a basic annotaTR command:

annotaTR \
  --vcf <vcf file> \
  --out <string> \
  [annotation options]

Required parameters:

--vcf <VCF>: VCF file with TR genotypes that has been generated by a supported genotyping tool (or imputed from one using Beagle). The VCF file must be bgzipped and indexed.
--out <string>: prefix to name output files.

Other general parameters:

--vcftype <string>: Which genotyping tool generated the input VCF. Default = auto. Necessary if it cannot be automatically inferred. One of: gangstr, advntr, hipstr, longtr, eh, popstr.
--outtype <string>: Which output format to generate. Supported output types are vcf or pgen. If multiple types are listed (e.g. --outtype vcf pgen), all specified output formats are generated. By default, only VCF output is generated at $outprefix.vcf. If PGEN output is specified, the files $outprefix.pgen, $outprefix.pvar, and $outprefix.psam are generated. See more on the pgen format here
--vcf-outtype <string>: Type of VCF output to produce. This option is ignored unless using --outtype vcf. Options: z=compressed VCF, v=uncompressed VCF, b=compressed BCF, u=uncompressed BCF, s=stdout
--region <str>: Restrict to processing a specific genomic region. Syntax: chr:start-end.
--chunk-size <int>: If writing PGEN output, load dosages in chunks of this many variants at a time. This can avoid out-of-memory errors if you are processing very large VCF files.

In addition to specifying input and output options above, you must specify at least one annotation operation to perform. These are described below.

Annotation options

annotaTR offers several annotation options:

Computing dosages

Dosages are quantitative representations of individual-level genotypes. A major use case for dosages is to perform association testing between TR genotypes and a phenotype of interest. Major advantages of using dosages are:

They can be used as input to downstream tools that support association testing with dosage values, even if those tools do not explicitly support TRs or multi-allelic sites. This has currently been tested using PGEN files output by associaTR as input to plink.
This enables testing TRs alongside other variant types using the same pipelines, streamlining workflows that include multiple variant types.

The following command computes dosages:

annotaTR \
  --vcf <vcf file> \
  --out <string> \
  --dosages <string>

where the argument to the --dosages option specifies the method to compute dosages. It must be one of:

bestguess: dosages are computed by summing the length of alleles (given in number of repeat units) for each call. e.g. for a genotype heterozygous for 3 and 4 copies of a repeat, the best guess dosage will be 7.
bestguess_norm: same as above, but scaled to be between 0 and 2. This is required if generating pgen output, since pgen only supports dosage values in this range.
beagpleap: dosages for imputed calls are computed in a way that takes into account imputation uncertainty. This option requires the Beagle VCF FORMAT fields AP1 and AP2 to be present, which give the probability of each alternate allele on each of the two haplotypes of an individual. AP-based dosages are generated by computing a weighted sum of allele lengths for each haplotype and added together to get the total dosage for the genotype ($\sum_{hap=1,2} \sum_{allele a} P(a_{hap})len(a)$). In the case of imputation with no uncertainty (all AP fields are either 0 or 1), dosages computed in this way should match best guess dosages.
beagleap_norm: same as above, but scaled to be between 0 and 2.

Additional dosage options:

--warn-on-AP-error: Output a warning, rather than quit, if invalid AP fields are encountered (i.e. if AP values are not found, do not sum to 1 or are negative, or if invalid dosage values are encountered after normalizing).

If annotating dosages, the following fields are added to VCF output files:

FORMAT field TRDS. This is a float value giving the estimated TR dosage for each call.
INFO field DSLEN. This records the minimum and maximum allele lengths, which can be useful to know if computing normalized dosages but you later want to recover the unnormalized dosage values.

Dosages may also be output to PGEN format. Because dosages are not explicitly supported for multi-allelic sites, we output an accompanying dummy pvar file in which:

All TR variants are listed with reference alleles of “A” and alternate alleles of “T”. These are just placeholders since those fields are required.
We include the DSLEN field described above under VCF format.

Annotating imputed TR VCFs

Note this functionality replaces the script trtools_prep_beagle_vcf.sh.

When TRs cannot be directly genotyped from sequencing data, an alternative is to impute them using Beagle with a phased reference panel of SNPs+TRs (e.g. see our latest reference panel from EnsembleTR).

Using Beagle to impute TRs will output a VCF with both SNPs+TRs, and strips the TR-specific fields required by TRTools (see the callers page for more details). annotaTR can be used to add back these fields:

annotaTR \
  --vcf <imputed vcf file> \
  --out <string> \
  --ref-panel <refpanel vcf file> \
  [--outtype <string>]

where:

--vcf gives the imputed VCF file, which can be the file directly output by Beagle.
--ref-panel gives the VCF file of the reference panel used for imputation with Beagle.

Additional relevant options:

--match-refpanel-on <string>: indicates how to match loci between the reference panel and the target VCF. Options: locid, rawalleles, trimmedalleles (Default:locid)
- locid matches on the ID in the VCF file. If your reference panel does not have informative IDs for TRs (e.g. all are set to “.”), this option will not work and annotaTR will output an error
- rawalleles means loci are matched on chrom:pos:ref:alt
- trimmedalleles means loci are matched on chrom:pos:ref:alt but ref and alt alleles are trimmed to remove common prefixes/suffixes. The trimmedalleles option must be used if you merged samples in your target VCF file using bcftools merge, since that tool will modify alleles to remove common sequence (see this issue)
--ignore-duplicates: This flag outputs a warning if duplicate loci are detected in the reference. If this flag is not set and a duplicate locus is detected, the program quits.
--update-ref-alt: Update the REF/ALT allele sequences from the reference panel. Fixes issue with alleles being chopped after bcftools merge. Use with caution as this assumes allele order is exactly the same between the refpanel and target VCF. Only works when matching on locus id.

If generating a VCF output file, this command will output a new file containing only STRs, with the following fields added back depending on the genotyper used to generate the reference panel:

For HipSTR-based or LongTR-based reference panels: INFO fields START, END, PERIOD are added
For adVNTR: INFO fields RU, VID are added
For GangSTR: INFO field RU is added
For ExpansionHunter: INFO fields RU, VARID, RL are added

If generating PGEN output, these fields will not be explicitly output but will be added during processing of the input VCF to enable computing dosages to output to the PGEN file. In all cases only TRs (and not SNPs or other variants in the reference panel) are included in the final output file.

Notes on output files

VCF output files are supported for all operations (currently: annotation of Beagle output and computing dosages)
PGEN output is only supported when computing normalized dosages.

Example commands

Below are annotaTR examples using data files that can be found at https://github.com/gymrek-lab/TRTools/tree/master/example-files and https://github.com/gymrek-lab/TRTools/tree/master/trtools/testsupport:

# Add normalized dosages to a TR-containing VCF file output by GangSTR
annotaTR --vcf trio_chr21_gangstr.sorted.vcf.gz --out test_gangstr_dosage --dosages bestguess

# Add non-normalized dosages to a TR-containing VCF file output by GangSTR
annotaTR --vcf trio_chr21_gangstr.sorted.vcf.gz --out test_gangstr_dosage_norm --dosages bestguess_norm

# Add normalized dosages to a TR-containing VCF file output by HipSTR and output to PGEN
annotaTR --vcf trio_chr21_hipstr.sorted.vcf.gz --vcftype hipstr --dosages bestguess_norm --out test_hipstr_dosage --outtype pgen

# Add normalized dosages and annotate a VCF file with TR genotypes (and SNPs) imputed by
# Beagle and output to both VCF and PGEN
annotaTR --vcf 1kg_snpstr_21_first_100k_second_50_STRs_imputed.vcf.gz --vcftype hipstr --ref-panel 1kg_snpstr_21_first_100k_first_50_annotated.vcf.gz --outtype vcf pgen --dosages bestguess_norm --out test_beagle

# Compute dosages based on Beagle AP field
# Require setting --match-refpanel-on since locus IDs are "." in this panel
annotaTR --vcf beagle_imputed_withap.vcf.gz --vcftype hipstr --ref-panel beagle_refpanel.vcf.gz --match-refpanel-on trimmedalleles --dosages beagleap --out test_beagleap