DumpSTR

DumpSTR filters VCF files with TR genotypes, performing call-level and locus-level filtering, and outputs a filtered VCF file.

Usage

To run dumpSTR use the following command:

dumpSTR \
  --vcf <vcf file> \
  --out <string> \
  [filter options]

Required parameters:

  • --vcf <VCF> VCF file to filter that has been generated by a supported genotyping tool.

  • --out <string> prefix to name output files.

DumpSTR will output a filtered VCF file named $out.vcf, a sample log file $out.samplog.tab, and a locus log file $out.loclog.tab. See a description of output files below.

Other general parameters:

  • --vcftype <string>: Which genotyping tool generated the input VCF. Default = auto. Necessary if it cannot be automatically inferred. One of: gangstr, advntr, hipstr, longtr, eh, popstr.

  • --zip to output a bgzipped filtered VCF ($out.vcf.gz) and tabix index ($out.vcf.gz.tbi) instead of $out.vcf.

  • --num-records <int>: only process this many records from the input VCF file

The available filters are described below.

See Example Commands for running dumpSTR on different supported TR genotypers using example VCFs in this repository.

Warning

dumpSTR<6.0.0 and scipy>=1.12.0 are incompatible. You should update to a newer version of TRTools if you encounter an error indicating that scipy.stats.binom_test has been deprecated.

Filter options

DumpSTR offers the following types of filters:

Locus-level filters

These filters are not specific to any tool and can be applied to any VCF file:

  • --min-locus-callrate <float>: Filters loci with too few calls

  • --min-locus-hwep <float>: Filters loci departing from Hardy Weinberg equilibrium at some p-value threshold. Based on a two-sided binomial test comparing the observed vs. expected percent of calls that are homozygous

  • --min-locus-het <float>: Filters loci with low heteroyzgosity. Heterozygosity is equal to:

\[1-\sum_i p_i^2\]

where p_i is the frequency of allele i

  • --max-locus-het <float>: Filters loci with high heterozygosity

  • --use-length: Use allele lengths, rather than sequences, to compute heterozygosity and HWE (only relevant for HipSTR and LongTR, which report sequence level differences in TR alleles)

  • --filter-regions <BEDFILE[,BEDFILE12,...]>: Filter TRs overlapping the specified set of regions. Must be used with --filter-regions-names. Can supply a comma-separated list to each to apply multiple region filters. Bed files must be sorted and tabix-indexed. Note that regions are 0-based and inclusive of the start position, but exclusive of the end position.

  • --filter-regions-names <string[,string2,...]>: Filter names for each BED file specified in --filter-regions.

  • --filter-hrun: Filter repeats with long homopolymer runs. Only used for HipSTR or LongTR VCF files otherwise ignored. Filters pentanucleotides with homopolymer runs of 5bp or longer, or hexanucleotides with homopolymer runs of 6bp or longer.

  • --drop-filtered: Do not output loci that were filtered, or loci with no calls remaining after filtering.

TRs passing all locus-level filters will be marked as “PASS” in the FILTER field. Those failing will have a list of failing filters in the FILTER field. If a TR has no calls remaining after applying the call-level filters, then it will be marked as “NO_CALLS_REMAINING” and not “PASS”. If drop-filtered is specified, only loci passing all filters will be output. Otherwise failing loci will be output as well. In that case the only indication that they have failed will be the FILTER field. The call-level genotypes will still be present in the output VCF.

Call-level filters

Different call-level filters are available for each supported TR genotyping tool. Genotypes which fail any call-level filters are replaced by no calls and the FILTER format field for that call is set to the reason(s) that call was filtered. For example, if running with --hipstr-min-call-DP 100, a call with DP 34 would be filtered with the text HipSTRCallMinDepth100_34. Calls which were already nocalls before the dumpSTR run are left unchanged and their FILTER format field is set to NOCALL.

Caveat: No call level filters have yet been written for VCFs imputed by Beagle.

AdVNTR call-level filters

  • --advntr-min-call-DP <int>: Minimum call coverage. Based on DP field.

  • --advntr-max-call-DP <int>: Maximum call coverage. Based on DP field.

  • --advntr-min-spanning <int>: Minimum spanning read count (SR field)

  • --advntr-min-flanking <int>: Minimum flanking read count (FR field)

  • --advntr-min-ML <float>: Minimum value of maximum likelihood (ML field)

ExpansionHunter call-level filters

  • --eh-min-call-LC <int>: Minimum call coverage. Based on LC field.

  • --eh-max-call-LC <int>: Maximum call coverage. Based on LC field.

GangSTR call-level filters

  • --gangstr-min-call-DP <int>: Minimum call coverage. Based on DP field.

  • --gangstr-max-call-DP <int>: Maximum call coverage. Based on DP field.

  • --gangstr-min-call-Q <float>: Minimum call quality score. Based on Q field.

  • --gangstr-expansion-prob-het <float>: Expansion prob-value threshold. Filters calls with probability of heterozygous expansion less than this. Based on QEXP field.

  • --gangstr-expansion-prob-hom <float>: Expansion prob-value threshold. Filters calls with probability of homozygous expansion less than this. Based on QEXP field.

  • --gangstr-expansion-prob-total <float>: Expansion prob-value threshold. Filters calls with probability of homozygous or heterozygous expansion less than this. Based on QEXP field.

  • --gangstr-filter-span-only: Filter out all calls that only have spanning read support. Based on RC field.

  • --gangstr-filter-spanbound-only: Filter out all reads except spanning and bounding. Based on RC field.

  • --gangstr-filter-badCI: Filter regions where the ML estimate is not in the CI. Based on REPCN and REPCI fields.

HipSTR call-level filters

  • --hipstr-max-call-flank-indel <float>: Maximum call flank indel rate. Computed as DFLANKINDEL/DP

  • --hipstr-max-call-stutter <float>: Maximum call stutter rate. PCR stutter artifacts add or remove copies of an STR’s motif to sequencing reads, resulting in observed STR sizes that differ from the size of the underlying genotype. (Source). Computed as DSTUTTER/DP

  • --hipstr-min-supp-reads <int>: Minimum supporting reads for each allele. Based on ALLREADS and GB fields

  • --hipstr-min-call-DP <int>: Minimum call coverage. Based on DP field.

  • --hipstr-max-call-DP <int>: Maximum call coverage. Based on DP field.

  • --hipstr-min-call-Q <float>: Minimum call quality score. Based on Q field.

LongTR call-level filters

  • --longtr-max-call-flank-indel <float>: Maximum call flank indel rate. Computed as DFLANKINDEL/DP

  • --longtr-min-supp-reads <int>: Minimum supporting reads for each allele. Based on ALLREADS and GB fields

  • --longtr-min-call-DP <int>: Minimum call coverage. Based on DP field.

  • --longtr-max-call-DP <int>: Maximum call coverage. Based on DP field.

  • --longtr-min-call-Q <float>: Minimum call quality score. Based on Q field.

PopSTR call-level filters

  • --popstr-min-call-DP <int>: Minimum call coverage. Based on DP field.

  • --popstr-max-call-DP <int>: Maximum call coverage. Based on DP field.

  • --popstr-require-support <int>: Require each allele call to have at least n supporting reads. Based on AD field.

Output files

DumpSTR outputs the following files:

  • $out.vcf: Filtered VCF file. Filtered loci have a list of failing filters in the FILTER column. An additional FORMAT:FILTER field is added to each call. This is set to PASS for passing calls. For failing calls, this is set to a list of filter reasons and the genotype is set to missing.

  • $out.samplog.tab: Output sample-level log info. This is a tab-delimited file with columns: sample, number of calls, and mean coverage at that sample across calls that survived dumpSTR filtering. This file also contains a column for each call-level filter indicating how many calls for that sample were filtered due to that reason. e.g. column “AdVNTRCallMinDepth” would indicate the number of adVNTR calls for that sample filtered due to low call depth (based on --advntr-min-call-DP). Some calls are filtered for more than one reason, so the sum of filtered calls across all reasons will likely be more than the number of filtered calls.

  • $out.loclog.tab: Output locus-level log info. It contains the mean call rate at passing TR loci. It also contains a separate line for each filter with the number of TR loci failing that filter.

Example Commands

Below are dumpSTR examples using VCFs from supported TR genotypers. Data files can be found at https://github.com/gymrek-lab/TRTools/tree/master/example-files:

# AdVNTR
dumpSTR --vcf NA12878_chr21_advntr.sorted.vcf.gz --advntr-min-call-DP 100 --out test_dumpstr_advntr

# ExpansionHunter
dumpSTR --vcf NA12878_chr21_eh.sorted.vcf.gz --out test_dumpstr_eh --eh-min-call-LC 50 --num-records 10 --drop-filtered

# GangSTR
dumpSTR --vcf trio_chr21_gangstr.sorted.vcf.gz --out test_dumpstr_gangstr --min-locus-callrate 0.9 --num-records 10

# HipSTR
dumpSTR --vcf trio_chr21_hipstr.sorted.vcf.gz --vcftype hipstr --out test_dumpstr_hipstr --filter-hrun --num-records 10

# PopSTR
dumpSTR --vcf trio_chr21_popstr.sorted.vcf.gz --out test_dumpstr_popstr --min-locus-callrate 0.9 --popstr-min-call-DP 10 --num-records 100

# LongTR
dumpSTR --vcf longtr_testfile.vcf.gz --vcftype longtr --out test_dumpstr_longtr --min-locus-callrate 0.9 --longtr-min-call-Q 0.9