DumpSTR filters VCF files with TR genotypes, performing call-level and locus-level filtering, and outputs a filtered VCF file.
To run dumpSTR use the following command:
dumpSTR \ --vcf <vcf file> \ --out <string> \ [filter options]
--vcf <VCF>VCF file to filter that has been generated by a supported genotyping tool.
--out <string>prefix to name output files.
DumpSTR will output a filtered VCF file named
$out.vcf, a sample log file
$out.samplog.tab, and a locus log file
See a description of output files below.
Other general parameters:
--vcftype <string>: Which genotyping tool generated the input VCF. Default =
auto. Necessary if it cannot be automatically inferred. One of:
--zipto output a bgzipped filtered VCF (
$out.vcf.gz) and tabix index (
$out.vcf.gz.tbi) instead of
--num-records <int>: only process this many records from the input VCF file
The available filters are described below.
See Example Commands for running dumpSTR on different supported TR genotypers using example VCFs in this repository.
DumpSTR offers the following types of filters:
These filters are not specific to any tool and can be applied to any VCF file:
--min-locus-callrate <float>: Filters loci with too few calls
--min-locus-hwep <float>: Filters loci departing from Hardy Weinberg equilibrium at some p-value threshold. Based on a two-sided binomial test comparing the observed vs. expected percent of calls that are homozygous
--min-locus-het <float>: Filters loci with low heteroyzgosity. Heterozygosity is equal to:
where p_i is the frequency of allele i
--max-locus-het <float>: Filters loci with high heterozygosity
--use-length: Use allele lengths, rather than sequences, to compute heterozygosity and HWE (only relevant for HipSTR, which reports sequence level differences in TR alleles)
--filter-regions <BEDFILE[,BEDFILE12,...]>: Filter TRs overlapping the specified set of regions. Must be used with
--filter-regions-names. Can supply a comma-separated list to each to apply multiple region filters. Bed files must be sorted and tabix-indexed.
--filter-regions-names <string[,string2,...]>: Filter names for each BED file specified in
--filter-hrun: Filter repeats with long homopolymer runs. Only used for HipSTR VCF files otherwise ignored. Filters pentanucleotides with homopolymer runs of 5bp or longer, or hexanucleotides with homopolymer runs of 6bp or longer.
--drop-filtered: Do not output loci that were filtered, or loci with no calls remaining after filtering.
TRs passing all locus-level filters will be marked as “PASS” in the FILTER field.
Those failing will have a list of failing filters in the FILTER field.
If a TR has no calls remaining after applying the call-level filters, then it will be
marked as “NO_CALLS_REMAINING” and not “PASS”.
drop-filtered is specified, only loci passing all filters will be output.
Otherwise failing loci will be output as well. In that case the only
indication that they have failed will be the FILTER field. The call-level genotypes will
still be present in the output VCF.
Different call-level filters are available for each supported TR genotyping tool.
Genotypes which fail any call-level filters are replaced by no calls and the
FILTER format field for that call is set to the reason(s) that call was filtered.
For example, if running with
--hipstr-min-call-DP 100, a call with DP
34 would be filtered with the text
Calls which were already nocalls before the dumpSTR run are left unchanged and
their FILTER format field is set to NOCALL.
Caveat: No call level filters have yet been written for VCFs imputed by Beagle.
AdVNTR call-level filters¶
--advntr-min-call-DP <int>: Minimum call coverage. Based on DP field.
--advntr-max-call-DP <int>: Maximum call coverage. Based on DP field.
--advntr-min-spanning <int>: Minimum spanning read count (SR field)
--advntr-min-flanking <int>: Minimum flanking read count (FR field)
--advntr-min-ML <float>: Minimum value of maximum likelihood (ML field)
ExpansionHunter call-level filters¶
--eh-min-call-LC <int>: Minimum call coverage. Based on LC field.
--eh-max-call-LC <int>: Maximum call coverage. Based on LC field.
GangSTR call-level filters¶
--gangstr-min-call-DP <int>: Minimum call coverage. Based on DP field.
--gangstr-max-call-DP <int>: Maximum call coverage. Based on DP field.
--gangstr-min-call-Q <float>: Minimum call quality score. Based on Q field.
--gangstr-expansion-prob-het <float>: Expansion prob-value threshold. Filters calls with probability of heterozygous expansion less than this. Based on QEXP field.
--gangstr-expansion-prob-hom <float>: Expansion prob-value threshold. Filters calls with probability of homozygous expansion less than this. Based on QEXP field.
--gangstr-expansion-prob-total <float>: Expansion prob-value threshold. Filters calls with probability of homozygous or heterozygous expansion less than this. Based on QEXP field.
--gangstr-filter-span-only: Filter out all calls that only have spanning read support. Based on RC field.
--gangstr-filter-spanbound-only: Filter out all reads except spanning and bounding. Based on RC field.
--gangstr-filter-badCI: Filter regions where the ML estimate is not in the CI. Based on REPCN and REPCI fields.
HipSTR call-level filters¶
--hipstr-max-call-flank-indel <float>: Maximum call flank indel rate. Computed as DFLANKINDEL/DP
--hipstr-max-call-stutter <float>: Maximum call stutter rate. PCR stutter artifacts add or remove copies of an STR’s motif to sequencing reads, resulting in observed STR sizes that differ from the size of the underlying genotype. (Source). Computed as DSTUTTER/DP
--hipstr-min-supp-reads <int>: Minimum supporting reads for each allele. Based on ALLREADS and GB fields
--hipstr-min-call-DP <int>: Minimum call coverage. Based on DP field.
--hipstr-max-call-DP <int>: Maximum call coverage. Based on DP field.
--hipstr-min-call-Q <float>: Minimum call quality score. Based on Q field.
PopSTR call-level filters¶
--popstr-min-call-DP <int>: Minimum call coverage. Based on DP field.
--popstr-max-call-DP <int>: Maximum call coverage. Based on DP field.
--popstr-require-support <int>: Require each allele call to have at least n supporting reads. Based on AD field.
DumpSTR outputs the following files:
$out.vcf: Filtered VCF file. Filtered loci have a list of failing filters in the FILTER column. An additional FORMAT:FILTER field is added to each call. This is set to PASS for passing calls. For failing calls, this is set to a list of filter reasons and the genotype is set to missing.
$out.samplog.tab: Output sample-level log info. This is a tab-delimited file with columns: sample, number of calls, and mean coverage at that sample across calls that survived dumpSTR filtering. This file also contains a column for each call-level filter indicating how many calls for that sample were filtered due to that reason. e.g. column “AdVNTRCallMinDepth” would indicate the number of adVNTR calls for that sample filtered due to low call depth (based on
--advntr-min-call-DP). Some calls are filtered for more than one reason, so the sum of filtered calls across all reasons will likely be more than the number of filtered calls.
$out.loclog.tab: Output locus-level log info. It contains the mean call rate at passing TR loci. It also contains a separate line for each filter with the number of TR loci failing that filter.
dumpSTR examples using VCFs from supported TR genotypers. Data files can be found at https://github.com/gymrek-lab/TRTools/tree/master/example-files:
# AdVNTR dumpSTR --vcf NA12878_chr21_advntr.sorted.vcf.gz --advntr-min-call-DP 100 --out test_dumpstr_advntr # ExpansionHunter dumpSTR --vcf NA12878_chr21_eh.sorted.vcf.gz --out test_dumpstr_eh --eh-min-call-LC 50 --num-records 10 --drop-filtered # GangSTR dumpSTR --vcf trio_chr21_gangstr.sorted.vcf.gz --out test_dumpstr_gangstr --min-locus-callrate 0.9 --num-records 10 # HipSTR dumpSTR --vcf trio_chr21_hipstr.sorted.vcf.gz --vcftype hipstr --out test_dumpstr_hipstr --filter-hrun --num-records 10 # PopSTR dumpSTR --vcf trio_chr21_popstr.sorted.vcf.gz --out test_dumpstr_popstr --min-locus-callrate 0.9 --popstr-min-call-DP 10 --num-records 100