PrancSTR

prancSTR quantifies evidence of somatic mosaicism at STRs using VCF files generated by HipSTR.

Tool overview

Population-level heterogeneity arises due to germline mutations that occur before the formation of the zygote and are inherited by all cells in the offspring. However, heterogeneity within an individual may also exist due to somatic mutations that occur post-zygotically in only a sub-population of cells.

prancSTR is a tool for detecting somatic mosaicism at STRs using high throughput sequencing data. It is designed to be run downstream of a germline TR genotyper. It currently only supports analysis of VCF files output by HipSTR. Note that prancSTR does not require a matched control sample as input. prancSTR uses the following fields from HipSTR VCFs for detecting mosaicism:

  • GT is used to obtain estimated diploid repeat lengths

  • MALLREADS is used to obtain the observed distribution of copy numbers across all reads aligning to a locus.

  • Stutter parameters are obtained from INFRAME_UP, INFRAME_DOWN, and INFRAME_PGEOM.
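
To give a feel for what the three stutter parameters describe, the sketch below evaluates a simple stutter model. It assumes the commonly described HipSTR-style interpretation: INFRAME_UP (u) and INFRAME_DOWN (d) are the probabilities that a read gains or loses repeat units, and INFRAME_PGEOM (rho) is the parameter of a geometric distribution over the size of that change. The function name is hypothetical and this is not prancSTR's own code.

```python
def stutter_step_probability(delta, u, d, rho):
    """Probability that stutter shifts the observed repeat count by `delta` units.

    u, d and rho stand for INFRAME_UP, INFRAME_DOWN and INFRAME_PGEOM.
    Assumed model: no shift with probability 1 - u - d; otherwise the step
    size k = |delta| follows a geometric distribution rho * (1 - rho)**(k - 1).
    """
    if delta == 0:
        return 1.0 - u - d
    direction = u if delta > 0 else d
    k = abs(delta)
    return direction * rho * (1.0 - rho) ** (k - 1)


# Illustrative parameter values
u, d, rho = 0.01, 0.07, 0.31
for delta in (-2, -1, 0, 1, 2):
    print(delta, round(stutter_step_probability(delta, u, d, rho), 5))
```

Under this model, reads that differ from a germline allele by one unit are partly explained by stutter, so a candidate mosaic allele needs more supporting reads than stutter alone would predict.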

prancSTR is in beta.

Usage

To run prancSTR use the following command:

prancSTR \
  --vcf <vcf file> \
  --out <string> \
  [filter options]

Required parameters:

  • --vcf <VCF> Input VCF file, generated by HipSTR.

  • --out <string> Output file prefix. Use stdout to print the output to standard output.

prancSTR will output a tab-delimited file quantifying evidence of mosaicism at each STR either to stdout or to $out.tab. See a description of the output file below.

Other general parameters:

  • --region <string>: Restrict to the region chr:start-end. VCF file must be bgzipped and indexed to use this option.

  • --samples <string>: Restrict to the given list of samples. Samples are comma separated.

  • --vcftype <string>: Specify the tool that generated the STR VCF file. prancSTR currently fails on VCFs from any tool other than HipSTR.

  • --only-passing: Skip VCF records with a non-passing FILTER column.

  • --output-all: Force the tool to output results for all loci. Overrides --only-passing.

  • --readfield <string>: Specify which VCF format field output by HipSTR to utilize for extracting read information. We recommend setting this to “MALLREADS”. “ALLREADS” is also accepted but we have found that it produces unreliable results.

  • --debug: Print helpful debug messages.

  • --quiet: Restrict printing of any messages.

  • --version: Print the version of the tool.

Notes:

  • For the --only-passing option, the FILTER column is generated by dumpSTR or another upstream filtering tool based on user-defined filtering parameters. It is not determined by prancSTR.

  • By default, STRs with minimal evidence of mosaicism or suspicious read count fields are skipped to save time. These include loci where reads show evidence of only a single allele or for which the germline alleles called by HipSTR are not actually supported by any reads. To force results to be output for all loci, use the --output-all parameter. This overrides --only-passing.
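
The default skipping rule described in the notes above can be pictured with a small check. This is a hypothetical helper written only for illustration, not prancSTR's implementation. In particular, it assumes that a call counts as suspicious when either germline allele has no supporting reads.

```python
def looks_uninformative(allele_a, allele_b, read_counts):
    """Return True if a call would plausibly be skipped by default.

    read_counts maps a repeat length (in the same units as the alleles)
    to the number of reads supporting it. Hypothetical helper, not part
    of prancSTR.
    """
    observed = {allele for allele, n in read_counts.items() if n > 0}
    # Reads show evidence of only a single allele
    if len(observed) <= 1:
        return True
    # Assumption: suspicious when either germline allele has no reads
    if allele_a not in observed or allele_b not in observed:
        return True
    return False


print(looks_uninformative(5, 5, {5: 20}))                    # True
print(looks_uninformative(3, 5, {2: 4, 3: 4, 5: 4, 6: 1}))   # False
print(looks_uninformative(3, 5, {4: 10, 6: 2}))              # True
```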

See Example Commands for examples running prancSTR under different settings.

Output files

The prancSTR output file contains mosaicism predictions for each locus. Note that this file contains statistics for all tested loci, and it is up to the user to filter for high-confidence mosaic allele calls. The output is a tab-delimited file with one row summarizing the evidence of mosaicism for each call analyzed, with the following columns:

  • sample: The ID of the sample being considered.

  • chrom: Chromosome of the STR being considered.

  • pos: Position of the STR on the chromosome.

  • locus: Reference ID for the short tandem repeat.

  • motif: The nucleotide sequence of the repeat unit.

  • A: The first germline allele for the given STR in repeat units relative to the reference (copied from HipSTR).

  • B: The second germline allele for the given STR in repeat units relative to the reference (copied from HipSTR).

  • C: Candidate mosaic allele inferred by prancSTR in repeat units relative to the reference (inferred by prancSTR).

  • f: Estimated mosaic allele fraction (inferred by prancSTR).

  • pval: The p-value for the test of the null hypothesis that f=0.

  • reads: The number of reads supporting each allele (copied from the HipSTR VCF field specified by --readfield).

  • mosaic_support: The number of reads that support the mosaic allele.

  • stutter parameter u: The probability that stutter error causes an increase in observed STR size (copied from HipSTR INFRAME_UP field).

  • stutter parameter d: The probability that stutter error causes a decrease in observed STR size (copied from HipSTR INFRAME_DOWN field).

  • stutter parameter rho: Parameter for geometric step size distribution of stutter errors (copied from HipSTR INFRAME_PGEOM field).

  • quality factor: Quality score of the germline genotype (copied from HipSTR Q field).

  • read depth: Reports the total depth/number of informative reads for all samples at the locus (copied from HipSTR DP field).
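
As an illustration of how the reads column can be consumed downstream, the sketch below parses one value into a mapping from allele to read count. The "allele|count" pairs separated by ";" are inferred from the example rows in this section, and the function name is hypothetical.

```python
def parse_reads(field):
    """Parse 'allele|count;allele|count;...' into {allele: count}."""
    counts = {}
    for item in field.split(";"):
        allele, count = item.split("|")
        counts[int(allele)] = int(count)
    return counts


# reads value and candidate allele C taken from the NA06989 example row
reads = parse_reads("-1|16;3|4;4|9")
candidate = 3
print(reads)
print(reads.get(candidate, 0))
```

In the example rows, the read count at the candidate allele C coincides with the reported mosaic_support value. That is an observation about those rows, not a documented guarantee.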

Several example output lines from running prancSTR are shown below:

sample   chrom  pos      locus          motif  A   B  C  f         pval          reads             mosaic_support  stutter parameter u  stutter parameter d  stutter parameter rho  quality factor  read depth
NA07022  chr1   987287   Human_STR_285  T      3   5  2  0.244079  1.530865e-04  2|4;3|4;5|4;6|1   4               0.01                 0.07                 0.31                   0.98            21
NA12716  chr1   987287   Human_STR_285  T      5   5  4  0.265958  1.842418e-05  4|6;5|15;9|1      6               0.01                 0.07                 0.31                   1.00            34
NA06989  chr1   1002414  Human_STR_295  T      -1  4  3  0.150960  6.045544e-05  -1|16;3|4;4|9     4               0.02                 0.02                 0.69                   1.00            50
NA10847  chr1   1002414  Human_STR_295  T      4   5  3  0.280689  1.290112e-09  2|1;3|7;4|6;5|11  7               0.02                 0.02                 0.69                   1.00            55
NA12347  chr1   1002414  Human_STR_295  T      5   5  4  0.262358  1.029537e-05  3|1;4|5;5|14;6|1  5               0.02                 0.02                 0.69                   0.99            51

As a starting point, we suggest filtering output on the following parameters to obtain candidate mosaic sites:

  • pval: less than or equal to 0.05/(number of STRs tested). The number of STRs tested is equal to the number of data lines in the prancSTR output file.

  • read depth: greater than or equal to 10.

  • quality factor: greater than or equal to 0.8.

  • mosaic_support: greater than or equal to 3.

  • f: less than or equal to 0.3. Higher f values are often indicative of a heterozygous genotype miscalled as homozygous.
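
A minimal sketch of these suggested filters, using only the Python standard library, might look like the following. It assumes the output file has a header line whose column names match the list above (for example "quality factor" and "read depth"); adjust the keys if your file differs. The function name and the third demo row are made up for illustration.

```python
import csv
import io


def filter_candidates(handle, alpha=0.05, min_depth=10, min_quality=0.8,
                      min_support=3, max_f=0.3):
    """Return rows of a prancSTR-style table that pass the suggested filters."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    if not rows:
        return []
    # Multiple-testing threshold: alpha / number of data lines
    pval_threshold = alpha / len(rows)
    kept = []
    for row in rows:
        if (float(row["pval"]) <= pval_threshold
                and float(row["read depth"]) >= min_depth
                and float(row["quality factor"]) >= min_quality
                and float(row["mosaic_support"]) >= min_support
                and float(row["f"]) <= max_f):
            kept.append(row)
    return kept


# Small in-memory table: two rows from the example output, one made-up row
HEADER = ["sample", "f", "pval", "mosaic_support", "quality factor", "read depth"]
DEMO_ROWS = [
    ["NA07022", "0.244079", "1.530865e-04", "4", "0.98", "21"],
    ["NA10847", "0.280689", "1.290112e-09", "7", "1.00", "55"],
    ["made_up_sample", "0.45", "1.0e-06", "9", "0.95", "40"],  # fails f <= 0.3
]
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER] + DEMO_ROWS) + "\n"

for row in filter_candidates(io.StringIO(EXAMPLE_TSV)):
    print(row["sample"], row["f"], row["pval"])
```

To apply the same function to a real run, pass an open file handle for the $out.tab file instead of the in-memory table.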

Example Commands

Below are prancSTR examples using HipSTR VCFs. Data files can be found at https://github.com/gymrek-lab/TRTools/tree/master/example-files:

# Example command running prancSTR for only one chromosome with a HipSTR output file
# --only-passing skips VCF records with non-passing filters
prancSTR \
   --vcf example-files/CEU_subset.vcf.gz \
   --out CEU_chr1  \
   --vcftype hipstr \
   --only-passing \
   --region chr1

# Example command running prancSTR for only one sample
# --only-passing skips VCF records with non-passing filters
prancSTR \
   --vcf example-files/CEU_subset.vcf.gz \
   --only-passing \
   --out NA12878_chr1 \
   --samples NA12878

Citations

If you utilize prancSTR in your work, please cite:

Aarushi Sehgal, Helyaneh Ziaei Jam, Andrew Shen, Melissa Gymrek, Genome-wide detection of somatic mosaicism at short tandem repeats, Bioinformatics, Volume 40, Issue 8, August 2024, btae485, https://doi.org/10.1093/bioinformatics/btae485