SimTR

SimTR simulates next generation sequencing reads at a single TR region while modeling stutter errors common in such sequencing.

Prerequisites

simTR is a wrapper on the Illumina ART tool. To run simTR, either the command art_illumina must be in a directory on your PATH, or alternatively you can specify the path to the ART executable using the --art option. If you installed simTR via conda, ART should be installed already.

Usage

To run simTR use the following command:

simTR \
  --ref <fasta file> \
  --coords <chr:start-end> \
  --repeat-unit <str> \
  --outprefix <str>
  [additional options]

Required parameters:

  • --ref: Path to the reference fasta file

  • --coords: The coordinates of the TR from which to simulate reads. Format: chrom:start-end.

  • --repeat_unit: The sequence of the repeated unit.

  • --outprefix: Prefix to name output files.

By default, simTR will simulate paired end reads and output reads to $outprefix_1.fq and $outprefix_2.fq. Options described below allow for changing the error model or sequencing parameters.

Stutter model

Insertions or deletions of repeat units (commonly referred to as stutter errors) are simulated according to the model specified in the HipSTR manuscript. The model can be specified using three optional parameters:

  • --u <float>: Probability a read contains stutter error that increases the total number of repeat units (Default: 0.05)

  • --d <float>: Probability a read contains stutter error that decreases the total number of repeat units (Default: 0.05)

  • --rho <float>: The step size parameter. Stutter error sizes are drawn from a geometric distribution with parameter rho (Default: 0.9)

  • --p-thresh <float>: Ignore stutter alleles expected to have less than this frequency (Default: 0.001)

  • --seed <int>: Set the random seed for reproducibility.

Sequencing options

You can specify the following sequencing options:

  • --single: Simulate single end reads. By default, paired end reads are output.

  • --coverage <int>: Target coverage to simulate (ART parameter -f). (Default: 1000)

  • --read-length <int>: Read length (ART parameter -l). (Default: 100)

  • --insert <float>: Mean fragment length for paired end reads (ART parameter -m). (Default: 350)

  • --sd <float>: Standard deviation of the fragment length for paired end reads (ART parameter -s). (Default: 50)

  • --window <int>: Window size (bp) around the target TR to simulate. (Default: 1000)

Additional options

Users can optionally specify the following:

  • --tmpdir <str>: store temporary files within this directory. Otherwise, a folder in $TMPDIR is created. Temporary files include “dummy” fasta files with different TR alleles and intermediate fastqs generated by ART.

  • --art <str>: path to the art_illumina executable.

Known issues

  • Currently requires repeat boundaries to be exact perfect copies. Could instead infer the rotation of the repeat unit to be more robust to this.

Example Commands

Below is a simTR example using an example fasta file, which can be found in the directory example-files. Example command:

# Example command running simTR for a dummy dataset with dummy allele bed file and other input parameters
mkdir test-simtr
simTR \
   --coords chr11_CBL:5001-5033 \
   --ref example-files/CBL.fa \
   --tmpdir test-simtr \
   --repeat-unit CGG \
   --art art_illumina \
   --outprefix test-simtr \
   --coverage 1000 \
   --read-length 150 \
   --seed 12345 \
   --u 0.02 --d 0.02 --rho 0.9

This command should output the files test-simtr_1.fq and test-simtr_2.fq.

Citations

A preprint describing simTR and prancSTR is currently being prepared.