trtools.utils.utils

Util functions for calculating summary STR statistics and performing basic string operations on STR alleles.

class trtools.utils.utils.ArgumentDefaultsHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)

Bases: argparse.HelpFormatter

Build a custom argument parser that works just like argparse.ArgumentDefaultsHelpFormatter except that it doesn’t display None defaults. Everything below is copied from the python library source except for that.

Help message formatter which adds default values to argument help.

Only the name of this class is considered a public API. All the methods provided by the class are considered an implementation detail.

trtools.utils.utils.FabricateAllele(motif, length)

Fabricate an allele with the given motif and length.

Parameters
  • motif (str) – the motif to build the allele from

  • length (float) – the number of times to copy the motif (noninteger implies partial repeats). This does NOT specify the desired length of the returned string.

Returns

the fabricated allele string

Return type

str

Notes

The allele is fabricated with the given motif orientation (e.g. motif = ‘ACG’ could produce ‘ACGACGACG’ but not ‘CGACGACGA’). Fabricated alleles will contain no impurities nor flanking base pairs. In the case of length being a noninteger float (because of partial repeats) and where it is unclear if the last nucleotide should be included in the fabricated repeat or not due to imprecision in the length float, the last nucleotide will always be left off (the length of the returned string will always be rounded down).

trtools.utils.utils.GetCanonicalMotif(repseq)

Get canonical STR sequence, considering both strands

The canonical sequence is the first alphabetically out of all possible rotations on + and - strands of the sequence. e.g. “TG” canonical sequence is “AC”.

Parameters

repseq (str) – String giving a STR motif (repeat unit sequence)

Returns

canon – The canonical sequence of the STR motif

Return type

str

Examples

>>> GetCanonicalMotif("TG")
'AC'
trtools.utils.utils.GetCanonicalOneStrand(repseq)

Get canonical STR sequence, considering one strand

The canonical sequence is the first alphabetically out of all possible rotations. e.g. CAG -> AGC.

Parameters

repseq (str) – String giving a STR motif (repeat unit sequence)

Returns

canon – The canonical sequence of the STR motif

Return type

str

Examples

>>> GetCanonicalOneStrand("CAG")
'AGC'
trtools.utils.utils.GetContigs(vcf)

Returns the contig IDs in the VCF.

Parameters

vcf (cyvcf2.cyvcf2.VCF) – The vcf to get contigs from

Returns

A list of contig IDs

Return type

List[str]

trtools.utils.utils.GetEntropy(allele_freqs)

Compute the (bit) entropy of a locus

Entropy is defined as the entropy <https://en.wikipedia.org/wiki/Information_content>_ of the distribution of allele frequencies.

Parameters

allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles are typically strings (sequences) or floats (num. repeats)

Returns

entropy – The entropy of the locus. If the allele frequencies dictionary is invalid, return np.nan

Return type

float

Notes

Entropy is computed as:

\[E = -\sum_{i=1..n} -p_i*log_2(p_i)\]

where p_i is the frequency of allele i and n is the number of alleles.

Examples

>>> GetEntropy({0:0.5, 1:0.5})
1.0
trtools.utils.utils.GetHardyWeinbergBinomialTest(allele_freqs, genotype_counts)

Compute Hardy Weinberg p-value

Tests whether the number of observed heterozygous vs. homozygous individuals is different than expected under Hardy Weinberg Equilibrium given the observed allele frequencies, based on a binomial test.

Parameters
  • allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles are typically strings (sequences) or floats (num. repeats)

  • genotype_counts (dict of (object, object): int) – Dictionary of counts of each genotype. Genotypes are defined as tuples of alleles. Alleles must be the same as those given in allele_freqs

Returns

p-value – The two-sided p-value returned by a binomial test (scipy.stats.binom_test) If the allele frequencies dictionary is invalid, return np.nan If genotype alleles not included in frequencies dictionary, return np.nan

Return type

float

trtools.utils.utils.GetHeterozygosity(allele_freqs)

Compute heterozygosity of a locus

Heterozygosity is defined as the probability that two randomly drawn alleles are different.

Parameters

allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles are typically strings (sequences) or floats (num. repeats)

Returns

heterozygosity – The heterozygosity of the locus. If the allele frequencies dictionary is invalid, return np.nan

Return type

float

Notes

Heterzygosity is computed as:

\[H = 1-\sum_{i=1..n} p_i^2\]

where p_i is the frequency of allele i and n is the number of alleles.

Examples

>>> GetHeterozygosity({0:0.5, 1:0.5})
0.5
trtools.utils.utils.GetHomopolymerRun(seq)

Compute the maximum homopolymer run length in a sequence

Parameters

seq (str) – String giving a sequence of nucleotides

Returns

runlength – The length of the longest homopolymer run

Return type

int

Examples

>>> GetHomopolymerRun("AATAAAATAAAAAT")
5
trtools.utils.utils.GetMean(allele_freqs)

Compute the mean allele length

Parameters

allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles must be given in lengths (numbers, not strings)

Returns

mean – Return mean if allele frequencies dictionary is valid

Return type

float

Examples

>>> GetMean({0:0.5, 1:0.5})
0.5
trtools.utils.utils.GetMode(allele_freqs)

Compute the mode allele length.

If more than one allele has the maximum number of copies out of all alleles, choose one at random (but reproducibly)

Parameters

allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles must be given in lengths (numbers, not strings)

Returns

mode – Return mode if allele frequencies dictionary is valid

Return type

float

Examples

>>> GetMode({0:0.1, 1:0.9})
1
trtools.utils.utils.GetVariance(allele_freqs)

Compute the variance of the allele lengths

Parameters

allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles must be given in lengths (numbers, not strings)

Returns

variance – Return variance if allele frequencies dictionary is valid np.nan otherwise.

Return type

float

Examples

>>> GetVariance({0:1})
0
trtools.utils.utils.InferRepeatSequence(seq, period)

TODO change to dynamic programming approach Infer the repeated sequence in a string

Parameters
  • seq (str) – A string of nucleotides

  • period (int) – Length of the repeat unit

Returns

repseq – The inferred repeat unit (motif). If the input sequence is smaller than the period, repseq consists of all N’s.

Return type

str

Examples

>>> InferRepeatSequence('ATATATAT', 2)
'AT'
trtools.utils.utils.LoadReaders(vcf_locs, checkgz=True)

Return a list of VCF readers

Parameters
  • vcf_locs (List[str]) – A list of vcf locations

  • checkgz (bool) – Check if each VCF file is gzipped and indexed

Returns

readers – A list of VCF readers, or None if any of them could not be found. If checkgz, then also return None if any of the VCF readers were not gzipped and tabix indexed.

Return type

Optional[List[cvycf2.VCF]]

trtools.utils.utils.LoadSingleReader(vcf_loc, checkgz=True)

Return a VCF reader

Parameters
  • vcf_loc (str) – The location of the VCF file to read

  • checkgz (bool) – Check whether VCF file is gzipped and indexed, if not return None

Returns

reader – The cyvcf2.VCF instance, or None if the VCF is not present or could not be opened

Return type

Optional[cyvcf2.VCF]

trtools.utils.utils.LongestPerfectRepeat(seq, motif, check_reverse=True)

Determine the length (bp) of the longest perfect repeat stretch

Credit: function originally written by Helyaneh Ziaei-Jam

Parameters
  • seq (str) – Repeat region sequence

  • motif (str) – Repeat unit sequence

  • check_reverse (bool (optional)) – If False, don’t check reverse complement

Returns

max_match – Number of bp of longest perfect stretch

Return type

int

trtools.utils.utils.ReverseComplement(seq)

Get reverse complement of a sequence.

Converts everything to uppsercase.

Parameters

seq (str) – String of nucleotides.

Returns

revcompseq – Reverse complement of the input sequence.

Return type

str

Examples

>>> ReverseComplement("AGGCT")
'AGCCT'
trtools.utils.utils.ValidateAlleleFreqs(allele_freqs)

Check that the allele frequency distribution is valid.

Allele frequencies must sum to 1.

Parameters

allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles are typically strings (sequences) or floats (num. repeats)

Returns

is_valid – Return True if the distribution is valid, else False

Return type

bool

Examples

>>> ValidateAlleleFreqs({0:0.5, 1:0.5})
True