trtools.utils.utils
Util functions for calculating summary STR statistics and performing basic string operations on STR alleles.
- class trtools.utils.utils.ArgumentDefaultsHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)
Bases:
argparse.HelpFormatter
Build a custom argument parser that works just like argparse.ArgumentDefaultsHelpFormatter except that it doesn’t display None defaults. Everything below is copied from the python library source except for that.
Help message formatter which adds default values to argument help.
Only the name of this class is considered a public API. All the methods provided by the class are considered an implementation detail.
- trtools.utils.utils.FabricateAllele(motif, length)
Fabricate an allele with the given motif and length.
- Parameters
motif (str) – the motif to build the allele from
length (float) – the number of times to copy the motif (noninteger implies partial repeats). This does NOT specify the desired length of the returned string.
- Returns
the fabricated allele string
- Return type
str
Notes
The allele is fabricated with the given motif orientation (e.g. motif = ‘ACG’ could produce ‘ACGACGACG’ but not ‘CGACGACGA’). Fabricated alleles will contain no impurities nor flanking base pairs. In the case of length being a noninteger float (because of partial repeats) and where it is unclear if the last nucleotide should be included in the fabricated repeat or not due to imprecision in the length float, the last nucleotide will always be left off (the length of the returned string will always be rounded down).
- trtools.utils.utils.GetCanonicalMotif(repseq)
Get canonical STR sequence, considering both strands
The canonical sequence is the first alphabetically out of all possible rotations on + and - strands of the sequence. e.g. “TG” canonical sequence is “AC”.
- Parameters
repseq (str) – String giving a STR motif (repeat unit sequence)
- Returns
canon – The canonical sequence of the STR motif
- Return type
str
Examples
>>> GetCanonicalMotif("TG") 'AC'
- trtools.utils.utils.GetCanonicalOneStrand(repseq)
Get canonical STR sequence, considering one strand
The canonical sequence is the first alphabetically out of all possible rotations. e.g. CAG -> AGC.
- Parameters
repseq (str) – String giving a STR motif (repeat unit sequence)
- Returns
canon – The canonical sequence of the STR motif
- Return type
str
Examples
>>> GetCanonicalOneStrand("CAG") 'AGC'
- trtools.utils.utils.GetContigs(vcf)
Returns the contig IDs in the VCF.
- Parameters
vcf (cyvcf2.cyvcf2.VCF) – The vcf to get contigs from
- Returns
A list of contig IDs
- Return type
List[str]
- trtools.utils.utils.GetEntropy(allele_freqs)
Compute the (bit) entropy of a locus
Entropy is defined as the entropy <https://en.wikipedia.org/wiki/Information_content>_ of the distribution of allele frequencies.
- Parameters
allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles are typically strings (sequences) or floats (num. repeats)
- Returns
entropy – The entropy of the locus. If the allele frequencies dictionary is invalid, return np.nan
- Return type
float
Notes
Entropy is computed as:
\[E = -\sum_{i=1..n} -p_i*log_2(p_i)\]where p_i is the frequency of allele i and n is the number of alleles.
Examples
>>> GetEntropy({0:0.5, 1:0.5}) 1.0
- trtools.utils.utils.GetHardyWeinbergBinomialTest(allele_freqs, genotype_counts)
Compute Hardy Weinberg p-value
Tests whether the number of observed heterozygous vs. homozygous individuals is different than expected under Hardy Weinberg Equilibrium given the observed allele frequencies, based on a binomial test.
- Parameters
allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles are typically strings (sequences) or floats (num. repeats)
genotype_counts (dict of (object, object): int) – Dictionary of counts of each genotype. Genotypes are defined as tuples of alleles. Alleles must be the same as those given in allele_freqs
- Returns
p-value – The two-sided p-value returned by a binomial test (scipy.stats.binom_test) If the allele frequencies dictionary is invalid, return np.nan If genotype alleles not included in frequencies dictionary, return np.nan
- Return type
float
- trtools.utils.utils.GetHeterozygosity(allele_freqs)
Compute heterozygosity of a locus
Heterozygosity is defined as the probability that two randomly drawn alleles are different.
- Parameters
allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles are typically strings (sequences) or floats (num. repeats)
- Returns
heterozygosity – The heterozygosity of the locus. If the allele frequencies dictionary is invalid, return np.nan
- Return type
float
Notes
Heterzygosity is computed as:
\[H = 1-\sum_{i=1..n} p_i^2\]where p_i is the frequency of allele i and n is the number of alleles.
Examples
>>> GetHeterozygosity({0:0.5, 1:0.5}) 0.5
- trtools.utils.utils.GetHomopolymerRun(seq)
Compute the maximum homopolymer run length in a sequence
- Parameters
seq (str) – String giving a sequence of nucleotides
- Returns
runlength – The length of the longest homopolymer run
- Return type
int
Examples
>>> GetHomopolymerRun("AATAAAATAAAAAT") 5
- trtools.utils.utils.GetMean(allele_freqs)
Compute the mean allele length
- Parameters
allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles must be given in lengths (numbers, not strings)
- Returns
mean – Return mean if allele frequencies dictionary is valid
- Return type
float
Examples
>>> GetMean({0:0.5, 1:0.5}) 0.5
- trtools.utils.utils.GetMode(allele_freqs)
Compute the mode allele length.
If more than one allele has the maximum number of copies out of all alleles, choose one at random (but reproducibly)
- Parameters
allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles must be given in lengths (numbers, not strings)
- Returns
mode – Return mode if allele frequencies dictionary is valid
- Return type
float
Examples
>>> GetMode({0:0.1, 1:0.9}) 1
- trtools.utils.utils.GetVariance(allele_freqs)
Compute the variance of the allele lengths
- Parameters
allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles must be given in lengths (numbers, not strings)
- Returns
variance – Return variance if allele frequencies dictionary is valid np.nan otherwise.
- Return type
float
Examples
>>> GetVariance({0:1}) 0
- trtools.utils.utils.InferRepeatSequence(seq, period)
TODO change to dynamic programming approach Infer the repeated sequence in a string
- Parameters
seq (str) – A string of nucleotides
period (int) – Length of the repeat unit
- Returns
repseq – The inferred repeat unit (motif). If the input sequence is smaller than the period, repseq consists of all N’s.
- Return type
str
Examples
>>> InferRepeatSequence('ATATATAT', 2) 'AT'
- trtools.utils.utils.LoadReaders(vcf_locs, checkgz=True)
Return a list of VCF readers
- Parameters
vcf_locs (List[str]) – A list of vcf locations
checkgz (bool) – Check if each VCF file is gzipped and indexed
- Returns
readers – A list of VCF readers, or None if any of them could not be found. If checkgz, then also return None if any of the VCF readers were not gzipped and tabix indexed.
- Return type
Optional[List[cvycf2.VCF]]
- trtools.utils.utils.LoadSingleReader(vcf_loc, checkgz=True)
Return a VCF reader
- Parameters
vcf_loc (str) – The location of the VCF file to read
checkgz (bool) – Check whether VCF file is gzipped and indexed, if not return None
- Returns
reader – The cyvcf2.VCF instance, or None if the VCF is not present or could not be opened
- Return type
Optional[cyvcf2.VCF]
- trtools.utils.utils.LongestPerfectRepeat(seq, motif, check_reverse=True)
Determine the length (bp) of the longest perfect repeat stretch
Credit: function originally written by Helyaneh Ziaei-Jam
- Parameters
seq (str) – Repeat region sequence
motif (str) – Repeat unit sequence
check_reverse (bool (optional)) – If False, don’t check reverse complement
- Returns
max_match – Number of bp of longest perfect stretch
- Return type
int
- trtools.utils.utils.ReverseComplement(seq)
Get reverse complement of a sequence.
Converts everything to uppsercase.
- Parameters
seq (str) – String of nucleotides.
- Returns
revcompseq – Reverse complement of the input sequence.
- Return type
str
Examples
>>> ReverseComplement("AGGCT") 'AGCCT'
- trtools.utils.utils.ValidateAlleleFreqs(allele_freqs)
Check that the allele frequency distribution is valid.
Allele frequencies must sum to 1.
- Parameters
allele_freqs (dict of object: float) – Dictionary of allele frequencies for each allele. Alleles are typically strings (sequences) or floats (num. repeats)
- Returns
is_valid – Return True if the distribution is valid, else False
- Return type
bool
Examples
>>> ValidateAlleleFreqs({0:0.5, 1:0.5}) True