trtools.utils.tr_harmonizer module
Utilities for harmonizing tandem repeat VCF records.
Handles VCFs generated by various TR genotyping tools
- trtools.utils.tr_harmonizer.HarmonizeRecord(vcftype, vcfrecord)
Create a standardized TRRecord object out of a cyvcf2.Variant object of possibly unknown type.
- trtools.utils.tr_harmonizer.HasLengthAltGenotypes(vcftype)
Determine if the alt alleles of variants are given by length.
If True, then alt alleles for all variants produced by this caller are specified by length and not by sequence. Sequences are fabricated according to
trtools.utils.utils.FabricateAllele()
.- Returns
Indicates whether alt alleles are specified by length
- Return type
bool
- Parameters
vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) –
- trtools.utils.tr_harmonizer.HasLengthRefGenotype(vcftype)
Determine if the reference alleles of variants are given by length.
If True, then reference alleles for all variants produced by this caller are specified by length and not by sequence. Sequences are fabricated according to
trtools.utils.utils.FabricateAllele()
.If True, then
HasLengthAltGenotypes()
will also be true- Returns
Indicates whether ref alleles are specified by length
- Return type
bool
- Parameters
vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) –
- trtools.utils.tr_harmonizer.InferVCFType(vcffile, vcftype='auto')
Infer the genotyping tool used to create the VCF.
When we can, infer from header metadata. Otherwise, try to infer the type from the ALT field.
- Parameters
vcffile (cyvcf2.cyvcf2.VCF) – The input VCF file
vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) – If it is unclear which of a few VCF callers produced the underlying VCF (because the output markings of those VCF callers are similar) this string can be supplied by the user to choose from among those callers.
- Returns
vcftype – Type of the VCF file
- Return type
- Raises
TypeError – If this VCF does not look like it was produced by any supported TR caller, or if it looks like it could have been produced by more than one supported TR caller and vcftype == ‘auto’, or if vcftype doesn’t match any of the callers that could have produced this VCF.
- trtools.utils.tr_harmonizer.IsBeagleVCF(vcffile)
Is this a VCF produced by running the Beagle software to impute STRs from a panel generated by an TR genotyper, or does it consist of STRs directly called by a TR genotyper?
- Parameters
vcffile (cyvcf2.cyvcf2.VCF) – The input VCF file
- Returns
Whether this is a VCF produced by Beagle
- Return type
bool
- trtools.utils.tr_harmonizer.MayHaveImpureRepeats(vcftype)
Determine if any of the alleles in this VCF may contain impure repeats.
Specifically, impure repeats include:
impurities in the underlying sequence (e.g. AAATAAAAA)
partial repeats (e.g. AATAATAATAA)
This is a guarantee that the caller attempted to call impure repeats, not that it found any. It also does not guarantee that all impurities present were identified and called.
- Returns
Indicates whether repeat sequences may be impure
- Return type
bool
- Parameters
vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) –
- class trtools.utils.tr_harmonizer.TRDosageTypes(value)
Bases:
enum.Enum
Ways to compute TR dosages.
- beagleap = 'beagleap'
- beagleap_norm = 'beagleap_norm'
- bestguess = 'bestguess'
- bestguess_norm = 'bestguess_norm'
- class trtools.utils.tr_harmonizer.TRRecord(vcfrecord, ref_allele, alt_alleles, motif, record_id, quality_field, *, harmonized_pos=None, full_alleles=None, ref_allele_length=None, alt_allele_lengths=None, quality_score_transform=None)
Bases:
object
A representation of a VCF record specialized for TR fields.
Allows downstream functions to be agnostic to the genotyping tool used to create the record.
- Parameters
vcfrecord (cyvcf2.cyvcf2.Variant) – Cyvcf2 Variant object with the underlying data
ref_allele (Optional[str]) – Reference allele string
alt_alleles (Optional[List[str]]) – List of alternate allele strings
motif (str) – Repeat unit
record_id (str) – Identifier for the record
quality_field (Optional[str]) – the name of the FORMAT field which contains the quality score for each call for this record
harmonized_pos (Optional[int]) –
full_alleles (Optional[Tuple[str, List[str]]]) –
ref_allele_length (Optional[float]) –
alt_allele_lengths (Optional[List[float]]) –
quality_score_transform (Optional[Callable[[...], float]]) –
- vcfrecord
The cyvcf2 Variant object used to init this record.
- Type
cyvcf2.Variant
- ref_allele
Reference allele sequences, fabricated if necessary. Gets converted to uppercase e.g. ACGACGACG
- Type
str
- alt_alleles
List of alternate allele sequences, fabricated if necessary
- Type
List[str]
- motif
Repeat unit
- Type
str
- record_id
Identifier for the record
- Type
str
- chrom
The chromosome this locus is in
- Type
str
- pos
The bp along the chromosome that this locus is at (ignoring flanking base pairs/full alleles)
- Type
int
- end_pos
Position of the last bp of ref allele (ignoring flanking base pairs/full alleles)
- full_alleles_pos
Position of the first bp of the full ref allele (including the flanking base pairs)
- full_alleles_end_pos
Position of the last bp of the full ref allele (including the flanking base pairs)
- info
The dictionary of INFO fields at this locus
- Type
Dict[str, Any]
- format
The dictionary of FORMAT fields at this locus. Numeric format fields are 2D numpy arrays with rows corresponding to samples (normally 1 column, but if there are multiple numbers then more than one column) String format fields are 1D numpy arrays with entries corresponding to samples
- Type
Dict[str, np.ndarray]
- Parameters
harmonized_pos (Optional[int]) – If this record has flanking base pairs before the repeat, set this to note at which bp the repeat begins
full_alleles (Optional[Tuple[str, List[str]]]) – A tuple of string genotypes (ref_allele, [alt_alleles]) where each allele may contain any number of flanking basepairs in addition to containing the tandem repeat. If set, these can be accessed through
GetFullStringGenotypes()
If the alt alleles have differently sized flanks than the ref allele then those alt alleles will be improperly trimmed.alt_allele_lengths (Optional[List[float]]) –
The lengths of each of the alt alleles, in order. Thus is measured in number of copies of repeat unit, NOT the allele length in base pairs.
Should be passed to the constructor when only the lengths of the alt alleles were measured and not the sequences. If sequences are passed to the constructor then this is set automatically.
If this is passed, the alt_alleles parameter to the constructor must be set to None and the alt_alleles attribute of the record will be set to fabricated alleles (see
trtools.utils.utils.FabricateAllele()
)ref_allele_length (Optional[float]) – like alt_allele_lengths, but for the reference allele. If this is passed, alt_allele_lengths must also be passed
min_allele_length – Minimum allele length from the reference and alternate alleles
max_allele_length – Maximum allele length from the reference and alternate alleles
quality_score_transform (Optional[Callable[[...], float]]) – A function which turns the quality_field value into a float score. When None, the quality_field values are assumed to already be floats
vcfrecord (cyvcf2.cyvcf2.Variant) –
ref_allele (Optional[str]) –
alt_alleles (Optional[List[str]]) –
motif (str) –
record_id (str) –
quality_field (Optional[str]) –
Notes
Alleles are stored as upper case strings with all the repeats written out. Alleles may contain partial repeat copies or impurities. This class will attempt to make sure alleles do not contain any extra base pairs to either side of the repeat. If you wish to have those base pairs, use the ‘Full’ methods
- GetAlleleCounts(sample_index=None, *, uselength=True, index=False, fullgenotypes=False)
Get the counts of each allele for a record.
This does not return counts of no calls as it is not clear how many ‘no call alleles’ would be present per no call
Alleles that are not called in any sample are not present in the returned dictionary
- Parameters
sample_index (Optional[Any]) – Used to index the numpy array of samples. So can be a numpy array of sample indicies, or a bool array with length of the number of samples, etc. If None, then all samples are included.
uselength (bool, optional) – If True, represent alleles a lengths else represent as strings
index (bool) – If True, represent alleles as indexes (0 = ref, 1 = first_alt, etc.) instead of sequences or lengths
fullgenotypes (bool) – If True, include flanking basepairs in allele representations Only makes sense when expliictly stating uselength=False. Cannot be combined with index.
- Returns
allele_counts – Gives the count of each allele. The type of allele representation is determined by the uselength, index and fullgenotypes optional parameters.
- Return type
Dict[Any, int]
- GetAlleleFreqs(sample_index=None, *, uselength=True, index=False, fullgenotypes=False)
Get the frequencies of each allele for a record.
- Parameters
sample_index (Optional[Any]) – Used to index the numpy array of samples. So can be a numpy array of sample indicies, or a bool array with length of the number of samples, etc. If None, then all samples are included.
uselength (bool) – If True, represent alleles a lengths else represent as strings
index (bool) – If True, represent alleles as indexes (0 = ref, 1 = first_alt, etc.) instead of sequences or lengths
fullgenotypes (bool) – If True, include flanking basepairs in allele representations. Only makes sense when expliictly stating uselength=False. Cannot be combined with index.
- Returns
allele_freqs – Gives the frequency of each allele among called samples The type of allele representation is determined by the uselength, index and fullgenotypes optional parameters.
- Return type
Dict[Any, float]
- GetCallRate(strict=True)
Return the call rate at this locus.
- Parameters
strict (bool) – By default genotypes such as ‘1/.’ are considered not called because at least one of the haplotypes present is not called. Set strict = False to mark these as being called. Note: genotypes having lesser ploidy will not be marked as no calls even when strict = True (e.g. if some samples have tetraploid genotypes at this locus, a genotype of ‘1/2/2’ will be marked as called even though it is triploid)
- Returns
The fraction of the samples at this locus that have been
called. If there are no samples in the vcf this record comes from
then return np.nan instead
- Return type
float
- GetCalledSamples(strict=True)
Get an array listing which samples have been called at this locus.
- Parameters
strict (bool) – By default genotypes such as ‘1/.’ are considered not called because at least one of the haplotypes present is not called. Set strict = False to mark these as being called. Note: genotypes having lesser ploidy will not be marked as no calls even when strict = True (e.g. if some samples have tetraploid genotypes at this locus, a genotype of ‘1/2/2’ will be marked as called even though it is triploid)
- Returns
A bool array of length equal to the number of samples, where true indicates a sample has been called and false indicates otherwise. If there are no samples in the vcf this record comes from then return None instead
- Return type
Optional[np.ndarray]
- GetDosages(dosagetype=TRDosageTypes.bestguess, strict=True)
Get an array of genotype dosages for each sample.
Multiple strategies are used to compute dosages:
bestguess - Sum of the length (in num. rpt units) of alleles
- beagleap - For each haplotype, dosage is computed as:
sum_a len(a) x p(a) where len(a) is the length (in rpt. units) of each allele a, and p(a) is the allele probability (from Beagle AP1/AP2 fields) The total dosage is this value summed across the two haplotypes
bestguess_norm - Same as bestguess but scaled to be between 0 and 2
beagleap_norm - Same as beagleap but scaled to be between 0 and 2
Note: normalized dosages currently not supported for haploid calls Those are set to np.nan in the output dosages if using a _norm option
- Parameters
dosagetype (Enum) – Which TRDosageType to compute. Default bestguess
strict (bool) – If False, output a warning but do not die on errors validating AP field If errors are encountered, return nan dosage values
- Returns
dosages – A numpy array of dosages, of type float If no samples are in the array, return None
- Return type
npt.NDArray[np.float32]
- GetFullStringGenotypes()
Get an array of full string genotypes for each sample. See
GetStringGenotypes()
for details and limitations of string genotypes.If the sample does not have full genotypes that are distinct from its regular string genotypes (because no flanking base pairs were called) then the regular string genotypes are returned.
- Returns
The numpy array described above, of type ‘<UN’ where ‘N’ is the max allele length. If there are no samples in the vcf this record comes from then return None instead
- Return type
Optional[np.ndarray]
- GetGenotypeCounts(sample_index=None, uselength=True, index=False, fullgenotypes=False, include_nocalls=False)
Get the counts of each genotype for a record.
For samples with a lower ploidy than the max ploidy among all samples, the -2 placeholder haplotypes are sorted to the beginning of the call (e.g. (-2, 5) instead of (5, -2))
This currently returns unphased genotypes (with no phasing column), it could be extend to have an option to respect phasing
- Parameters
sample_index (Optional[Any]) – Used to index the numpy array of samples. So can be a numpy array of sample indicies, or a bool array with length of the number of samples, etc. If None, then all samples are included.
uselength (bool) – If True, represent alleles as lengths else represent as strings
index (bool) – If True, represent alleles as indexes (0 = ref, 1 = first_alt, etc.) instead of sequences or lengths
fullgenotypes (bool) – If True, include flanking basepairs in allele representations. Only makes sense when expliictly stating uselength=False. Cannot be combined with index.
include_nocalls (bool) – If False, all genotypes with one or more uncalled haplotypes (-1 or ‘.’) are excluded from the returned dictionary, they are included if True. Genotypes with lower ploidy (-2 or ‘,’) are included regardless.
- Returns
genotype_counts – Gives the count of each genotype. Genotypes are represented as tuples of alleles, where the type of allele representation is determined by the uselength, index and fullgenotypes optional parameters.
- Return type
Dict[tuple, int]
- GetGenotypeIndicies()
Get an array of genotype indicies across all samples.
A genotype index is a number 0, 1, 2 … where 0 corresponds to the reference allele, 1 to the first alt allele, 2 to the second, etc. The array is an array of ints with one row per sample. The number of columns is the maximum ploidy of any sample (normally 2) plus 1 for phasing. All but the final column represent the index of the genotypes of each call. The final column has values 0 for unphased sampels or 1 for phased. So a sample with gt ‘0|2’ would be represented by the row [0, 2, 1] and a sample with gt ‘3/0’ would be represented by the row [3, 0, 0]. Uncalled haplotypes (represented by ‘.’ in the VCF) are represented by ‘-1’ genotypes. If the sample has fewer haplotypes than the maximum ploidy of all samples at this locus, then the row is padded with -2s, so a haploid sample with gt ‘1’ where other samples are diploid would be represented by the row [1, -2, 0]. If all the genotype columns for a sample are negative then the sample is a no call. Note: the value of the phasing column is unspecified for haploid or no-call samples.
- Returns
The numpy array described above, of type int. If there are no samples in the vcf this record comes from then return None instead
- Return type
Optional[np.ndarray]
- GetLengthGenotypes()
Get an array of length genotypes for each sample.
Represents the sample’s genotype in terms of the number of repeats of the motif in each allele. Returns a pair of floats - alleles including partial repeats or other impurities may have noninteger lengths.
The array is as described in
GetGenotypeIndicies()
except that indicies are replaced by their length genotypes. -1s, -2s and the phasing bits are not modified.For records with both regular and full sequences (those with flanking bps), this returns the length of the regular sequences
- Returns
The numpy array described above, of type float If there are no samples in the vcf this record comes from then return None instead
- Return type
Optional[np.ndarray]
- GetMaxAllele(sample_index=None)
Get the maximum allele length called in a record.
Represents lengths in terms of the number of repeats of the motif. The longest allele may have a noninteger length if it includes partial repeats or other impurities.
For records with both regular and full sequences (those with flanking bps), this returns the length of the regular sequences
- Parameters
sample_index (Optional[Any]) – Used to index the numpy array of samples. So can be a numpy array of sample indicies, or a bool array with length of the number of samples, etc. If None, then all samples are included.
- Returns
maxallele – The maximum allele length called (in number of repeat units), or nan if no alleles called
- Return type
float
- GetMaxPloidy()
Return the maximum ploidy of any sample at this locus.
All genotypes will be a tuple of that many haplotypes, where samples with a smaller ploidy than that will have haplotypes at the end of the tuple set to ‘,’ (for string genotypes) or -2 (for index or length genotypes)
- Return type
int
- GetNumSamples()
Return the number of samples at this locus (called or not).
Same as the number of samples in the overall vcf
- Return type
int
- GetQualityScores()
Get the quality scores of the calls for each sample.
The meaning and reliability of these scores is genotyper dependent, see the doc section Quality Scores.
- Returns
An array of quality score floats, one row per sample Samples which were not called have the value np.nan
- Return type
np.ndarray
- GetSamplePloidies()
Get an array listing the ploidies of each sample
- Returns
An array of positive ints with length equal to the number of samples where each entry denotes the number of genotypes for each sample at this locus (including no calls) If there are no samples in the vcf this record comes from then return None instead
- Return type
Optional[np.ndarray]
- GetStringGenotypes()
Get an array of string genotypes for each sample.
The array is as described in
GetGenotypeIndicies()
except that the indicies are replaced by their corresponding sequences, -1 indicies (nocalls) are replaced by ‘.’, -2 indicies (not present due to smaller ploidy) are replaced by ‘,’, and the phasing bits (0 or 1) are replaced by the strings ‘0’ or ‘1’.Will not include flanking base pairs. To get genotypes that include flanking base pairs (for callers that call those), use
GetFullStringGenotypes()
. For callers that include flanking base pairs it is possible that some of the alleles in the regular string genotypes (with the flanks stripped) will be identical. In this case, you may useUniqueStringGenotypeMapping()
to get a canonical unique subset of indicies which represent all possible alleles.Note that some TR callers will only call allele lengths, not allele sequences. In such a case, this method will return a fabricated sequence based on the called length (see
trtools.utils.utils.FabricateAllele()
) and a warning will be raised. This may not be intended - useGetLengthGenotypes()
for a fully caller agnostic way of handling genotypes.This method is inefficient for many samples, consider either using length genotypes (
GetLengthGenotypes()
), or using genotype indicies (GetGenotypeIndicies()
) and accessing string genotypes as needed through the fields ref_allele and alt_alleles, instead.- Returns
The numpy array described above, of type ‘<UN’ where ‘N’ is the max allele length. If there are no samples in the vcf this record comes from then return None instead
- Return type
Optional[np.ndarray]
- HasFabricatedAltAlleles()
Determine if this record has fabricated alt alleles.
- Returns
True iff alt_allele_lengths was passed to this record’s constructor.
- Return type
bool
- HasFabricatedRefAllele()
Determine if this record has a fabricated ref allels.
- Returns
True iff ref_allele_length was passed to this record’s constructor.
- Return type
bool
- HasFullStringGenotypes()
Determine if this record has full string genotypes.
- Returns
True iff
GetFullStringGenotypes()
will return a different value thanGetStringGenotypes()
for some alleles.- Return type
bool
- HasQualityScores()
Does this TRRecord contain quality scores for each of its calls? If present, the meaning and reliability of these scores is genotyper dependent, see the doc section Quality Scores.
- Returns
Whether or not a FORMAT field that could be interpreted as a quality score has been identified
- Return type
boolean
- UniqueLengthGenotypeMapping()
Get a mapping whose values are unique string genotype indicies.
- Returns
genotypeMapping – A mapping allele idx -> allele idx whose keys are all allele indicies and whose values are a subset of indicies which represents all the unique length alleles for this variant. For variants where multiple alleles have the same length, all will map to a single index from among those alleles.
- Return type
Dict[int, int]
- UniqueLengthGenotypes()
Find allele indicies corresponding to the unique length alleles.
Equivalent to calling
set(UniqueLengthGenotypeMapping().values())
- Returns
The indicies of the unique string alleles
- Return type
Set[int]
- UniqueStringGenotypeMapping()
Get a mapping whose values are unique string genotype indicies.
- Returns
A mapping allele idx -> allele idx whose keys are all allele indicies and whose values are a subset of indicies which represents all the unique regular string alleles for this variant. For almost all records, this will be a mapping from each index to itself. For some records with full string genotypes that include flanking base pairs, some of the regular string alleles will be identical. In this case, only one of those allele’s indicies will be in the set of values of this dictionary, and all identical alleles will map to that one index.
- Return type
Dict[int, int]
- UniqueStringGenotypes()
Find allele indicies corresponding to the unique alleles.
Equivalent to calling
set(UniqueStringGenotypeMapping().values())
- Returns
The indicies of the unique string alleles
- Return type
Set[int]
- _CheckRecord()
Check that this record is properly constructed.
Checks that the same number of alt alleles were specified as in the underlying record and that the full_alleles, if supplied, contain their corresponding standard alleles
Raises an error if a check fails
- class trtools.utils.tr_harmonizer.TRRecordHarmonizer(vcffile, vcftype='auto')
Bases:
object
Class producing a uniform interface for accessing TR VCF records.
Produces the same output interface regardless of the tool that created the input VCF.
The main purpose of this class is to infer which tool a VCF came from, and appropriately convert its records to TRRecord objects.
This class provides the object oriented paradigm for iterating through a TR vcf. If you wish to use the functional paradigm and provide the cyvcf2.Variant objects yourself, use the top-level functions in this module.
- Parameters
vcffile (cyvcf2.VCF instance) –
vcftype ({'auto', 'gangstr', 'advntr', 'hipstr', 'eh', 'popstr'}, optional) – Type of the VCF file. Default=’auto’. If vcftype==’auto’, attempts to infer the type.
- vcffile
- Type
cyvcf2.VCF instance
- vcftype
Type of the VCF file. Must be included in VcfTypes
- Type
enum
- Raises
TypeError – If the type of the VCF cannot be properly inferred. See
InferVCFType()
for more details.- Parameters
vcffile (cyvcf2.cyvcf2.VCF) –
vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) –
- HasLengthAltGenotypes()
Determine if the alt alleles of variants are given by length.
See also
tr_harmonizer.HasLengthAltGenotypes
- Return type
bool
- HasLengthRefGenotype()
Determine if the reference alleles of variants are given by length.
See also
tr_harmonizer.HasLengthRefGenotype
- Return type
bool
- HasQualityScore()
Does this VCF contain quality scores for each of its calls? If present, the meaning and reliability of these scores is genotyper dependent, see the doc section Quality Scores.
- Returns
Whether or not a FORMAT field that could be interpreted as a quality score has been identified
- Return type
bool
- IsBeagleVCF()
Is this a VCF produced by running the Beagle software to impute STRs from a panel generated by an TR genotyper?
See also
tr_harmonizer.IsBeagleVCF
- Return type
bool
- MayHaveImpureRepeats()
Determine if any of the alleles in this VCF may contain impure repeats.
See also
tr_harmonizer.MayHaveImpureRepeats
- Return type
bool
- class trtools.utils.tr_harmonizer.VcfTypes(value)
Bases:
enum.Enum
The different tr callers that tr_harmonizer supports.
- advntr = 'advntr'
- eh = 'eh'
- gangstr = 'gangstr'
- hipstr = 'hipstr'
- longtr = 'longtr'
- popstr = 'popstr'
- class trtools.utils.tr_harmonizer._Cyvcf2FormatDict(record)
Bases:
object
Provide an immutable dict-like interface for accessing format fields from a cyvcf2 record. To iterate over this dict, use
iter(this)
orthis.keys()
.- Parameters
record (cyvcf2.cyvcf2.Variant) –
- get(key)
- Parameters
key (str) –
- keys()
- trtools.utils.tr_harmonizer._HarmonizeAdVNTRRecord(vcfrecord)
Turn a cyvcf2.Variant with adVNTR content into a TRRecord.
- Parameters
vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object
- Returns
- Return type
- trtools.utils.tr_harmonizer._HarmonizeEHRecord(vcfrecord)
Turn a cyvcf2.Variant with EH content into a TRRecord.
- Parameters
vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object
- Returns
- Return type
- trtools.utils.tr_harmonizer._HarmonizeGangSTRRecord(vcfrecord)
Turn a cyvcf2.Variant with GangSTR content into a TRRecord.
- Parameters
vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object
- Returns
- Return type
- trtools.utils.tr_harmonizer._HarmonizeHipSTRRecord(vcfrecord)
Turn a cyvcf2.Variant with HipSTR content into a TRRecord.
- Parameters
vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object
- Returns
- Return type