trtools.utils.tr_harmonizer module

Utilities for harmonizing tandem repeat VCF records.

Handles VCFs generated by various TR genotyping tools

trtools.utils.tr_harmonizer.HarmonizeRecord(vcftype, vcfrecord)

Create a standardized TRRecord object out of a cyvcf2.Variant object of possibly unknown type.

Parameters
  • vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object

  • vcftype (VcfTypes) – Type of the VCF file

Returns

trrecord – A TRRecord object built out of the input record

Return type

TRRecord

trtools.utils.tr_harmonizer.HasLengthAltGenotypes(vcftype)

Determine if the alt alleles of variants are given by length.

If True, then alt alleles for all variants produced by this caller are specified by length and not by sequence. Sequences are fabricated according to trtools.utils.utils.FabricateAllele().

Returns

Indicates whether alt alleles are specified by length

Return type

bool

Parameters

vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) –

trtools.utils.tr_harmonizer.HasLengthRefGenotype(vcftype)

Determine if the reference alleles of variants are given by length.

If True, then reference alleles for all variants produced by this caller are specified by length and not by sequence. Sequences are fabricated according to trtools.utils.utils.FabricateAllele().

If True, then HasLengthAltGenotypes() will also be true

Returns

Indicates whether ref alleles are specified by length

Return type

bool

Parameters

vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) –

trtools.utils.tr_harmonizer.InferVCFType(vcffile, vcftype='auto')

Infer the genotyping tool used to create the VCF.

When we can, infer from header metadata. Otherwise, try to infer the type from the ALT field.

Parameters
  • vcffile (cyvcf2.cyvcf2.VCF) – The input VCF file

  • vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) – If it is unclear which of a few VCF callers produced the underlying VCF (because the output markings of those VCF callers are similar) this string can be supplied by the user to choose from among those callers.

Returns

vcftype – Type of the VCF file

Return type

VcfTypes

Raises

TypeError – If this VCF does not look like it was produced by any supported TR caller, or if it looks like it could have been produced by more than one supported TR caller and vcftype == ‘auto’, or if vcftype doesn’t match any of the callers that could have produced this VCF.

trtools.utils.tr_harmonizer.IsBeagleVCF(vcffile)

Is this a VCF produced by running the Beagle software to impute STRs from a panel generated by an TR genotyper, or does it consist of STRs directly called by a TR genotyper?

Parameters

vcffile (cyvcf2.cyvcf2.VCF) – The input VCF file

Returns

Whether this is a VCF produced by Beagle

Return type

bool

trtools.utils.tr_harmonizer.MayHaveImpureRepeats(vcftype)

Determine if any of the alleles in this VCF may contain impure repeats.

Specifically, impure repeats include:

  • impurities in the underlying sequence (e.g. AAATAAAAA)

  • partial repeats (e.g. AATAATAATAA)

This is a guarantee that the caller attempted to call impure repeats, not that it found any. It also does not guarantee that all impurities present were identified and called.

Returns

Indicates whether repeat sequences may be impure

Return type

bool

Parameters

vcftype (Union[str, trtools.utils.tr_harmonizer.VcfTypes]) –

class trtools.utils.tr_harmonizer.TRDosageTypes(value)

Bases: enum.Enum

Ways to compute TR dosages.

beagleap = 'beagleap'
beagleap_norm = 'beagleap_norm'
bestguess = 'bestguess'
bestguess_norm = 'bestguess_norm'
class trtools.utils.tr_harmonizer.TRRecord(vcfrecord, ref_allele, alt_alleles, motif, record_id, quality_field, *, harmonized_pos=None, full_alleles=None, ref_allele_length=None, alt_allele_lengths=None, quality_score_transform=None)

Bases: object

A representation of a VCF record specialized for TR fields.

Allows downstream functions to be agnostic to the genotyping tool used to create the record.

Parameters
  • vcfrecord (cyvcf2.cyvcf2.Variant) – Cyvcf2 Variant object with the underlying data

  • ref_allele (Optional[str]) – Reference allele string

  • alt_alleles (Optional[List[str]]) – List of alternate allele strings

  • motif (str) – Repeat unit

  • record_id (str) – Identifier for the record

  • quality_field (Optional[str]) – the name of the FORMAT field which contains the quality score for each call for this record

  • harmonized_pos (Optional[int]) –

  • full_alleles (Optional[Tuple[str, List[str]]]) –

  • ref_allele_length (Optional[float]) –

  • alt_allele_lengths (Optional[List[float]]) –

  • quality_score_transform (Optional[Callable[[...], float]]) –

vcfrecord

The cyvcf2 Variant object used to init this record.

Type

cyvcf2.Variant

ref_allele

Reference allele sequences, fabricated if necessary. Gets converted to uppercase e.g. ACGACGACG

Type

str

alt_alleles

List of alternate allele sequences, fabricated if necessary

Type

List[str]

motif

Repeat unit

Type

str

record_id

Identifier for the record

Type

str

chrom

The chromosome this locus is in

Type

str

pos

The bp along the chromosome that this locus is at (ignoring flanking base pairs/full alleles)

Type

int

end_pos

Position of the last bp of ref allele (ignoring flanking base pairs/full alleles)

full_alleles_pos

Position of the first bp of the full ref allele (including the flanking base pairs)

full_alleles_end_pos

Position of the last bp of the full ref allele (including the flanking base pairs)

info

The dictionary of INFO fields at this locus

Type

Dict[str, Any]

format

The dictionary of FORMAT fields at this locus. Numeric format fields are 2D numpy arrays with rows corresponding to samples (normally 1 column, but if there are multiple numbers then more than one column) String format fields are 1D numpy arrays with entries corresponding to samples

Type

Dict[str, np.ndarray]

Parameters
  • harmonized_pos (Optional[int]) – If this record has flanking base pairs before the repeat, set this to note at which bp the repeat begins

  • full_alleles (Optional[Tuple[str, List[str]]]) – A tuple of string genotypes (ref_allele, [alt_alleles]) where each allele may contain any number of flanking basepairs in addition to containing the tandem repeat. If set, these can be accessed through GetFullStringGenotypes() If the alt alleles have differently sized flanks than the ref allele then those alt alleles will be improperly trimmed.

  • alt_allele_lengths (Optional[List[float]]) –

    The lengths of each of the alt alleles, in order. Thus is measured in number of copies of repeat unit, NOT the allele length in base pairs.

    Should be passed to the constructor when only the lengths of the alt alleles were measured and not the sequences. If sequences are passed to the constructor then this is set automatically.

    If this is passed, the alt_alleles parameter to the constructor must be set to None and the alt_alleles attribute of the record will be set to fabricated alleles (see trtools.utils.utils.FabricateAllele())

  • ref_allele_length (Optional[float]) – like alt_allele_lengths, but for the reference allele. If this is passed, alt_allele_lengths must also be passed

  • min_allele_length – Minimum allele length from the reference and alternate alleles

  • max_allele_length – Maximum allele length from the reference and alternate alleles

  • quality_score_transform (Optional[Callable[[...], float]]) – A function which turns the quality_field value into a float score. When None, the quality_field values are assumed to already be floats

  • vcfrecord (cyvcf2.cyvcf2.Variant) –

  • ref_allele (Optional[str]) –

  • alt_alleles (Optional[List[str]]) –

  • motif (str) –

  • record_id (str) –

  • quality_field (Optional[str]) –

Notes

Alleles are stored as upper case strings with all the repeats written out. Alleles may contain partial repeat copies or impurities. This class will attempt to make sure alleles do not contain any extra base pairs to either side of the repeat. If you wish to have those base pairs, use the ‘Full’ methods

GetAlleleCounts(sample_index=None, *, uselength=True, index=False, fullgenotypes=False)

Get the counts of each allele for a record.

This does not return counts of no calls as it is not clear how many ‘no call alleles’ would be present per no call

Alleles that are not called in any sample are not present in the returned dictionary

Parameters
  • sample_index (Optional[Any]) – Used to index the numpy array of samples. So can be a numpy array of sample indicies, or a bool array with length of the number of samples, etc. If None, then all samples are included.

  • uselength (bool, optional) – If True, represent alleles a lengths else represent as strings

  • index (bool) – If True, represent alleles as indexes (0 = ref, 1 = first_alt, etc.) instead of sequences or lengths

  • fullgenotypes (bool) – If True, include flanking basepairs in allele representations Only makes sense when expliictly stating uselength=False. Cannot be combined with index.

Returns

allele_counts – Gives the count of each allele. The type of allele representation is determined by the uselength, index and fullgenotypes optional parameters.

Return type

Dict[Any, int]

GetAlleleFreqs(sample_index=None, *, uselength=True, index=False, fullgenotypes=False)

Get the frequencies of each allele for a record.

Parameters
  • sample_index (Optional[Any]) – Used to index the numpy array of samples. So can be a numpy array of sample indicies, or a bool array with length of the number of samples, etc. If None, then all samples are included.

  • uselength (bool) – If True, represent alleles a lengths else represent as strings

  • index (bool) – If True, represent alleles as indexes (0 = ref, 1 = first_alt, etc.) instead of sequences or lengths

  • fullgenotypes (bool) – If True, include flanking basepairs in allele representations. Only makes sense when expliictly stating uselength=False. Cannot be combined with index.

Returns

allele_freqs – Gives the frequency of each allele among called samples The type of allele representation is determined by the uselength, index and fullgenotypes optional parameters.

Return type

Dict[Any, float]

GetCallRate(strict=True)

Return the call rate at this locus.

Parameters

strict (bool) – By default genotypes such as ‘1/.’ are considered not called because at least one of the haplotypes present is not called. Set strict = False to mark these as being called. Note: genotypes having lesser ploidy will not be marked as no calls even when strict = True (e.g. if some samples have tetraploid genotypes at this locus, a genotype of ‘1/2/2’ will be marked as called even though it is triploid)

Returns

  • The fraction of the samples at this locus that have been

  • called. If there are no samples in the vcf this record comes from

  • then return np.nan instead

Return type

float

GetCalledSamples(strict=True)

Get an array listing which samples have been called at this locus.

Parameters

strict (bool) – By default genotypes such as ‘1/.’ are considered not called because at least one of the haplotypes present is not called. Set strict = False to mark these as being called. Note: genotypes having lesser ploidy will not be marked as no calls even when strict = True (e.g. if some samples have tetraploid genotypes at this locus, a genotype of ‘1/2/2’ will be marked as called even though it is triploid)

Returns

A bool array of length equal to the number of samples, where true indicates a sample has been called and false indicates otherwise. If there are no samples in the vcf this record comes from then return None instead

Return type

Optional[np.ndarray]

GetDosages(dosagetype=TRDosageTypes.bestguess, strict=True)

Get an array of genotype dosages for each sample.

Multiple strategies are used to compute dosages:

  • bestguess - Sum of the length (in num. rpt units) of alleles

  • beagleap - For each haplotype, dosage is computed as:

    sum_a len(a) x p(a) where len(a) is the length (in rpt. units) of each allele a, and p(a) is the allele probability (from Beagle AP1/AP2 fields) The total dosage is this value summed across the two haplotypes

  • bestguess_norm - Same as bestguess but scaled to be between 0 and 2

  • beagleap_norm - Same as beagleap but scaled to be between 0 and 2

Note: normalized dosages currently not supported for haploid calls Those are set to np.nan in the output dosages if using a _norm option

Parameters
  • dosagetype (Enum) – Which TRDosageType to compute. Default bestguess

  • strict (bool) – If False, output a warning but do not die on errors validating AP field If errors are encountered, return nan dosage values

Returns

dosages – A numpy array of dosages, of type float If no samples are in the array, return None

Return type

npt.NDArray[np.float32]

GetFullStringGenotypes()

Get an array of full string genotypes for each sample. See GetStringGenotypes() for details and limitations of string genotypes.

If the sample does not have full genotypes that are distinct from its regular string genotypes (because no flanking base pairs were called) then the regular string genotypes are returned.

Returns

The numpy array described above, of type ‘<UN’ where ‘N’ is the max allele length. If there are no samples in the vcf this record comes from then return None instead

Return type

Optional[np.ndarray]

GetGenotypeCounts(sample_index=None, uselength=True, index=False, fullgenotypes=False, include_nocalls=False)

Get the counts of each genotype for a record.

For samples with a lower ploidy than the max ploidy among all samples, the -2 placeholder haplotypes are sorted to the beginning of the call (e.g. (-2, 5) instead of (5, -2))

This currently returns unphased genotypes (with no phasing column), it could be extend to have an option to respect phasing

Parameters
  • sample_index (Optional[Any]) – Used to index the numpy array of samples. So can be a numpy array of sample indicies, or a bool array with length of the number of samples, etc. If None, then all samples are included.

  • uselength (bool) – If True, represent alleles as lengths else represent as strings

  • index (bool) – If True, represent alleles as indexes (0 = ref, 1 = first_alt, etc.) instead of sequences or lengths

  • fullgenotypes (bool) – If True, include flanking basepairs in allele representations. Only makes sense when expliictly stating uselength=False. Cannot be combined with index.

  • include_nocalls (bool) – If False, all genotypes with one or more uncalled haplotypes (-1 or ‘.’) are excluded from the returned dictionary, they are included if True. Genotypes with lower ploidy (-2 or ‘,’) are included regardless.

Returns

genotype_counts – Gives the count of each genotype. Genotypes are represented as tuples of alleles, where the type of allele representation is determined by the uselength, index and fullgenotypes optional parameters.

Return type

Dict[tuple, int]

GetGenotypeIndicies()

Get an array of genotype indicies across all samples.

A genotype index is a number 0, 1, 2 … where 0 corresponds to the reference allele, 1 to the first alt allele, 2 to the second, etc. The array is an array of ints with one row per sample. The number of columns is the maximum ploidy of any sample (normally 2) plus 1 for phasing. All but the final column represent the index of the genotypes of each call. The final column has values 0 for unphased sampels or 1 for phased. So a sample with gt ‘0|2’ would be represented by the row [0, 2, 1] and a sample with gt ‘3/0’ would be represented by the row [3, 0, 0]. Uncalled haplotypes (represented by ‘.’ in the VCF) are represented by ‘-1’ genotypes. If the sample has fewer haplotypes than the maximum ploidy of all samples at this locus, then the row is padded with -2s, so a haploid sample with gt ‘1’ where other samples are diploid would be represented by the row [1, -2, 0]. If all the genotype columns for a sample are negative then the sample is a no call. Note: the value of the phasing column is unspecified for haploid or no-call samples.

Returns

The numpy array described above, of type int. If there are no samples in the vcf this record comes from then return None instead

Return type

Optional[np.ndarray]

GetLengthGenotypes()

Get an array of length genotypes for each sample.

Represents the sample’s genotype in terms of the number of repeats of the motif in each allele. Returns a pair of floats - alleles including partial repeats or other impurities may have noninteger lengths.

The array is as described in GetGenotypeIndicies() except that indicies are replaced by their length genotypes. -1s, -2s and the phasing bits are not modified.

For records with both regular and full sequences (those with flanking bps), this returns the length of the regular sequences

Returns

The numpy array described above, of type float If there are no samples in the vcf this record comes from then return None instead

Return type

Optional[np.ndarray]

GetMaxAllele(sample_index=None)

Get the maximum allele length called in a record.

Represents lengths in terms of the number of repeats of the motif. The longest allele may have a noninteger length if it includes partial repeats or other impurities.

For records with both regular and full sequences (those with flanking bps), this returns the length of the regular sequences

Parameters

sample_index (Optional[Any]) – Used to index the numpy array of samples. So can be a numpy array of sample indicies, or a bool array with length of the number of samples, etc. If None, then all samples are included.

Returns

maxallele – The maximum allele length called (in number of repeat units), or nan if no alleles called

Return type

float

GetMaxPloidy()

Return the maximum ploidy of any sample at this locus.

All genotypes will be a tuple of that many haplotypes, where samples with a smaller ploidy than that will have haplotypes at the end of the tuple set to ‘,’ (for string genotypes) or -2 (for index or length genotypes)

Return type

int

GetNumSamples()

Return the number of samples at this locus (called or not).

Same as the number of samples in the overall vcf

Return type

int

GetQualityScores()

Get the quality scores of the calls for each sample.

The meaning and reliability of these scores is genotyper dependent, see the doc section Quality Scores.

Returns

An array of quality score floats, one row per sample Samples which were not called have the value np.nan

Return type

np.ndarray

GetSamplePloidies()

Get an array listing the ploidies of each sample

Returns

An array of positive ints with length equal to the number of samples where each entry denotes the number of genotypes for each sample at this locus (including no calls) If there are no samples in the vcf this record comes from then return None instead

Return type

Optional[np.ndarray]

GetStringGenotypes()

Get an array of string genotypes for each sample.

The array is as described in GetGenotypeIndicies() except that the indicies are replaced by their corresponding sequences, -1 indicies (nocalls) are replaced by ‘.’, -2 indicies (not present due to smaller ploidy) are replaced by ‘,’, and the phasing bits (0 or 1) are replaced by the strings ‘0’ or ‘1’.

Will not include flanking base pairs. To get genotypes that include flanking base pairs (for callers that call those), use GetFullStringGenotypes(). For callers that include flanking base pairs it is possible that some of the alleles in the regular string genotypes (with the flanks stripped) will be identical. In this case, you may use UniqueStringGenotypeMapping() to get a canonical unique subset of indicies which represent all possible alleles.

Note that some TR callers will only call allele lengths, not allele sequences. In such a case, this method will return a fabricated sequence based on the called length (see trtools.utils.utils.FabricateAllele()) and a warning will be raised. This may not be intended - use GetLengthGenotypes() for a fully caller agnostic way of handling genotypes.

This method is inefficient for many samples, consider either using length genotypes (GetLengthGenotypes()), or using genotype indicies (GetGenotypeIndicies()) and accessing string genotypes as needed through the fields ref_allele and alt_alleles, instead.

Returns

The numpy array described above, of type ‘<UN’ where ‘N’ is the max allele length. If there are no samples in the vcf this record comes from then return None instead

Return type

Optional[np.ndarray]

HasFabricatedAltAlleles()

Determine if this record has fabricated alt alleles.

Returns

True iff alt_allele_lengths was passed to this record’s constructor.

Return type

bool

HasFabricatedRefAllele()

Determine if this record has a fabricated ref allels.

Returns

True iff ref_allele_length was passed to this record’s constructor.

Return type

bool

HasFullStringGenotypes()

Determine if this record has full string genotypes.

Returns

True iff GetFullStringGenotypes() will return a different value than GetStringGenotypes() for some alleles.

Return type

bool

HasQualityScores()

Does this TRRecord contain quality scores for each of its calls? If present, the meaning and reliability of these scores is genotyper dependent, see the doc section Quality Scores.

Returns

Whether or not a FORMAT field that could be interpreted as a quality score has been identified

Return type

boolean

UniqueLengthGenotypeMapping()

Get a mapping whose values are unique string genotype indicies.

Returns

genotypeMapping – A mapping allele idx -> allele idx whose keys are all allele indicies and whose values are a subset of indicies which represents all the unique length alleles for this variant. For variants where multiple alleles have the same length, all will map to a single index from among those alleles.

Return type

Dict[int, int]

UniqueLengthGenotypes()

Find allele indicies corresponding to the unique length alleles.

Equivalent to calling set(UniqueLengthGenotypeMapping().values())

Returns

The indicies of the unique string alleles

Return type

Set[int]

UniqueStringGenotypeMapping()

Get a mapping whose values are unique string genotype indicies.

Returns

A mapping allele idx -> allele idx whose keys are all allele indicies and whose values are a subset of indicies which represents all the unique regular string alleles for this variant. For almost all records, this will be a mapping from each index to itself. For some records with full string genotypes that include flanking base pairs, some of the regular string alleles will be identical. In this case, only one of those allele’s indicies will be in the set of values of this dictionary, and all identical alleles will map to that one index.

Return type

Dict[int, int]

UniqueStringGenotypes()

Find allele indicies corresponding to the unique alleles.

Equivalent to calling set(UniqueStringGenotypeMapping().values())

Returns

The indicies of the unique string alleles

Return type

Set[int]

_CheckRecord()

Check that this record is properly constructed.

Checks that the same number of alt alleles were specified as in the underlying record and that the full_alleles, if supplied, contain their corresponding standard alleles

Raises an error if a check fails

class trtools.utils.tr_harmonizer.TRRecordHarmonizer(vcffile, vcftype='auto')

Bases: object

Class producing a uniform interface for accessing TR VCF records.

Produces the same output interface regardless of the tool that created the input VCF.

The main purpose of this class is to infer which tool a VCF came from, and appropriately convert its records to TRRecord objects.

This class provides the object oriented paradigm for iterating through a TR vcf. If you wish to use the functional paradigm and provide the cyvcf2.Variant objects yourself, use the top-level functions in this module.

Parameters
  • vcffile (cyvcf2.VCF instance) –

  • vcftype ({'auto', 'gangstr', 'advntr', 'hipstr', 'eh', 'popstr'}, optional) – Type of the VCF file. Default=’auto’. If vcftype==’auto’, attempts to infer the type.

vcffile
Type

cyvcf2.VCF instance

vcftype

Type of the VCF file. Must be included in VcfTypes

Type

enum

Raises

TypeError – If the type of the VCF cannot be properly inferred. See InferVCFType() for more details.

Parameters
HasLengthAltGenotypes()

Determine if the alt alleles of variants are given by length.

See also

tr_harmonizer.HasLengthAltGenotypes

Return type

bool

HasLengthRefGenotype()

Determine if the reference alleles of variants are given by length.

See also

tr_harmonizer.HasLengthRefGenotype

Return type

bool

HasQualityScore()

Does this VCF contain quality scores for each of its calls? If present, the meaning and reliability of these scores is genotyper dependent, see the doc section Quality Scores.

Returns

Whether or not a FORMAT field that could be interpreted as a quality score has been identified

Return type

bool

IsBeagleVCF()

Is this a VCF produced by running the Beagle software to impute STRs from a panel generated by an TR genotyper?

See also

tr_harmonizer.IsBeagleVCF

Return type

bool

MayHaveImpureRepeats()

Determine if any of the alleles in this VCF may contain impure repeats.

See also

tr_harmonizer.MayHaveImpureRepeats

Return type

bool

class trtools.utils.tr_harmonizer.VcfTypes(value)

Bases: enum.Enum

The different tr callers that tr_harmonizer supports.

advntr = 'advntr'
eh = 'eh'
gangstr = 'gangstr'
hipstr = 'hipstr'
longtr = 'longtr'
popstr = 'popstr'
class trtools.utils.tr_harmonizer._Cyvcf2FormatDict(record)

Bases: object

Provide an immutable dict-like interface for accessing format fields from a cyvcf2 record. To iterate over this dict, use iter(this) or this.keys().

Parameters

record (cyvcf2.cyvcf2.Variant) –

get(key)
Parameters

key (str) –

keys()
trtools.utils.tr_harmonizer._HarmonizeAdVNTRRecord(vcfrecord)

Turn a cyvcf2.Variant with adVNTR content into a TRRecord.

Parameters

vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object

Returns

Return type

TRRecord

trtools.utils.tr_harmonizer._HarmonizeEHRecord(vcfrecord)

Turn a cyvcf2.Variant with EH content into a TRRecord.

Parameters

vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object

Returns

Return type

TRRecord

trtools.utils.tr_harmonizer._HarmonizeGangSTRRecord(vcfrecord)

Turn a cyvcf2.Variant with GangSTR content into a TRRecord.

Parameters

vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object

Returns

Return type

TRRecord

trtools.utils.tr_harmonizer._HarmonizeHipSTRRecord(vcfrecord)

Turn a cyvcf2.Variant with HipSTR content into a TRRecord.

Parameters

vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object

Returns

Return type

TRRecord

trtools.utils.tr_harmonizer._HarmonizePopSTRRecord(vcfrecord)

Turn a cyvcf2.Variant with popSTR content into a TRRecord.

Parameters

vcfrecord (cyvcf2.cyvcf2.Variant) – A cyvcf2.Variant Object

Returns

Return type

TRRecord