trtools.compareSTR module

trtools.compareSTR.CalcR2(format_bin_results)

Calculate the squared Pearson correlation coefficient for the values in this bin.

Calculation is done using the formulas:

    n = numcalls
    var(X) = sum(X_i**2)/n - [sum(X_i)/n]**2
    covar(X,Y) = sum(X_i*Y_i)/n - sum(X_i)/n * sum(Y_i)/n
    r^2 = covar(X,Y)**2/(var(X) * var(Y))

Parameters: format_bin_results (Dict[str, int]) – See the method NewOverallFormatBin
Returns: r^2, or np.nan if one of the two VCFs has no variance in this format bin
Return type: float
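
The formulas above can be evaluated directly from running sums. The following is a minimal sketch, not the TRTools implementation: the function name `calc_r2_sketch` is hypothetical, and it assumes that `total_len_1` and `total_len_2` hold sum(X_i) and sum(Y_i), that `total_len_11` and `total_len_22` hold the sums of squares, and that `total_len_12` holds sum(X_i*Y_i). It returns `math.nan` where the documented function returns np.nan.

```python
import math


def calc_r2_sketch(bin_results):
    """Squared Pearson correlation from running sums (illustrative only)."""
    n = bin_results["numcalls"]
    if n == 0:
        # Assumption: an empty bin is treated like a bin with no variance.
        return math.nan
    mean_x = bin_results["total_len_1"] / n
    mean_y = bin_results["total_len_2"] / n
    # var(X) = sum(X_i**2)/n - [sum(X_i)/n]**2
    var_x = bin_results["total_len_11"] / n - mean_x ** 2
    var_y = bin_results["total_len_22"] / n - mean_y ** 2
    # covar(X,Y) = sum(X_i*Y_i)/n - sum(X_i)/n * sum(Y_i)/n
    covar = bin_results["total_len_12"] / n - mean_x * mean_y
    if var_x == 0 or var_y == 0:
        return math.nan
    return covar ** 2 / (var_x * var_y)


# Running sums for X = [1, 2, 3] and Y = [2, 4, 6] (perfectly correlated).
example_bin = {
    "numcalls": 3,
    "total_len_1": 6,    # 1 + 2 + 3
    "total_len_2": 12,   # 2 + 4 + 6
    "total_len_11": 14,  # 1 + 4 + 9
    "total_len_12": 28,  # 2 + 8 + 18
    "total_len_22": 56,  # 4 + 16 + 36
}
r2 = calc_r2_sketch(example_bin)
```

Because only sums are stored, a bin can be updated one call at a time without keeping the individual values.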

trtools.compareSTR.GetBubbleLegend(coordinate_counts)

Get three good bubble legend sizes to use

They should be nice round numbers spanning the orders of magnitude of the dataset

Parameters: coordinate_counts – set of counts for coordinates in the graph
Returns: legend_values – List of three or fewer representative sample sizes to use for bubble legend
Return type: list of int
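
One way to obtain "nice round numbers spanning the orders of magnitude" is to take powers of ten at the low end, middle, and high end of the data. The sketch below is an assumption about how such a selection could work, not the rule TRTools actually applies; `bubble_legend_sketch` is a hypothetical name.

```python
import math


def bubble_legend_sketch(coordinate_counts):
    """Pick up to three powers of ten spanning the counts (illustrative only)."""
    counts = [c for c in coordinate_counts if c > 0]
    if not counts:
        return []
    low_exp = math.floor(math.log10(min(counts)))
    high_exp = math.floor(math.log10(max(counts)))
    mid_exp = (low_exp + high_exp) // 2
    # A set removes duplicates when the data span fewer than three decades.
    exponents = sorted({low_exp, mid_exp, high_exp})
    return [10 ** e for e in exponents]


legend_values = bubble_legend_sketch({3, 40, 1200})
```

When all counts fall within one order of magnitude, the sketch returns a single value, which matches the documented "three or fewer" behavior.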

trtools.compareSTR.GetFormatFields(format_fields, format_binsizes, format_fileoption, vcfreaders)

Get which FORMAT fields to stratify on

Also perform some checking on user arguments

Parameters

format_fields (str) – Comma-separated list of FORMAT fields to stratify on
format_binsizes (str) – Comma-separated list of min:max:binsize, one for each FORMAT field.
format_fileoption ({0, 1, 2}) – Whether each format field needs to be in both readers (0), reader 1 (1) or reader 2 (2)
vcfreaders (list of vcf.Reader) – List of readers. Needed to check if required FORMAT fields are present

Returns

formats (list of str) – List of FORMAT fields to stratify on
format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field
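
To illustrate how a `min:max:binsize` string could map to the bin coordinates in `format_bins`, here is a minimal sketch. The helper name `parse_format_bins_sketch` is hypothetical, and the handling of the final edge (always ending at `max`) is an assumption rather than confirmed TRTools behavior.

```python
def parse_format_bins_sketch(format_binsizes):
    """Turn 'min:max:binsize,...' into one list of bin edges per field."""
    all_bins = []
    for spec in format_binsizes.split(","):
        minval, maxval, binsize = (float(x) for x in spec.split(":"))
        if binsize <= 0 or maxval <= minval:
            raise ValueError("invalid min:max:binsize spec: " + spec)
        edges = []
        i = 0
        # Multiply rather than accumulate to limit floating-point drift.
        while minval + i * binsize < maxval:
            edges.append(minval + i * binsize)
            i += 1
        edges.append(maxval)  # Assumption: the last bin is closed at max.
        all_bins.append(edges)
    return all_bins


format_bins = parse_format_bins_sketch("0:1:0.25,0:100:50")
```

With two FORMAT fields, the result holds two lists of edges, one per field, in the same order as `format_fields`.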

trtools.compareSTR.NewOverallFormatBin()

Return an empty bin for the overall dictionary.

Returns: An empty bin containing the fields conc_len_count, conc_seq_count, numcalls, total_len_1, total_len_2, total_len_11, total_len_12, and total_len_22
Return type: Dict[str, Union[int, float]]

trtools.compareSTR.NewOverallPeriod(format_fields, format_bins)

Return an empty dictionary containing bins for each format stratification and for ‘ALL’ (no format stratification).

Returns: The empty dictionary.

trtools.compareSTR.OutputBubblePlot(bubble_results, outprefix, minval=None, maxval=None)

Output bubble plot of gtsum1 vs. gtsum2

Parameters

bubble_results – counts of sum1 vs sum2
outprefix (str) – Prefix to name output file

trtools.compareSTR.OutputLocusMetrics(locus_results, outprefix, noplot)

Output per-locus metrics

Outputs a text file and a plot of per-locus metrics to outprefix + “-locuscompare.tab” and outprefix + “-locuscompare.pdf”.

Parameters

locus_results (Dict[str, Any]) – The info needed to write the output file
outprefix (str) – Prefix to name output file
noplot (bool) – If True, don’t output plots

trtools.compareSTR.OutputOverallMetrics(overall_results, format_fields, format_bins, outprefix)

Output overall accuracy metrics

Output metrics overall, by period, and by FORMAT bins. Results are written to outprefix + “-overall.tab”.

Parameters

overall_results (Dict[str, Any]) – Info needed to write the tabfile
format_fields (List[str]) – List of FORMAT fields to stratify by
format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field
outprefix (str) – Prefix to name output file

trtools.compareSTR.OutputSampleMetrics(sample_results, sample_names, outprefix, noplot)

Output per-sample metrics

Outputs a text file and a plot of per-sample metrics to outprefix + “-samplecompare.tab” and outprefix + “-samplecompare.pdf”.

Parameters

sample_results (Dict[str, Any]) – The info needed to write the output file
sample_names (List[str]) –
outprefix (str) – Prefix to name output file
noplot (bool) – If True, don’t output plots

trtools.compareSTR.UpdateComparisonResults(record1, record2, sample_idxs, ignore_phasing, stratify_by_period, format_fields, format_bins, stratify_file, overall_results, locus_results, sample_results, bubble_results)

Extract comparable results from a pair of VCF records

Parameters

record1 (trh.TRRecord) – First record to compare
record2 (trh.TRRecord) – Second record to compare
sample_idxs (list of np.array) – Two arrays, one for each VCF. Each array is a list of indices such that vcf1.samples[index_array1] == vcf2.samples[index_array2], and these indices cover exactly the set of shared samples
stratify_by_period (bool) – If True, also stratify results by period
format_fields (list of str) – List of format fields to extract
format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field
stratify_file ({0, 1, 2}) – Specify whether to apply FORMAT stratification to both files (0), only file 1 (1), or only file 2 (2)
overall_results (dict) – Period and format nested dictionary to update.
locus_results (dict) – Locus-stratified results dictionary to update.
sample_results (dict) – Sample-stratified results dictionary to update.
bubble_results (dict) – dictionary of counts to update
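
The `sample_idxs` invariant can be illustrated with a small sketch that builds two index lists for the samples shared by two files. This is only an illustration under stated assumptions: `shared_sample_indices_sketch` is a hypothetical helper, and it uses plain Python lists where the documented parameter uses NumPy arrays.

```python
def shared_sample_indices_sketch(samples1, samples2):
    """Return index lists so samples1[i] == samples2[j] for each paired (i, j)."""
    position_in_2 = {name: j for j, name in enumerate(samples2)}
    idx1, idx2 = [], []
    for i, name in enumerate(samples1):
        if name in position_in_2:
            idx1.append(i)
            idx2.append(position_in_2[name])
    return idx1, idx2


samples1 = ["A", "B", "C"]
samples2 = ["C", "X", "A"]
idx1, idx2 = shared_sample_indices_sketch(samples1, samples2)
# Selecting with the two index lists yields the same sample names in the same order.
names1 = [samples1[i] for i in idx1]
names2 = [samples2[j] for j in idx2]
```

Aligning samples once up front means each pair of records can be compared by indexing, with no per-record name lookups.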

trtools.compareSTR.check_region(contigs1, contigs2, region_str)

trtools.compareSTR.getargs()

trtools.compareSTR.handle_overlaps(records, chrom_indices, min_chrom_index)

Determine whether the (two) records in the list are comparable. Currently only works with record lists that are two records long.

Parameters

records (List[Optional[trh.TRRecord]]) – List of TRRecords whose comparability is to be determined. If any of them is None, they are not comparable
chrom_indices (List[int]) – List of indices of chromosomes of current records
min_chrom_index (int) – Smallest index in chrom_indices. All records should have the same chrom_index, otherwise they are not comparable

Returns

comparable – Whether the records are comparable

Return type

bool

trtools.compareSTR.main(args)

trtools.compareSTR.run()