trtools.compareSTR module
- trtools.compareSTR.CalcR2(format_bin_results)
Calculate the squared Pearson correlation coefficient for the values in this bin.
- Calculation is done using the formulas:
n = numcalls
var(X) = sum(X_i**2)/n - [sum(X_i)/n]**2
covar(X,Y) = sum(X_i*Y_i)/n - sum(X_i)/n * sum(Y_i)/n
r^2 = covar(X,Y)**2/(var(X) * var(Y))
- Parameters
format_bin_results (Dict[str, int]) – See the method NewOverallFormatBin
- Returns
r^2, or np.nan if either of the two VCFs has no variance in this format bin
- Return type
float
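The formulas above can be sketched in plain Python as follows. This is an illustrative sketch, not the module's implementation: the function name `r2_from_bin` is hypothetical, and it assumes that `total_len_1` and `total_len_2` hold sum(X_i) and sum(Y_i), that `total_len_11` and `total_len_22` hold the sums of squares, and that `total_len_12` holds sum(X_i*Y_i). It uses `math.nan` in place of `np.nan` to stay dependency-free.

```python
import math

def r2_from_bin(format_bin_results):
    """Hypothetical sketch of the r^2 formulas listed above."""
    n = format_bin_results["numcalls"]
    if n == 0:
        # Assumption: an empty bin is treated like a bin with no variance.
        return math.nan
    mean_x = format_bin_results["total_len_1"] / n
    mean_y = format_bin_results["total_len_2"] / n
    # var(X) = sum(X_i**2)/n - [sum(X_i)/n]**2
    var_x = format_bin_results["total_len_11"] / n - mean_x ** 2
    var_y = format_bin_results["total_len_22"] / n - mean_y ** 2
    if var_x == 0 or var_y == 0:
        return math.nan
    # covar(X,Y) = sum(X_i*Y_i)/n - sum(X_i)/n * sum(Y_i)/n
    covar = format_bin_results["total_len_12"] / n - mean_x * mean_y
    return covar ** 2 / (var_x * var_y)

# X = [1, 2, 3] and Y = [2, 4, 6] are perfectly correlated.
example_bin = {
    "numcalls": 3,
    "total_len_1": 6, "total_len_2": 12,
    "total_len_11": 14, "total_len_12": 28, "total_len_22": 56,
}
r2 = r2_from_bin(example_bin)
```

Working from running sums means the statistic can be computed after a single pass over the calls, without storing each individual value.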
- trtools.compareSTR.GetBubbleLegend(coordinate_counts)
Get three good bubble legend sizes to use
They should be nice round numbers spanning the orders of magnitude of the dataset
- Parameters
coordinate_counts – set of counts for coordinates in the graph
- Returns
legend_values – List of three or fewer representative sample sizes to use for bubble legend
- Return type
list of int
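One way to pick round legend values that span the orders of magnitude of the counts is sketched below. The function name `bubble_legend_sketch` and the choice of powers of ten are assumptions made for illustration; the module's own selection rule may differ.

```python
import math

def bubble_legend_sketch(coordinate_counts):
    """Hypothetical sketch: return up to three powers of ten spanning the counts."""
    counts = [c for c in coordinate_counts if c > 0]
    if not counts:
        return []
    lo_exp = math.floor(math.log10(min(counts)))
    hi_exp = math.floor(math.log10(max(counts)))
    # Smallest, middle, and largest exponent; the set removes duplicates
    # when the data cover fewer than three orders of magnitude.
    exps = sorted({lo_exp, (lo_exp + hi_exp) // 2, hi_exp})
    return [10 ** e for e in exps]

legend = bubble_legend_sketch({1, 40, 5000})
```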
- trtools.compareSTR.GetFormatFields(format_fields, format_binsizes, format_fileoption, vcfreaders)
Get which FORMAT fields to stratify on
Also perform some checking on user arguments
- Parameters
format_fields (str) – Comma-separated list of FORMAT fields to stratify on
format_binsizes (str) – Comma-separated list of min:max:binsize, one for each FORMAT field.
format_fileoption ({0, 1, 2}) – Whether each format field needs to be in both readers (0), reader 1 (1) or reader 2 (2)
vcfreaders (list of vcf.Reader) – List of readers. Needed to check if required FORMAT fields are present
- Returns
formats (list of str) – List of FORMAT fields to stratify on
format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field
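To illustrate how the comma-separated arguments relate to the returned lists, here is a minimal sketch of parsing `min:max:binsize` strings into bin edges. The helper name `parse_format_bins` is hypothetical, the field names `DP` and `Q` are only examples, and the sketch assumes that `max - min` is a multiple of `binsize`. It does not check the VCF readers, which the real function also does.

```python
def parse_format_bins(format_fields, format_binsizes):
    """Hypothetical sketch: turn 'min:max:binsize' specs into lists of bin edges."""
    formats = format_fields.split(",")
    specs = format_binsizes.split(",")
    if len(formats) != len(specs):
        raise ValueError("Need exactly one min:max:binsize spec per FORMAT field")
    format_bins = []
    for spec in specs:
        parts = spec.split(":")
        if len(parts) != 3:
            raise ValueError("Expected min:max:binsize, got " + repr(spec))
        lo, hi, size = (float(p) for p in parts)
        if size <= 0 or hi <= lo:
            raise ValueError("Invalid bin spec " + repr(spec))
        # Assumption: the range divides evenly into bins of width `size`.
        nbins = round((hi - lo) / size)
        format_bins.append([lo + i * size for i in range(nbins + 1)])
    return formats, format_bins

formats, format_bins = parse_format_bins("DP,Q", "0:100:50,0:1:0.5")
```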
- trtools.compareSTR.NewOverallFormatBin()
Return an empty bin for the overall dictionary.
- Returns
Contains the fields: conc_len_count, conc_seq_count, numcalls, total_len_1, total_len_2, total_len_11, total_len_12, total_len_22
- Return type
Dict[str, Union[int, float]]
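The sketch below shows how such a bin might be filled as calls are compared. It is an assumption-laden illustration: the helper names are hypothetical, only a subset of the fields is shown, and the interpretation of the `total_len_*` fields as running sums (of lengths, squared lengths, and length products) is inferred from the formulas documented for CalcR2 rather than stated here.

```python
def new_bin_sketch():
    """Hypothetical sketch of an empty bin (subset of the documented fields)."""
    return {
        "numcalls": 0,
        "conc_len_count": 0,
        "total_len_1": 0, "total_len_2": 0,
        "total_len_11": 0, "total_len_12": 0, "total_len_22": 0,
    }

def add_call_sketch(bin_, len1, len2):
    """Add one pair of compared lengths (one from each VCF) to the bin."""
    bin_["numcalls"] += 1
    bin_["conc_len_count"] += int(len1 == len2)  # assumed: counts length-concordant calls
    bin_["total_len_1"] += len1
    bin_["total_len_2"] += len2
    bin_["total_len_11"] += len1 * len1
    bin_["total_len_12"] += len1 * len2
    bin_["total_len_22"] += len2 * len2

example = new_bin_sketch()
add_call_sketch(example, 2, 2)
add_call_sketch(example, 3, 4)
```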
- trtools.compareSTR.NewOverallPeriod(format_fields, format_bins)
Return an empty dictionary containing bins for each format stratification and for ‘ALL’ (no format stratification).
- Returns
The empty dictionary.
- Return type
dict
- trtools.compareSTR.OutputBubblePlot(bubble_results, outprefix, minval=None, maxval=None)
Output bubble plot of gtsum1 vs. gtsum2
- Parameters
bubble_results – counts of sum1 vs sum2
outprefix (str) – Prefix to name output file
- trtools.compareSTR.OutputLocusMetrics(locus_results, outprefix, noplot)
Output per-locus metrics
Outputs a text file and a plot of per-locus metrics: outprefix + “-locuscompare.tab” and outprefix + “-locuscompare.pdf”
- Parameters
locus_results (Dict[str, Any]) – The info needed to write the output file
outprefix (str) – Prefix to name output file
noplot (bool) – If True, don’t output plots
- trtools.compareSTR.OutputOverallMetrics(overall_results, format_fields, format_bins, outprefix)
Output overall accuracy metrics
Outputs metrics overall, by period, and by FORMAT bin. Results are written to outprefix + “-overall.tab”
- Parameters
overall_results (Dict[str, Any]) – Info needed to write the tabfile
format_fields (List[str]) – List of FORMAT fields to stratify by
format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field
outprefix (str) – Prefix to name output file
- trtools.compareSTR.OutputSampleMetrics(sample_results, sample_names, outprefix, noplot)
Output per-sample metrics
Outputs a text file and a plot of per-sample metrics: outprefix + “-samplecompare.tab” and outprefix + “-samplecompare.pdf”
- Parameters
sample_results (Dict[str, Any]) – The info needed to write the output file
sample_names (List[str]) – List of sample names
outprefix (str) – Prefix to name output file
noplot (bool) – If True, don’t output plots
- trtools.compareSTR.UpdateComparisonResults(record1, record2, sample_idxs, ignore_phasing, stratify_by_period, format_fields, format_bins, stratify_file, overall_results, locus_results, sample_results, bubble_results)
Extract comparable results from a pair of VCF records
- Parameters
record1 (trh.TRRecord) – First record to compare
record2 (trh.TRRecord) – Second record to compare
sample_idxs (list of np.array) – Two arrays, one for each VCF. Each array is a list of indices such that vcf1.samples[index_array1] == vcf2.samples[index_array2], and these are exactly the samples shared by both VCFs
stratify_by_period (bool) – If True, also stratify results by period
format_fields (list of str) – List of format fields to extract
format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field
stratify_file ({0, 1, 2}) – Specify whether to apply FORMAT stratification to both files (0), only to file 1 (1), or only to file 2 (2)
overall_results (dict) – Period and format nested dictionary to update.
locus_results (dict) – Locus-stratified results dictionary to update.
sample_results (dict) – Sample-stratified results dictionary to update.
bubble_results (dict) – dictionary of counts to update
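The relationship required of `sample_idxs` can be illustrated with a small sketch that builds two index lists for the shared samples. The helper name `shared_sample_indices` is hypothetical and plain lists stand in for the NumPy arrays; how the module itself constructs these arrays is not described here.

```python
def shared_sample_indices(samples1, samples2):
    """Hypothetical sketch: index lists that align the samples shared by two VCFs."""
    position_in_2 = {name: j for j, name in enumerate(samples2)}
    idx1, idx2 = [], []
    for i, name in enumerate(samples1):
        j = position_in_2.get(name)
        if j is not None:  # keep only samples present in both VCFs
            idx1.append(i)
            idx2.append(j)
    return idx1, idx2

samples1 = ["A", "B", "C"]
samples2 = ["C", "X", "A"]
idx1, idx2 = shared_sample_indices(samples1, samples2)
aligned1 = [samples1[i] for i in idx1]
aligned2 = [samples2[j] for j in idx2]
```

After indexing, position k in `aligned1` and position k in `aligned2` refer to the same sample, which is the property the parameter description asks for.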
- trtools.compareSTR.check_region(contigs1, contigs2, region_str)
- trtools.compareSTR.getargs()
- trtools.compareSTR.handle_overlaps(records, chrom_indices, min_chrom_index)
Determines whether the (two) records in the list are comparable. Currently only works with lists of exactly two records.
- Parameters
records (List[Optional[trh.TRRecord]]) – List of TRRecords whose comparability is to be determined. If any of them is None, they are not comparable
chrom_indices (List[int]) – List of indices of chromosomes of current records
min_chrom_index (int) – Smallest index in chrom_indices. All records should have the same chrom_index, otherwise they are not comparable
- Returns
comparable – Whether the records are comparable
- Return type
bool
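The two conditions stated in the parameter descriptions can be sketched as below. This covers only those documented preconditions; any further position or overlap logic performed by the real function is not described here and is deliberately left out. The function name `documented_precheck` is hypothetical, and placeholder objects stand in for TRRecords.

```python
def documented_precheck(records, chrom_indices, min_chrom_index):
    """Hypothetical sketch of the documented comparability preconditions only."""
    if len(records) != 2:
        raise ValueError("Only lists of exactly two records are supported")
    # A missing record makes the pair not comparable.
    if any(record is None for record in records):
        return False
    # All records must be on the same chromosome, i.e. share min_chrom_index.
    return all(index == min_chrom_index for index in chrom_indices)

rec_a, rec_b = object(), object()  # placeholders for TRRecord instances
same_chrom = documented_precheck([rec_a, rec_b], [3, 3], 3)
diff_chrom = documented_precheck([rec_a, rec_b], [3, 4], 3)
missing = documented_precheck([rec_a, None], [3, 3], 3)
```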
- trtools.compareSTR.main(args)
- trtools.compareSTR.run()