trtools.compareSTR module

trtools.compareSTR.CalcR2(format_bin_results)

Calculate the squared Pearson correlation coefficient for the values in this bin.

Calculation is done using the formulas:

    n = numcalls
    var(X) = sum(X_i**2)/n - [sum(X_i)/n]**2
    covar(X,Y) = sum(X_i*Y_i)/n - sum(X_i)/n * sum(Y_i)/n
    r^2 = covar(X,Y)**2/(var(X) * var(Y))

Parameters

format_bin_results (Dict[str, int]) – See the method NewOverallFormatBin

Returns

r^2, or np.nan if one of the two VCFs has no variance in this format bin

Return type

float
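The formulas above can be sketched directly from running sums. This is an illustrative sketch, not the library's implementation: the function name `r_squared_from_sums` is hypothetical, and it assumes that `total_len_1` and `total_len_2` hold sum(X_i) and sum(Y_i), that `total_len_11` and `total_len_22` hold the sums of squares, and that `total_len_12` holds sum(X_i*Y_i). The guard for an empty bin is also an assumption.

```python
import math


def r_squared_from_sums(bin_results):
    """Squared Pearson correlation computed from running sums (sketch only)."""
    n = bin_results["numcalls"]
    if n == 0:
        # Assumption: treat an empty bin like a bin with no variance
        return math.nan
    mean_x = bin_results["total_len_1"] / n
    mean_y = bin_results["total_len_2"] / n
    var_x = bin_results["total_len_11"] / n - mean_x ** 2
    var_y = bin_results["total_len_22"] / n - mean_y ** 2
    if var_x == 0 or var_y == 0:
        # r^2 is undefined when either set of calls is constant
        return math.nan
    covar = bin_results["total_len_12"] / n - mean_x * mean_y
    return covar ** 2 / (var_x * var_y)


# Build the sums for three paired calls where Y is exactly 2 * X
xs, ys = [1, 2, 3], [2, 4, 6]
example_bin = {
    "numcalls": len(xs),
    "total_len_1": sum(xs),
    "total_len_2": sum(ys),
    "total_len_11": sum(x * x for x in xs),
    "total_len_12": sum(x * y for x, y in zip(xs, ys)),
    "total_len_22": sum(y * y for y in ys),
}
r2 = r_squared_from_sums(example_bin)  # perfectly linear data, so r^2 is about 1
```

Keeping only sums means the statistic can be updated one record at a time without storing every call.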

trtools.compareSTR.GetBubbleLegend(coordinate_counts)

Get three good bubble legend sizes to use

They should be nice round numbers spanning the orders of magnitude of the dataset

Parameters

coordinate_counts – set of counts for coordinates in the graph

Returns

legend_values – List of three or fewer representative sample sizes to use for bubble legend

Return type

list of int
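One way to obtain round values spanning the orders of magnitude of the data is to take powers of ten at the low, middle, and high ends of the range. The sketch below is an assumption about how such a choice could be made, not the selection rule GetBubbleLegend actually uses; the name `sketch_bubble_legend` is hypothetical.

```python
import math


def sketch_bubble_legend(coordinate_counts):
    """Pick up to three powers of ten spanning the positive counts (sketch only)."""
    counts = [c for c in coordinate_counts if c > 0]
    if not counts:
        return []
    low_exp = math.floor(math.log10(min(counts)))
    high_exp = math.floor(math.log10(max(counts)))
    mid_exp = (low_exp + high_exp) // 2
    # A set drops duplicates when the data span fewer than three decades
    return sorted({10 ** e for e in (low_exp, mid_exp, high_exp)})


legend_values = sketch_bubble_legend({1, 12, 350, 4200})  # [1, 10, 1000]
```

When all counts fall within one order of magnitude, the result has fewer than three entries, which matches the "three or fewer" wording above.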

trtools.compareSTR.GetFormatFields(format_fields, format_binsizes, format_fileoption, vcfreaders)

Get which FORMAT fields to stratify on

Also perform some checking on user arguments

Parameters
  • format_fields (str) – Comma-separated list of FORMAT fields to stratify on

  • format_binsizes (str) – Comma-separated list of min:max:binsize, one for each FORMAT field.

  • format_fileoption ({0, 1, 2}) – Whether each format field needs to be in both readers (0), reader 1 (1) or reader 2 (2)

  • vcfreaders (list of vcf.Reader) – List of readers. Needed to check if required FORMAT fields are present

Returns

  • formats (list of str) – List of FORMAT fields to stratify on

  • format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field
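To illustrate the two string arguments, the following sketch turns comma-separated fields and min:max:binsize specifications into lists of bin edges. It is a hypothetical helper (`sketch_parse_binsizes`), not the function documented here: it skips the VCF reader checks, and clipping the final edge to max is an assumption.

```python
import math


def sketch_parse_binsizes(format_fields, format_binsizes):
    """Parse 'F1,F2' and 'min:max:binsize,...' into fields and bin edges (sketch)."""
    fields = format_fields.split(",")
    specs = format_binsizes.split(",")
    if len(fields) != len(specs):
        raise ValueError("need one min:max:binsize entry per FORMAT field")
    format_bins = []
    for spec in specs:
        parts = spec.split(":")
        if len(parts) != 3:
            raise ValueError("expected min:max:binsize, got %r" % spec)
        minval, maxval, binsize = (float(p) for p in parts)
        if binsize <= 0 or maxval <= minval:
            raise ValueError("invalid bin specification %r" % spec)
        nbins = math.ceil((maxval - minval) / binsize)
        # Assumption: the last edge is clipped so it never exceeds max
        edges = [min(minval + i * binsize, maxval) for i in range(nbins + 1)]
        format_bins.append(edges)
    return fields, format_bins


formats, format_bins = sketch_parse_binsizes("DP,Q", "0:50:25,0:1:0.5")
# formats == ["DP", "Q"]; format_bins == [[0.0, 25.0, 50.0], [0.0, 0.5, 1.0]]
```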

trtools.compareSTR.NewOverallFormatBin()

Return an empty bin for the overall dictionary.

Returns

Contains the fields:

  • conc_len_count

  • conc_seq_count

  • numcalls

  • total_len_1

  • total_len_2

  • total_len_11

  • total_len_12

  • total_len_22

Return type

Dict[str, Union[int, float]]

trtools.compareSTR.NewOverallPeriod(format_fields, format_bins)

Return an empty dictionary containing bins for each format stratification and for ‘ALL’ (no format stratification).

Returns

The empty dictionary.

Return type

dict

trtools.compareSTR.OutputBubblePlot(bubble_results, outprefix, minval=None, maxval=None)

Output bubble plot of gtsum1 vs. gtsum2

Parameters
  • bubble_results – counts of sum1 vs sum2

  • outprefix (str) – Prefix to name output file

trtools.compareSTR.OutputLocusMetrics(locus_results, outprefix, noplot)

Output per-locus metrics

Outputs a text file of per-locus metrics to outprefix + “-locuscompare.tab” and a plot to outprefix + “-locuscompare.pdf”.

Parameters
  • locus_results (Dict[str, Any]) – The info needed to write the output file

  • outprefix (str) – Prefix to name output file

  • noplot (bool) – If True, don’t output plots

trtools.compareSTR.OutputOverallMetrics(overall_results, format_fields, format_bins, outprefix)

Output overall accuracy metrics

Output metrics overall, by period, and by FORMAT bins. Results are written to outprefix + “-overall.tab”.

Parameters
  • overall_results (Dict[str, Any]) – Info needed to write the tabfile

  • format_fields (List[str]) – List of FORMAT fields to stratify by

  • format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field

  • outprefix (str) – Prefix to name output file

trtools.compareSTR.OutputSampleMetrics(sample_results, sample_names, outprefix, noplot)

Output per-sample metrics

Outputs a text file of per-sample metrics to outprefix + “-samplecompare.tab” and a plot to outprefix + “-samplecompare.pdf”.

Parameters
  • sample_results (Dict[str, Any]) – The info needed to write the output file

  • sample_names (List[str]) –

  • outprefix (str) – Prefix to name output file

  • noplot (bool) – If True, don’t output plots

trtools.compareSTR.UpdateComparisonResults(record1, record2, sample_idxs, ignore_phasing, stratify_by_period, format_fields, format_bins, stratify_file, overall_results, locus_results, sample_results, bubble_results)

Extract comparable results from a pair of VCF records

Parameters
  • record1 (trh.TRRecord) – First record to compare

  • record2 (trh.TRRecord) – Second record to compare

  • sample_idxs (list of np.array) – Two arrays, one for each VCF. Each array is a list of indices such that vcf1.samples[index_array1] == vcf2.samples[index_array2], and together they cover exactly the set of shared samples

  • stratify_by_period (bool) – If True, also stratify results by period

  • format_fields (list of str) – List of format fields to extract

  • format_bins (List[List[float]]) – List of bin start/stop coords for each FORMAT field

  • stratify_file ({0, 1, 2}) – Specify whether to apply FORMAT stratification to both files (0), only to file 1 (1), or only to file 2 (2)

  • overall_results (dict) – Period and format nested dictionary to update.

  • locus_results (dict) – Locus-stratified results dictionary to update.

  • sample_results (dict) – Sample-stratified results dictionary to update.

  • bubble_results (dict) – dictionary of counts to update
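The relationship that sample_idxs must satisfy can be illustrated with plain lists. This is a sketch only: the helper name `shared_sample_indices` and the sample names are hypothetical, and the real argument is a pair of NumPy arrays rather than Python lists.

```python
def shared_sample_indices(samples1, samples2):
    """Return index lists that align the samples shared by two VCFs (sketch)."""
    position_in_2 = {name: i for i, name in enumerate(samples2)}
    idx1, idx2 = [], []
    for i, name in enumerate(samples1):
        if name in position_in_2:
            idx1.append(i)
            idx2.append(position_in_2[name])
    return idx1, idx2


samples1 = ["NA12878", "NA12891", "NA12892"]
samples2 = ["NA12892", "HG00096", "NA12878"]
idx1, idx2 = shared_sample_indices(samples1, samples2)
# idx1 == [0, 2] and idx2 == [2, 0]: indexing each sample list with its
# own index list yields the same names in the same order
aligned1 = [samples1[i] for i in idx1]
aligned2 = [samples2[i] for i in idx2]
```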

trtools.compareSTR.check_region(contigs1, contigs2, region_str)
trtools.compareSTR.getargs()
trtools.compareSTR.handle_overlaps(records, chrom_indices, min_chrom_index)

Determine whether the two records in the list are comparable. Currently this only works with record lists that are exactly two records long.

Parameters
  • records (List[Optional[trh.TRRecord]]) – List of TRRecords whose comparability is to be determined. If any of them is None, they are not comparable

  • chrom_indices (List[int]) – List of indices of chromosomes of current records

  • min_chrom_index (int) – Smallest index in chrom_indices. All records should have the same chrom_index, otherwise they are not comparable

Returns

comparable – Whether the records are comparable

Return type

bool
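The two conditions stated in the parameter descriptions can be sketched as follows. The function name `sketch_is_comparable` is hypothetical, and the sketch covers only those two conditions; the actual function may apply further checks (for example on record coordinates) that are not described here.

```python
def sketch_is_comparable(records, chrom_indices, min_chrom_index):
    """Check only the documented preconditions for comparability (sketch)."""
    if len(records) != 2:
        raise ValueError("only lists of exactly two records are supported")
    # A missing record (None) can never be compared
    if any(record is None for record in records):
        return False
    # Every record must sit on the chromosome with the smallest index
    return all(index == min_chrom_index for index in chrom_indices)


# Plain strings stand in for TRRecord objects in this illustration
same_chrom = sketch_is_comparable(["rec_a", "rec_b"], [3, 3], 3)
one_missing = sketch_is_comparable(["rec_a", None], [3, 3], 3)
diff_chrom = sketch_is_comparable(["rec_a", "rec_b"], [3, 4], 3)
```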

trtools.compareSTR.main(args)
trtools.compareSTR.run()