coalispr.bedgraph_analyze.bam_keys_collators

Module with collators to organize information extracted from bamfiles.

Attributes

Exceptions

Zeros_frame_exception

Raised when a dataframe is empty or contains only 0 values.

Classes

Bam_samples_collator

Singleton to collect sample count controllers for downstream processing.

Module Contents

coalispr.bedgraph_analyze.bam_keys_collators.logger
coalispr.bedgraph_analyze.bam_keys_collators.IND
coalispr.bedgraph_analyze.bam_keys_collators.BINNERS = 'binners'
coalispr.bedgraph_analyze.bam_keys_collators.SIZERS = 'sizers'
coalispr.bedgraph_analyze.bam_keys_collators.MMAPPERS = 'mmappers'
class coalispr.bedgraph_analyze.bam_keys_collators.Bam_samples_collator

Bases: object

Singleton to collect sample count controllers for downstream processing.

Parameters:
  • sample_controllers (list) – Collection of sample controllers to combine similar counters in one dataframe for export and saving as tsv.

  • binners (dict) – Collection of key counters for given label to combine comparable counts in one dataframe for export and saving as tsv.

  • sizers (dict) – Collection of key length-counters for given label to combine comparable counts in one dataframe for export and saving as tsv.

  • mmappers (dict) – Collection of key multimap counters for given label to combine comparable counts in one dataframe for export and saving as tsv.
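As an illustration of the singleton idea, the sketch below keeps all state on the class itself and exposes it through classmethods, so no instance is ever created. The class name `SamplesCollator` and the string placeholders for controllers are hypothetical; this is not the coalispr implementation, only a minimal picture of the pattern.

```python
class SamplesCollator:
    """Hypothetical collector; state lives on the class, not on instances."""

    sample_controllers = []

    @classmethod
    def add_sample_controller(cls, controller):
        # Append one controller to the shared, class-level list.
        cls.sample_controllers.append(controller)

    @classmethod
    def set_sample_controllers(cls, alist):
        # Replace the shared list with a copy of the given list.
        cls.sample_controllers = list(alist)


# Placeholder strings stand in for real sample counter controllers.
SamplesCollator.add_sample_controller("controller_a")
SamplesCollator.add_sample_controller("controller_b")
print(SamplesCollator.sample_controllers)
```

Because the list is a class attribute, every caller sees the same collection without passing an object around.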

Notes

Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here for general read counters.

CNTREAD        = [LIBR, UNIQ, XTRA, UNSEL]  # MULMAP = LIBR - UNIQ
CNTCDNA        = [COLLR, UNIQ+COLLR]        # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
CNTGAP         = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
#CNTSKIP        = [SKIP]

CNTRS          = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]

LENREAD        = [LIBR, UNIQ, XTRA, UNSEL]
LENCDNA        = [COLLR, UNIQ+COLLR]
LENGAP         = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]

LENCNTRS       = [LENREAD, LENCDNA, LENGAP]

MMAPCNTRS      = [ [LIBR, INTR] ]

For region counting, a restricted selection has been defined with CNTSKIP = [SKIP].

REGCNTRS       = [ [LIBR, UNIQ],  [COLLR, UNIQ+COLLR] , CNTSKIP ]
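
The comment on CNTREAD states that multimapper counts follow by subtraction (MULMAP = LIBR - UNIQ). A minimal sketch of that arithmetic with pandas is shown here; the string values given to LIBR, UNIQ and MULMAP and the sample names are hypothetical stand-ins, since the real constants are defined in the configuration files.

```python
import pandas as pd

# Hypothetical label values; the real constants come from the configuration.
LIBR, UNIQ, MULMAP = "library", "unique", "multimapper"

# Toy counts per sample (rows) for two read counters (columns).
counts = pd.DataFrame(
    {LIBR: [120, 80], UNIQ: [100, 50]},
    index=["sample_a", "sample_b"],
)

# MULMAP = LIBR - UNIQ, as stated in the CNTREAD comment.
counts[MULMAP] = counts[LIBR] - counts[UNIQ]
print(counts)
```
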
sample_controllers = []
binner_dict: {}
sizer_dict: {}
mmapper_dict: {}
count_frames: {}
classmethod add_sample_controller(a_Bam_sample_counter_controller)
classmethod set_sample_controllers(alist)
classmethod generate_frame_building_counters()

Create dataframes from counters.

Parameters:
  • up (str)

  • dwn (str) – Labels for ‘strand’, either MUNR/CORB or PLUS/MINUS.

classmethod recombine_controllers()

After counting, process all; keep stranded counts apart.

classmethod combine_counters()

Return dataframes built from collected sample count controllers.

classmethod floatformat(label)

Set formatting for dataframes to be saved to count files via dtype and rounding.

A dtype can be int, "Int64" (pd.Int64Dtype()) or float. Use int and float to let pandas choose the appropriate NumPy format. The nullable integer "Int64" takes pd.NA for a missing value instead of NaN (dtype float) after merging frames without perfect index overlap (see https://pandas.pydata.org/docs/user_guide/integer_na.html).

For rounding with float and Int64, a final call to astype(dtype) is not needed for bin-counters, but it is needed for length-counters or, in the case of int, before saving to file. The float_format function (commented out) could be used instead.

Pre-indexed frames do not lead to a speed-up. Constructing a dataframe from dicts (collections.Counter) after counting might.
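
The dtype behaviour described here can be reproduced with a small pandas sketch. The series names and index labels are made up for illustration; only standard pandas calls are used.

```python
import pandas as pd

# Two toy count series whose indexes overlap only partly.
a = pd.Series([3, 5], index=["bin1", "bin2"], name="sample_a")
b = pd.Series([7], index=["bin2"], name="sample_b")

# Outer alignment leaves a gap for sample_b at bin1, filled with NaN,
# so that column becomes float64.
merged = pd.concat([a, b], axis=1)
print(merged.dtypes)

# The nullable integer dtype keeps whole numbers and marks the gap with pd.NA.
nullable = merged.astype("Int64")
print(nullable)
```

With "Int64" the saved counts stay integers, at the cost of an explicit conversion step after merging.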

classmethod save_to_tsv(tsvpath, bins)

Save the counts to TSV files.

Parameters:
  • tsvpath (Path) – Path for folder to store files

  • bins (int) – Number of sections a counted region is split into.

classmethod save_bincounters_to_tsv(tsvpath, frames, label, bins)

Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.

Parameters:
  • counter

  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • frames (dict) – Dict of collated count series for each strand: {MUNR: pd.DataFrame, CORB: pd.DataFrame}

  • label (str) – Name of collated counters

  • bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
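
A rough sketch of the saving step described above: write each stranded frame and a combined frame to a TSV file, then sum over bins and write the totals to separate files, all with lowercase filenames. The strand labels, the frame layout (samples as rows, bins as columns) and the filename pattern are assumptions for illustration, not the names coalispr uses.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical label and stranded frames: samples as rows, bins as columns.
label = "Library"
samples = ["sample_a", "sample_b"]
frames = {
    "Plus": pd.DataFrame({"bin1": [4, 1], "bin2": [6, 2]}, index=samples),
    "Minus": pd.DataFrame({"bin1": [0, 3], "bin2": [5, 5]}, index=samples),
}
# Combined counts are the element-wise sum of both strands.
frames["Combined"] = frames["Plus"] + frames["Minus"]

with tempfile.TemporaryDirectory() as tmp:
    tsvpath = Path(tmp)
    for strand, frame in frames.items():
        # Per-bin counts; the filename is lowercased before writing.
        frame.to_csv(tsvpath / f"{label}_{strand}.tsv".lower(), sep="\t")
        # Totals per sample, summed over all bins.
        summed = frame.sum(axis=1).rename(label)
        summed.to_csv(tsvpath / f"{label}_{strand}_summed.tsv".lower(), sep="\t")
    saved = sorted(p.name for p in tsvpath.iterdir())

print(saved)
```
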

classmethod save_skip_counters_to_tsv(tsvpath, notfram, label, arg, *args)

Save skipped read counts file; use lowercase filename.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • notfram (pd.DataFrame) – Frame with skipped read counts.

  • label (str) – SKIP

  • arg (int or str (default: None)) – An int when counting bins; a str when counting a region.

  • *args – Optional extra arguments; could be the strand for region counting.

classmethod save_length_counters_to_tsv(tsvpath, frames, label, cntlabl, idxnam, bins=None)

Save length frames for separate strands and samples, then for all samples combined. Use lowercase filenames.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • frames (dict) –

    Dict of collated count series for each strand:

    {MUNR: pd.DataFrame, CORB: pd.DataFrame}

  • label (str) – Name of collated counters

  • cntlabl (str) – Label to mark type of counts in saved file name.

  • idxnam (str) – Name for index column when dataframes get saved to TSV.

  • bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).

classmethod get_region_frames(comparereads, noskip=True)
Parameters:
  • comparereads (list) – List of labels to indicate kind of reads to compare against each other. [LIBR, UNIQ, MULMAP+LIBR] [COLLR, UNIQ+COLLR, MULMAP+COLLR]

  • noskip (bool) – Flag to determine whether SKIP counts need to be omitted; include for printing; exclude for making count table or figures.

Notes

REGCNTRS       = [ [LIBR, UNIQ],  [COLLR, UNIQ+COLLR] , CNTSKIP ]
#CNTSKIP        = [SKIP]
classmethod get_region_count_frame(comparereads)

Get dataframe with all counts. Add multimappers.

Parameters:
  • comparereads (list) – List of labels to indicate kind of reads to compare against each other. [LIBR, UNIQ, MULMAP+LIBR] [COLLR, UNIQ+COLLR, MULMAP+COLLR]

classmethod save_region_counters_to_tsv(tsvpath, region, strand, comparereads)

Save dataframes with region counts for separate samples. Use lowercase filenames.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • region (str) – Formatted descriptor of genome span counted.

  • strand (str) – Strand to count/analyze mapped reads for, one of COMBI, PLUS or MINUS.

  • comparereads (list) – List of labels to indicate which reads to compare against each other. [LIBR, UNIQ, MULMAP+LIBR] [COLLR, UNIQ+COLLR, MULMAP+COLLR]

classmethod save_frame_with_region_lengthcounts_to_tsv(tsvpath, frame, label, cntlabl, idxnam, region, strand)

Save dataframes with region counts for separate samples. Use lowercase filenames.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • frame (pandas.DataFrame) – Dataframe that holds counts.

  • label (str) – Name of counter.

  • cntlabl (str) – Label to mark type of counts in saved file name.

  • idxnam (str) – Name for index column when dataframes get saved to TSV.

  • region (str) – Formatted descriptor of genome span counted.

  • strand (str) – One of COMBI, PLUS or MINUS.

exception coalispr.bedgraph_analyze.bam_keys_collators.Zeros_frame_exception

Bases: Exception

Raised when a dataframe is empty or contains only 0 values.
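
A minimal sketch of the kind of check that such an exception guards, assuming pandas dataframes. The class `ZerosFrameError` and the helper `check_frame` are hypothetical stand-ins and not part of coalispr.

```python
import pandas as pd


class ZerosFrameError(Exception):
    """Hypothetical stand-in for Zeros_frame_exception."""


def check_frame(frame):
    # Reject a frame without any cells, or one in which every value is 0.
    if frame.empty or (frame == 0).all().all():
        raise ZerosFrameError("dataframe is empty or has only 0 values")
    return frame


zeros = pd.DataFrame({"sample_a": [0, 0], "sample_b": [0, 0]})
try:
    check_frame(zeros)
    raised = False
except ZerosFrameError:
    raised = True
print(raised)
```
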