coalispr.bedgraph_analyze.bam_keys_collators¶

Module with collators to organize information extracted from bamfiles.

Attributes¶

`logger`
`IND`
`BINNERS`
`SIZERS`
`MMAPPERS`

Exceptions¶

Zeros_frame_exception

Raised when a dataframe is empty or contains only zeros.

Classes¶

Bam_samples_collator

Singleton to collect sample count controllers for downstream processing.

Module Contents¶

coalispr.bedgraph_analyze.bam_keys_collators.logger¶

coalispr.bedgraph_analyze.bam_keys_collators.IND¶

coalispr.bedgraph_analyze.bam_keys_collators.BINNERS = 'binners'¶

coalispr.bedgraph_analyze.bam_keys_collators.SIZERS = 'sizers'¶

coalispr.bedgraph_analyze.bam_keys_collators.MMAPPERS = 'mmappers'¶

class coalispr.bedgraph_analyze.bam_keys_collators.Bam_samples_collator¶

Bases: object

Singleton to collect sample count controllers for downstream processing.

Parameters:

sample_controllers (list) – Collection of sample controllers to combine similar counters in one dataframe for export and saving as tsv.
binners (dict) – Collection of key counters for given label to combine comparable counts in one dataframe for export and saving as tsv.
sizers (dict) – Collection of key length-counters for given label to combine comparable counts in one dataframe for export and saving as tsv.
mmappers (dict) – Collection of key multimap counters for given label to combine comparable counts in one dataframe for export and saving as tsv.

Notes

Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here for general read counters.

CNTREAD        = [LIBR, UNIQ, XTRA, UNSEL]  # MULMAP = LIBR - UNIQ
CNTCDNA        = [COLLR, UNIQ+COLLR]        # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
CNTGAP         = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
#CNTSKIP        = [SKIP]

CNTRS          = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]

LENREAD        = [LIBR, UNIQ, XTRA, UNSEL]
LENCDNA        = [COLLR, UNIQ+COLLR]
LENGAP         = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]

LENCNTRS       = [LENREAD, LENCDNA, LENGAP]

MMAPCNTRS      = [ [LIBR, INTR] ]

For region counting, a restricted selection has been defined with CNTSKIP = [SKIP].

REGCNTRS       = [ [LIBR, UNIQ],  [COLLR, UNIQ+COLLR] , CNTSKIP ]
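The groups above are lists of column labels, and the comments note that multimapper counts follow from a difference (MULMAP = LIBR - UNIQ). A minimal sketch of that arithmetic with pandas; the string values given to the label constants and the sample counts are assumptions for illustration, not the values coalispr configures.

```python
import pandas as pd

# Hypothetical label values; coalispr defines the real ones in its
# configuration files (2_shared.txt, 3_EXP.txt).
LIBR, UNIQ, XTRA, UNSEL = "libr", "uniq", "xtra", "unsel"
COLLR, SKIP = "collr", "skip"

CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
CNTCDNA = [COLLR, UNIQ + COLLR]
REGCNTRS = [[LIBR, UNIQ], [COLLR, UNIQ + COLLR], [SKIP]]

# Made-up read counts for two samples.
counts = pd.DataFrame(
    {LIBR: [100, 80], UNIQ: [70, 65], COLLR: [40, 30], UNIQ + COLLR: [25, 20]},
    index=["sample_a", "sample_b"],
)

# Multimappers are not counted directly; they follow from the difference.
mulmap_reads = counts[LIBR] - counts[UNIQ]
mulmap_cdna = counts[COLLR] - counts[UNIQ + COLLR]
```

Combined labels such as UNIQ+COLLR are plain string concatenations in this sketch, so they can serve directly as column names.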

sample_controllers = []¶

binner_dict: {}¶

sizer_dict: {}¶

mmapper_dict: {}¶

count_frames: {}¶

classmethod add_sample_controller(a_Bam_sample_counter_controller)¶

classmethod set_sample_controllers(alist)¶
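Because every method is a classmethod and the collections are class attributes, the collator can be used without instantiation. A minimal sketch of that pattern; the class name `SamplesCollator` and the choice to replace (rather than extend) the list in the setter are assumptions, not a copy of the coalispr implementation.

```python
class SamplesCollator:
    """Hypothetical miniature of a class-level collator; never instantiated."""

    sample_controllers = []

    @classmethod
    def add_sample_controller(cls, controller):
        # Append one controller to the shared, class-level list.
        cls.sample_controllers.append(controller)

    @classmethod
    def set_sample_controllers(cls, alist):
        # Assumption: the setter replaces the list so a new run starts clean.
        cls.sample_controllers = list(alist)


SamplesCollator.set_sample_controllers(["ctrl_a"])
SamplesCollator.add_sample_controller("ctrl_b")
```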

classmethod generate_frame_building_counters()¶

Create dataframes from counters.

Parameters:

up (str)
dwn (str) – Labels for ‘strand’, either MUNR/CORB or PLUS/MINUS.

classmethod recombine_controllers()¶

After counting, process all controllers; keep stranded counts apart.

classmethod combine_counters()¶

Return dataframes built from collected sample count controllers.

classmethod floatformat(label)¶

Set formatting for dataframes to be saved to count files via dtype and rounding.

A dtype can be int, "Int64" (pd.Int64Dtype()) or float. Use int and float to let pandas choose an appropriate NumPy format. After merging frames whose indexes do not overlap perfectly, the nullable integer "Int64" marks missing values with pd.NA, whereas dtype float uses NaN (see https://pandas.pydata.org/docs/user_guide/integer_na.html).

For rounding with float and Int64, a final call to astype(dtype) is not needed for bin-counters, but it is needed for length-counters, and for int before saving to file. The float_format function (commented out) could be used instead.

Pre-indexed frames do not lead to a speed-up. Constructing a dataframe from dicts (collections.Counter) after counting might.
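The dtype behaviour described above can be reproduced with two small series; this is an illustrative sketch with made-up sample names and counts, not code taken from coalispr.

```python
import pandas as pd

# Two made-up count series whose indexes overlap only partly.
a = pd.Series([3, 5], index=["bin1", "bin2"], name="sample_a")
b = pd.Series([7, 2], index=["bin2", "bin3"], name="sample_b")

# Outer alignment introduces missing values, so pandas falls back to float64.
merged = pd.concat([a, b], axis=1)

# Nullable integers keep whole numbers and mark the gaps with pd.NA.
as_nullable = merged.astype("Int64")

# Plain int needs the gaps filled first; NaN cannot be cast to an integer.
as_int = merged.fillna(0).astype(int)
```

Whether a gap should be written as an empty field ("Int64") or as 0 (int after filling) depends on how the saved counts are read downstream.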

classmethod save_to_tsv(tsvpath, bins)¶

Save the counts to TSV files.

Parameters:

tsvpath (Path) – Path for folder to store files
bins (int) – Number of sections a counted region is split into.

classmethod save_bincounters_to_tsv(tsvpath, frames, label, bins)¶

Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.

Parameters:

counter
tsvpath (Path) – Path to folder to store TSV file with count information.
frames (dict) – Dict of collated counts for each strand: {MUNR: pd.DataFrame, CORB: pd.DataFrame}
label (str) – Name of collated counters
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
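A rough sketch of the saving steps described for this method: write stranded and combined counts, then sum over bins and write the totals separately, all under lowercase filenames. The helper `save_binned`, the file-name pattern, the strand label values and the (segment, bin) index layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical strand labels and binned counts; rows are (segment, bin) pairs.
MUNR, CORB = "munr", "corb"
index = pd.MultiIndex.from_product(
    [["seg1", "seg2"], [1, 2]], names=["segment", "bin"]
)
frames = {
    MUNR: pd.DataFrame({"sample_a": [1, 2, 3, 4]}, index=index),
    CORB: pd.DataFrame({"sample_a": [0, 1, 1, 0]}, index=index),
}


def save_binned(tsvpath, frames, label):
    """Write stranded, combined and bin-summed counts; return written paths."""
    tsvpath = Path(tsvpath)
    tsvpath.mkdir(parents=True, exist_ok=True)
    combined = frames[MUNR].add(frames[CORB], fill_value=0)
    written = []
    for name, frame in {**frames, "combined": combined}.items():
        per_bin = tsvpath / f"{label}_{name}.tsv".lower()
        summed = tsvpath / f"{label}_{name}_summed.tsv".lower()
        frame.to_csv(per_bin, sep="\t")
        # Collapse the bin level so each segment gets one total.
        frame.groupby(level="segment").sum().to_csv(summed, sep="\t")
        written += [per_bin, summed]
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = save_binned(tmp, frames, "Library")
    names = sorted(p.name for p in paths)
    totals = pd.read_csv(
        Path(tmp) / "library_combined_summed.tsv", sep="\t", index_col=0
    )
```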

classmethod save_skip_counters_to_tsv(tsvpath, notfram, strand=None)¶

Save skipped read counts file; use lowercase filename.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
notfram (pd.DataFrame) – Frame with skipped read counts.
strand (str) – Strand (PLUS or MINUS) for region counting.

classmethod save_length_counters_to_tsv(tsvpath, frames, label, cntlabl, idxnam, bins=None)¶

Save length frames for separate strands and samples, then for all samples combined. Use lowercase filenames.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
frames (dict) – Dict of collated count series for each strand: {MUNR: pd.DataFrame, CORB: pd.DataFrame}
label (str) – Name of collated counters
cntlabl (str) – Label to mark type of counts in saved file name.
idxnam (str) – Name for index column when dataframes get saved to TSV.
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
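To illustrate what a length frame with a named index column looks like when written as TSV, the sketch below builds one from collections.Counter objects. The read lengths, sample names and the index name "length" are made up, and building the frame from Counters is an assumption rather than the module's actual route.

```python
from collections import Counter

import pandas as pd

# Made-up read lengths per sample.
lengths = {"sample_a": [21, 22, 22, 23], "sample_b": [22, 22, 24]}
idxnam = "length"  # hypothetical name for the index column

# A dict of Counters becomes one column per sample, indexed by read length.
frame = pd.DataFrame({sample: Counter(vals) for sample, vals in lengths.items()})
frame = frame.sort_index().rename_axis(idxnam).astype("Int64")

# Tab-separated text as it would be written to file.
tsv_text = frame.to_csv(sep="\t")
```

Lengths absent from a sample stay missing (pd.NA) and are written as empty fields.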

classmethod get_region_frames(comparereads, noskip=True)¶

Parameters:

comparereads (list) – List of labels indicating the kinds of reads to compare against each other, such as [LIBR, UNIQ, MULMAP+LIBR] or [COLLR, UNIQ+COLLR, MULMAP+COLLR].
noskip (bool) – Flag to determine whether SKIP counts need to be omitted; include for printing; exclude for making count table or figures.

Notes

REGCNTRS       = [ [LIBR, UNIQ],  [COLLR, UNIQ+COLLR] , CNTSKIP ]
#CNTSKIP        = [SKIP]
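The effect of the noskip flag can be sketched as a column selection: SKIP counts are left out by default and kept only when noskip is False. The helper `select_region_columns`, the label values and the counts are hypothetical.

```python
import pandas as pd

# Hypothetical label values and made-up region counts.
LIBR, UNIQ, SKIP = "libr", "uniq", "skip"
counts = pd.DataFrame(
    {LIBR: [100, 80], UNIQ: [70, 65], SKIP: [5, 2]},
    index=["sample_a", "sample_b"],
)


def select_region_columns(frame, comparereads, noskip=True):
    """Return the compared read columns, with SKIP only when noskip is False."""
    wanted = [label for label in comparereads if label in frame.columns]
    if not noskip and SKIP in frame.columns:
        wanted.append(SKIP)
    return frame[wanted]


for_figures = select_region_columns(counts, [LIBR, UNIQ])
for_printing = select_region_columns(counts, [LIBR, UNIQ], noskip=False)
```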

classmethod get_region_count_frame(comparereads)¶

Get dataframe with all counts. Add multimappers.

Parameters:

comparereads (list) – List of labels indicating the kinds of reads to compare against each other, such as [LIBR, UNIQ, MULMAP+LIBR] or [COLLR, UNIQ+COLLR, MULMAP+COLLR].

classmethod save_region_counters_to_tsv(tsvpath, region, strand, comparereads)¶

Save dataframes with region counts for separate samples. Use lowercase filenames.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
region (str) – Formatted descriptor of genome span counted.
strand (str) – Strand to count/analyze mapped reads for, one of COMBI, PLUS or MINUS.
comparereads (list) – List of labels indicating which reads to compare against each other, such as [LIBR, UNIQ, MULMAP+LIBR] or [COLLR, UNIQ+COLLR, MULMAP+COLLR].

classmethod save_frame_with_region_lengthcounts_to_tsv(tsvpath, frame, label, cntlabl, idxnam, region, strand)¶

Save dataframes with region counts for separate samples. Use lowercase filenames.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
frame (pandas.DataFrame) – Dataframe that holds counts.
label (str) – Name of counter.
cntlabl (str) – Label to mark type of counts in saved file name.
idxnam (str) – Name for index column when dataframes get saved to TSV.
region (str) – Formatted descriptor of genome span counted.
strand (str) – One of COMBI, PLUS or MINUS.

exception coalispr.bedgraph_analyze.bam_keys_collators.Zeros_frame_exception¶

Bases: Exception

Raised when a dataframe is empty or contains only zeros.
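A check of this kind could look as follows. The class `ZerosFrameError` is a local stand-in defined for the example, and `require_counts` is a hypothetical helper; neither is taken from coalispr.

```python
import pandas as pd


class ZerosFrameError(Exception):
    """Hypothetical stand-in for Zeros_frame_exception."""


def require_counts(frame):
    """Return the frame unchanged, or raise when it holds no usable counts."""
    # Missing values are treated as 0 so a frame of only NaN/0 also fails.
    if frame.empty or not frame.fillna(0).to_numpy().any():
        raise ZerosFrameError("dataframe is empty or has only 0 values")
    return frame


ok = require_counts(pd.DataFrame({"sample_a": [0, 3]}))

try:
    require_counts(pd.DataFrame({"sample_a": [0, 0]}))
    raised = False
except ZerosFrameError:
    raised = True
```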