coalispr.bedgraph_analyze.bam_keys_collators¶
Module with collators to organize information extracted from bamfiles.
Attributes¶
Exceptions¶
Zeros_frame_exception | Throw an exception when dataframe is empty or has only 0 values.
Classes¶
Bam_samples_collator | Singleton to collect sample count controllers for downstream processing.
Module Contents¶
- coalispr.bedgraph_analyze.bam_keys_collators.logger¶
- coalispr.bedgraph_analyze.bam_keys_collators.IND¶
- coalispr.bedgraph_analyze.bam_keys_collators.BINNERS = 'binners'¶
- coalispr.bedgraph_analyze.bam_keys_collators.SIZERS = 'sizers'¶
- coalispr.bedgraph_analyze.bam_keys_collators.MMAPPERS = 'mmappers'¶
- class coalispr.bedgraph_analyze.bam_keys_collators.Bam_samples_collator¶
Bases: object
Singleton to collect sample count controllers for downstream processing.
- Parameters:
sample_controllers (list) – Collection of sample controllers to combine similar counters in one dataframe for export and saving as tsv.
binners (dict) – Collection of key counters for given label to combine comparable counts in one dataframe for export and saving as tsv.
sizers (dict) – Collection of key length-counters for given label to combine comparable counts in one dataframe for export and saving as tsv.
mmappers (dict) – Collection of key multimap counters for given label to combine comparable counts in one dataframe for export and saving as tsv.
Notes
Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here for general read counters.
    CNTREAD = [LIBR, UNIQ, XTRA, UNSEL] # MULMAP = LIBR - UNIQ
    CNTCDNA = [COLLR, UNIQ+COLLR] # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
    CNTGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    #CNTSKIP = [SKIP]
    CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]
    LENREAD = [LIBR, UNIQ, XTRA, UNSEL]
    LENCDNA = [COLLR, UNIQ+COLLR]
    LENGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    LENCNTRS = [LENREAD, LENCDNA, LENGAP]
    MMAPCNTRS = [ [LIBR, INTR] ]
For region counting, a restricted selection has been defined with CNTSKIP = [SKIP].
REGCNTRS = [ [LIBR, UNIQ], [COLLR, UNIQ+COLLR] , CNTSKIP ]
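The comment `MULMAP = LIBR - UNIQ` in the counter groups suggests that multimapper counts are derived from the library and unique-read counters rather than counted directly. A minimal sketch of that subtraction with pandas follows; the label strings, the sample names and the helper `add_multimappers` are hypothetical and do not come from coalispr.

```python
import pandas as pd

# Hypothetical label values; the real constants are defined in the
# coalispr configuration files (2_shared.txt, 3_EXP.txt).
LIBR, UNIQ, MULMAP = "libr", "uniq", "mulmap"


def add_multimappers(frame, total=LIBR, unique=UNIQ, derived=MULMAP):
    """Return a copy of `frame` with a derived column total - unique."""
    out = frame.copy()
    out[derived] = out[total] - out[unique]
    return out


# Made-up counts for three samples.
counts = pd.DataFrame(
    {LIBR: [120, 45, 0], UNIQ: [100, 45, 0]},
    index=["sample_a", "sample_b", "sample_c"],
)
with_mulmap = add_multimappers(counts)
```

The same subtraction would apply to the cDNA group, where the comment reads `COLLR+MULMAP = COLLR - (UNIQ+COLLR)`.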
- sample_controllers = []¶
- binner_dict: {}¶
- sizer_dict: {}¶
- mmapper_dict: {}¶
- count_frames: {}¶
- classmethod add_sample_controller(a_Bam_sample_counter_controller)¶
- classmethod set_sample_controllers(alist)¶
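The class is described as a singleton that is driven entirely through classmethods, which implies that its state lives on the class itself rather than on instances. The sketch below illustrates that pattern only; `Collator` is a hypothetical stand-in, and the actual implementation in coalispr may differ.

```python
class Collator:
    """Hypothetical stand-in showing class-level (singleton-like) state."""

    # Shared by every caller because it is stored on the class.
    sample_controllers = []

    @classmethod
    def add_sample_controller(cls, controller):
        """Append one controller to the shared collection."""
        cls.sample_controllers.append(controller)

    @classmethod
    def set_sample_controllers(cls, alist):
        """Replace the shared collection with a copy of `alist`."""
        cls.sample_controllers = list(alist)


# No instance is created; the class itself is the single collector.
Collator.add_sample_controller("controller_1")
Collator.add_sample_controller("controller_2")
after_add = list(Collator.sample_controllers)

Collator.set_sample_controllers(["controller_3"])
after_set = list(Collator.sample_controllers)
```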
- classmethod generate_frame_building_counters()¶
Create dataframes from counters.
- Parameters:
up (str), dwn (str) – Labels for ‘strand’, either MUNR/CORB or PLUS/MINUS.
- classmethod recombine_controllers()¶
After counting, process all; keep stranded counts apart.
- classmethod combine_counters()¶
Return dataframes built from collected sample count controllers.
- classmethod floatformat(label)¶
Set formatting for dataframes to be saved to count files via dtype and rounding.
A dtype can be int, "Int64" (pd.Int64Dtype()) or float. To let pandas choose an appropriate NumPy format, use int and float; the nullable integer "Int64" takes pd.NA for a missing value instead of NaN (dtype float) after merging frames without perfect index overlap (see https://pandas.pydata.org/docs/user_guide/integer_na.html).
For rounding with float and Int64, calling astype(dtype) at the end is not needed for bin-counters, but it is needed for length-counters or, in the case of int, before saving to file. The float_format function (commented out) could be used instead.
Pre-indexed frames do not lead to a speed-up. Constructing a dataframe from dicts (collections.Counter) after counting might.
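The dtype behaviour described here can be reproduced with plain pandas. The sketch below merges two count series whose indexes only partly overlap; the series names and index labels are made up for illustration and are not coalispr data.

```python
import pandas as pd

# Two count series with partially overlapping indexes (made-up values).
plus = pd.Series({"a": 3, "b": 5}, name="plus")
minus = pd.Series({"b": 2, "c": 7}, name="minus")

# An outer join introduces missing values, so integer counts become float64
# and the gaps are filled with NaN.
merged = pd.concat([plus, minus], axis=1)

# The nullable integer dtype keeps whole numbers and marks gaps with pd.NA.
nullable = merged.astype("Int64")

# A plain int dtype cannot hold missing values, so they must be filled first.
filled = merged.fillna(0).astype(int)
```

Which of the three forms is preferable depends on whether a missing count should stay distinguishable from a zero count in the saved file.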
- classmethod save_to_tsv(tsvpath, bins)¶
Save the counts to TSV files.
- Parameters:
tsvpath (Path) – Path for folder to store files
bins (int) – Number of sections a counted region is split into.
- classmethod save_bincounters_to_tsv(tsvpath, frames, label, bins)¶
Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.
- Parameters:
counter
tsvpath (Path) – Path to folder to store TSV file with count information.
frames (dict) – Dict of collated count frames for each strand: {MUNR: pd.DataFrame, CORB: pd.DataFrame}
label (str) – Name of collated counters
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
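The steps named for save_bincounters_to_tsv (stranded and combined counts, sums over bins, lowercase file names) can be sketched with pandas as follows. This is an illustration under stated assumptions, not the coalispr implementation: the frame layout (a MultiIndex of segment and bin), the file-name pattern, the strand label values and the helper `save_binned_counts` are all hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_binned_counts(tsvpath, frames, label):
    """Write per-strand and combined bin counts, plus sums over bins."""
    tsvpath = Path(tsvpath)
    tsvpath.mkdir(parents=True, exist_ok=True)

    # Combined counts: add the strand frames, treating gaps as 0.
    strands = list(frames)
    combined = frames[strands[0]]
    for strand in strands[1:]:
        combined = combined.add(frames[strand], fill_value=0)

    written = []
    for name, frame in {**frames, "combined": combined}.items():
        # Hypothetical file-name pattern; names are lowercased before use.
        binned = tsvpath / f"{label}_{name}_bins.tsv".lower()
        summed = tsvpath / f"{label}_{name}_summed.tsv".lower()
        frame.to_csv(binned, sep="\t")
        # Collapse the bins of each segment into one total per segment.
        frame.groupby(level="segment").sum().to_csv(summed, sep="\t")
        written += [binned, summed]
    return written


# Made-up counts: two segments, two bins each, one sample per strand frame.
idx = pd.MultiIndex.from_product(
    [["seg1", "seg2"], [1, 2]], names=["segment", "bin"]
)
frames = {
    "MUNR": pd.DataFrame({"S1": [1, 2, 3, 4]}, index=idx),
    "CORB": pd.DataFrame({"S1": [10, 20, 30, 40]}, index=idx),
}

with tempfile.TemporaryDirectory() as tmp:
    written = save_binned_counts(tmp, frames, "Binners")
    names = sorted(p.name for p in written)
    summed = pd.read_csv(
        Path(tmp) / "binners_combined_summed.tsv", sep="\t", index_col=0
    )
```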
- classmethod save_skip_counters_to_tsv(tsvpath, notfram, label, arg, *args)¶
Save skipped read counts file; use lowercase filename.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
notfram (pd.DataFrame) – frame with skipped read counts.
label (str) – SKIP
arg (int (when bins); or str (when region); default: None)
*args (could be strand for region counting)
- classmethod save_length_counters_to_tsv(tsvpath, frames, label, cntlabl, idxnam, bins=None)¶
Save length frames for separate strands and samples, then for all samples combined. Use lowercase filenames.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
frames (dict) –
- Dict of collated countseries for each strand:
{MUNR: pd.DataFrame, CORB: pd.DataFrame}
label (str) – Name of collated counters
cntlabl (str) – Label to mark type of counts in saved file name.
idxnam (str) – Name for index column when dataframes get saved to TSV.
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
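A length-count frame of the kind saved here can be built from per-sample collections.Counter objects, as the note under floatformat suggests. The sketch below is an assumption-laden illustration: the sample names, the read lengths and the index name "length" (playing the role of idxnam) are invented.

```python
from collections import Counter

import pandas as pd

# Made-up read lengths per sample.
read_lengths = {
    "sample_a": [21, 22, 22, 23],
    "sample_b": [22, 22, 24],
}

# Count how often each length occurs in each sample.
length_counts = {
    sample: Counter(lengths) for sample, lengths in read_lengths.items()
}

# One column per sample; lengths missing in a sample become NaN on alignment,
# so the frame is cast to the nullable integer dtype afterwards.
frame = pd.DataFrame(
    {sample: pd.Series(counter) for sample, counter in length_counts.items()}
).sort_index().astype("Int64")

# Name the index so the column gets a header when saved to TSV.
frame.index.name = "length"
```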
- classmethod get_region_frames(comparereads, noskip=True)¶
- Parameters:
comparereads (list) – List of labels to indicate kind of reads to compare against each other. [LIBR, UNIQ, MULMAP+LIBR] [COLLR, UNIQ+COLLR, MULMAP+COLLR]
noskip (bool) – Flag to determine whether SKIP counts need to be omitted; include for printing; exclude for making count table or figures.
Notes
REGCNTRS = [ [LIBR, UNIQ], [COLLR, UNIQ+COLLR] , CNTSKIP ] #CNTSKIP = [SKIP]
- classmethod get_region_count_frame(comparereads)¶
Get dataframe with all counts. Add multimappers.
- Parameters:
comparereads (list) – List of labels to indicate kind of reads to compare against each other. [LIBR, UNIQ, MULMAP+LIBR] [COLLR, UNIQ+COLLR, MULMAP+COLLR]
- classmethod save_region_counters_to_tsv(tsvpath, region, strand, comparereads)¶
Save dataframes with region counts for separate samples. Use lowercase filenames.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
region (str) – Formatted descriptor of genome span counted.
strand (str) – Strand to count/analyze mapped reads for, one of COMBI, PLUS or MINUS.
comparereads (list) – List of labels to indicate which reads to compare against each other. [LIBR, UNIQ, MULMAP+LIBR] [COLLR, UNIQ+COLLR, MULMAP+COLLR]
- classmethod save_frame_with_region_lengthcounts_to_tsv(tsvpath, frame, label, cntlabl, idxnam, region, strand)¶
Save dataframes with region counts for separate samples. Use lowercase filenames.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
frame (pandas.DataFrame) – Dataframe that holds counts.
label (str) – Name of counter.
cntlabl (str) – Label to mark type of counts in saved file name.
idxnam (str) – Name for index column when dataframes get saved to TSV.
region (str) – Formatted descriptor of genome span counted.
strand (str) – One of COMBI, PLUS or MINUS.
- exception coalispr.bedgraph_analyze.bam_keys_collators.Zeros_frame_exception¶
Bases: Exception
Throw an exception when dataframe is empty or has only 0 values.
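The condition this exception describes (an empty dataframe, or one with only 0 values) can be tested as in the sketch below. The names `ZerosFrameError` and `check_frame` are hypothetical stand-ins; where and how coalispr raises Zeros_frame_exception is not shown in this reference.

```python
import pandas as pd


class ZerosFrameError(Exception):
    """Hypothetical stand-in for Zeros_frame_exception."""


def check_frame(frame):
    """Return `frame` unchanged, or raise if it holds no non-zero count."""
    # Missing values are treated as 0 for the purpose of this check.
    if frame.empty or not (frame.fillna(0) != 0).any().any():
        raise ZerosFrameError("dataframe is empty or has only 0 values")
    return frame


def raises(frame):
    """Report whether check_frame rejects the given frame."""
    try:
        check_frame(frame)
    except ZerosFrameError:
        return True
    return False


empty_rejected = raises(pd.DataFrame())
zeros_rejected = raises(pd.DataFrame({"S1": [0, 0], "S2": [0, 0]}))
counts_rejected = raises(pd.DataFrame({"S1": [0, 4], "S2": [0, 0]}))
```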