coalispr.bedgraph_analyze.bam_counters¶

Module with counters for extracting information from bamfiles.

Attributes¶

`logger`
`COLS`

Exceptions¶

ZerosFrameException

Common base class for all non-exit exceptions.

Classes¶

`BamCounterController`	Interface for counting; calls counter functions.
`BamRegionCountController`	Interface for counting particular regions; calls counter functions.
`Counter`	Interface class for making data-frame counters.
`BinCounter`	Class for making data-frame counters, with bin-regions in index.
`SkipCounter`	Class to keep track of skipped reads with imperfect alignments
`LengthCounter`	Class for making data-frame counters cataloging lengths.
`MultiMapCounter`	Class for making data-frame counters cataloging multimappers.
`RegionLengthCounter`	Class for making data-frame counters cataloging region counts

Functions¶

`set_cols`(colslist)	Define the list of samples that are counted.
`get_cols`()	Return the list of samples for which counts are gathered.
`count_frame`([dtype])	Base dataframe with multi-index and column-keys to store counts.

Module Contents¶

coalispr.bedgraph_analyze.bam_counters.logger¶

coalispr.bedgraph_analyze.bam_counters.COLS = []¶

coalispr.bedgraph_analyze.bam_counters.set_cols(colslist)¶

Define the list of samples that are counted.

Notes

The colslist is defined by the functions in coalispr.bedgraph_analyze.process_bamdata (process_bamfiles or process_reads_for_region) that create the count controllers (as beancounter).

Parameters: colslist (list) – List of samples that form the column index of the dataframes storing the counts.

coalispr.bedgraph_analyze.bam_counters.get_cols()¶: Return the list of samples for which counts are gathered.
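The setter/getter pair above can be pictured as a module-level list that is replaced once per counting run. The following is a minimal stand-in written for illustration, not coalispr's own code; the sample names are made up.

```python
# Hypothetical stand-in for the module-level sample list; not coalispr's code.
COLS = []


def set_cols(colslist):
    """Store the sample names that will form the column index."""
    global COLS
    COLS = list(colslist)


def get_cols():
    """Return the sample names currently registered."""
    return COLS


set_cols(["wt_1", "wt_2", "mutant_1"])  # made-up sample names
print(get_cols())
```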

coalispr.bedgraph_analyze.bam_counters.count_frame(dtype=float)¶

Base dataframe with multi-index and column-keys to store counts.

Parameters: dtype (pandas.dtype) – Datatype for the dataframe.
Returns: Empty dataframe to be filled when processing a bam file.
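As a rough illustration of an empty frame with a multi-index and sample columns, the sketch below builds one with plain pandas. The index level names (taken from the cntidx tuple described elsewhere on this page), the helper name `empty_count_frame` and the sample names are assumptions; the real count_frame takes only dtype and may be built differently.

```python
import pandas as pd


def empty_count_frame(cols, dtype=float):
    """Hypothetical helper: empty frame with a 3-level index and sample columns."""
    names = ["region", "binno", "binregion"]  # assumed level names
    index = pd.MultiIndex(levels=[[], [], []], codes=[[], [], []], names=names)
    return pd.DataFrame(index=index, columns=cols, dtype=dtype)


frame = empty_count_frame(["wt_1", "mutant_1"])  # made-up sample names
print(frame.index.names, list(frame.columns), frame.empty)
```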

class coalispr.bedgraph_analyze.bam_counters.BamCounterController¶

Interface for counting; calls counter functions.

Notes

Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here.

CNTREAD        = [LIBR, UNIQ, XTRA, UNSEL]  # MULMAP = LIBR - UNIQ
CNTCDNA        = [COLLR, UNIQ+COLLR]        # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
CNTGAP         = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
#CNTSKIP        = [SKIP]

CNTRS          = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]

LENREAD        = [LIBR, UNIQ, XTRA, UNSEL]
LENCDNA        = [COLLR, UNIQ+COLLR]
LENGAP         = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]

LENCNTRS       = [LENREAD, LENCDNA, LENGAP]

MMAPCNTRS      = [ [LIBR, INTR] ]
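The comments in the groups above note that multimapper counts are not stored separately but follow from other counters (MULMAP = LIBR - UNIQ). A small sketch of that subtraction with made-up numbers; the Series names only mirror the labels and are not coalispr objects.

```python
import pandas as pd

# Made-up counts for two samples; names mirror the labels in the notes above.
libr = pd.Series({"wt_1": 120.0, "mutant_1": 80.5}, name="LIBR")
uniq = pd.Series({"wt_1": 100.0, "mutant_1": 60.0}, name="UNIQ")

# Multimapper counts follow from the difference, aligned on sample name.
mulmap = (libr - uniq).rename("MULMAP")
print(mulmap)
```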

binners¶

sizers¶

mmappers¶

set_bincounts(cntidx, key)¶

Update bincounters with key-linked region counts.

Parameters:

cntidx (tuple) – (region, binno, binregion)
key (str) – Name of the Series (sample that is counted)

merge_lencounters(key)¶

Update length counter frames with key-linked info.

Parameters: key (str) – Name of the Series (sample that is counted)

skip_count(val=1)¶: Add to SkipCounter only.

report_skipped(key)¶: Feedback on missed counts for each sample/key.

update_strand_count(label, val, strand, lenreadidx)¶

Set in-read count for given counter

Parameters:

label (str) – Name for the counter to update.
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(label, val, strand, nhreadidx)¶

Set count for multimapper hit-number (NH) for given counter.

Parameters:

label (str) – Name for the counter to update.
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
nhreadidx (index) – Index describing hit/repeat-number to add counts to.
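To make the idea of counting per hit-number concrete, the sketch below tallies alignments by their NH value and strand. It assumes, purely for illustration, that each alignment of a read with NH hits contributes 1/NH; the weighting coalispr actually applies is not stated here. The strand labels and sample name are made up.

```python
from collections import defaultdict

import pandas as pd

# (NH, strand) for a handful of made-up alignments
alignments = [(1, "munr"), (2, "munr"), (2, "munr"), (4, "corb")]

nh_counts = {"munr": defaultdict(float), "corb": defaultdict(float)}
for nh, strand in alignments:
    nh_counts[strand][nh] += 1 / nh  # assumed fractional weighting

# One Series per strand, indexed by hit-number, named after the sample
munr = pd.Series(dict(nh_counts["munr"]), name="wt_1").rename_axis("repeats")
corb = pd.Series(dict(nh_counts["corb"]), name="wt_1").rename_axis("repeats")
print(munr)
```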

save_to_tsv(tsvpath, bins)¶

Save the counts to TSV files.

Parameters:

tsvpath (Path) – Path for folder to store files
bins (int) – Number of sections a counted region is split into.

class coalispr.bedgraph_analyze.bam_counters.BamRegionCountController(region, comparereads, strand)¶

Bases: BamCounterController

Interface for counting particular regions; calls counter functions.

Notes

Of the counter groups defined in `2_shared.txt` (SHARED), the ones needed here are:

REGCNTRS       = [ [LIBR, UNIQ],  [COLLR, UNIQ+COLLR] , CNTSKIP ]

region¶

comparereads¶

strand¶

binners¶

sizers¶

mmappers = None¶

merge_lencounters(key)¶

Update length counter frames with key-linked info.

Parameters: key (str) – Name of the Series (sample that is counted)

save_to_tsv(tsvpath, region, strand)¶

Save the counts to TSV files.

Parameters:

tsvpath (Path) – Path for folder to store files
region (str) – Formatted descriptor for counted genome span; f"{chrnam}_{region[0]}-{region[1]}".
strand (str) – One of COMBI, PLUS or MINUS.
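The region descriptor is a plain f-string built from the chromosome name and the span boundaries. A short example with made-up values:

```python
chrnam = "chr2"          # made-up chromosome name
region = (15000, 15800)  # made-up (start, end) of the counted span

descriptor = f"{chrnam}_{region[0]}-{region[1]}"
print(descriptor)
```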

get_lencount_frames()¶: Get dataframes with counts.

get_count_frames()¶: Get dataframe with counts. Add multimappers.

class coalispr.bedgraph_analyze.bam_counters.Counter(label)¶

Interface class for making data-frame counters.

Notes

Dataframes are used as counters because multimappers contribute fractional counts. Strand-specific counting is supported.

label¶

Name for BinCounter, from the CNTRS list

Type: str

dtype¶: Can be int, "Int64" (pd.Int64Dtype()) or float. Use int or float to let pandas choose the appropriate NumPy format. The nullable integer type "Int64" uses pd.NA for missing values instead of NaN (dtype float) after merging frames whose indexes do not overlap perfectly (see https://pandas.pydata.org/docs/user_guide/integer_na.html). For rounding with float and "Int64", a final astype(dtype) call is not needed for bin-counters, but it is needed for length-counters, or before saving to file when dtype is int. The float_format function (commented out) could be used instead. Pre-indexed frames do not lead to a speed-up.
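The difference between NaN and pd.NA after merging can be seen with two small Series whose indexes only partly overlap. This is a generic pandas illustration with made-up values, not code taken from coalispr.

```python
import pandas as pd

# Read-length counts for two made-up samples; index 20 and 22 do not overlap.
a = pd.Series({20: 3, 21: 5}, name="wt_1")
b = pd.Series({21: 2, 22: 7}, name="mutant_1")

# Plain int64 input: the gaps become NaN and the columns turn into float64.
as_int = pd.concat([a, b], axis=1)

# Nullable "Int64" input: the gaps become <NA> and the columns stay Int64.
as_nullable = pd.concat([a.astype("Int64"), b.astype("Int64")], axis=1)

print(as_int.dtypes)
print(as_nullable.dtypes)
```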

Parameters:

munrfram (pd.DataFrame) – Frame to hold counts for reads on munro-strand, divided over regions and bins (as index) with SHORT names as column keys.
corbfram (pd.DataFrame) – As munrfram but holding counts for reads from opposite strand.
munrcount (int or float) – Tracks munro-strand associated counts linked to region iterated; reset when a new region is iterated over.
corbcount (int or float) – Tracks corbett-strand associated counts for region iterated

update_strand_count(val, strand, lenreadidx)¶: Set in-read count

set_bincounts(cntidx, key)¶

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
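The interplay between update_strand_count and set_bincounts (accumulate per strand, store under the region index, reset) can be sketched with a deliberately simplified class. The class name, the strand labels and the internal dict storage are assumptions made for this example; coalispr's Counter works on dataframes directly and has a different signature.

```python
import pandas as pd


class TinyBinCounter:
    """Hypothetical, simplified counter; not coalispr's implementation."""

    def __init__(self):
        self.munrcount = 0.0  # running count for one strand
        self.corbcount = 0.0  # running count for the opposite strand
        self.munr = {}        # {key: {cntidx: count}}
        self.corb = {}

    def update_strand_count(self, val, strand):
        if strand == "munr":
            self.munrcount += val
        else:
            self.corbcount += val

    def set_bincounts(self, cntidx, key):
        # Store running counts under the (region, binno, binregion) index.
        self.munr.setdefault(key, {})[cntidx] = self.munrcount
        self.corb.setdefault(key, {})[cntidx] = self.corbcount
        # Reset so that the next bin starts from zero.
        self.munrcount = self.corbcount = 0.0

    def frame(self, strand):
        store = self.munr if strand == "munr" else self.corb
        # Tuple keys become a MultiIndex; sample keys become columns.
        return pd.DataFrame({k: pd.Series(v) for k, v in store.items()})


counter = TinyBinCounter()
counter.update_strand_count(1.0, "munr")
counter.update_strand_count(0.5, "munr")  # e.g. a fractional multimapper count
counter.update_strand_count(1.0, "corb")
counter.set_bincounts(("chr1_100-200", 1, "100-150"), "wt_1")
munr_frame = counter.frame("munr")
corb_frame = counter.frame("corb")
print(munr_frame)
```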

merge_lencounters(key)¶: Merge counters; pass here, only valid for LengthCounter.

save_to_tsv(tsvpath, bins)¶

Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
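Summing over bins and writing a lowercase-named TSV could look roughly like the sketch below. The index level names, the `save_frame` helper and the file name are assumptions chosen for the example; the file names coalispr really writes are not given on this page.

```python
import tempfile
from pathlib import Path

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("chr1_100-200", 1, "100-150"), ("chr1_100-200", 2, "150-200")],
    names=["region", "binno", "binregion"],  # assumed level names
)
binned = pd.DataFrame({"wt_1": [1.5, 2.0], "mutant_1": [0.0, 4.25]}, index=index)

# Collapse the bins of each region into one row of summed counts.
summed = binned.groupby(level="region").sum().round(2)


def save_frame(frame, tsvpath, name):
    """Hypothetical helper: write a frame as TSV under a lowercase file name."""
    path = Path(tsvpath) / f"{name.lower()}.tsv"
    frame.to_csv(path, sep="\t")
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = save_frame(summed, tmp, "LIBR_Summed")
    reloaded = pd.read_csv(written, sep="\t", index_col=0)

print(reloaded)
```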

class coalispr.bedgraph_analyze.bam_counters.BinCounter(label)¶

Bases: Counter

Class for making data-frame counters, with bin-regions in index.

dtype = 'float'¶

rnd = 2¶

update_strand_count(val, strand, lenreadidx)¶: Set in-read count; ignore lenreadidx.

set_bincounts(cntidx, key)¶

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).

save_to_tsv(tsvpath, bins)¶

Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).

class coalispr.bedgraph_analyze.bam_counters.SkipCounter(label)¶

Bases: Counter

Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads (should be none).

notfram¶

As munrfram or corbfram, but not stranded.

Type: pd.DataFrame

not_keycntr¶

As munr_keycntr or corb_keycntr, but not stranded.

Type: pd.Series

notcount¶

Skipped count linked to key (SHORT name for sample).

Type: int

dtype = 'Int64'¶

set_bincounts(cntidx, key)¶

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).

update_strand_count(val, strand, lenreadidx)¶: Set in-read count

skip_count(val)¶

Use one counter, independent of strand.

Parameters: val (int) – Value to add to gathered counts.

report_skipped(key)¶: Indicate how many reads have been skipped for sample key.

get_lencount_frame()¶

get_count_frame()¶: Organise basic counts into dataframe and return this.

save_to_tsv(tsvpath, arg, *args)¶

Save skipped read counts file; use lowercase filename.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
arg (int or str; default: None) – Number of bins (int) or region descriptor (str).
*args – Could be strand for region counting.

class coalispr.bedgraph_analyze.bam_counters.LengthCounter(label)¶

Bases: Counter

Class for making data-frame counters cataloging lengths.

Notes

pd.DataFrame is used instead of collections.Counter in order to keep the float nature of counts for multimapped reads.

munr_keycntr¶

Series to hold munro-strand associated counts with region info as index; named to SHORT name for counted sample; reset when merged to munrfram and another sample is counted.

Type: pd.Series

corb_keycntr¶

Series with all corbett-strand associated counts for each key.

Type: pd.Series

idxnam¶

Name for index column when dataframes get saved to TSV.

Type: str

cntlabl¶

Label to mark type of counts in saved file name.

Type: str

dtype¶

rnd = 2¶

set_bincounts(cntidx, key)¶

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).

merge_lencounters(key)¶

Merge key_length counters to count_frames then reset former.

Notes

Apply .astype(self.dtype) at the very end, not here, so that the type is set once. Otherwise, the type change that results from merging series with non-overlapping indices causes a TypeError when summing row values (casts to int64 (from count) and <NA> ("Int64"), or to NaN (float) and Float64 (from "Int64")).

Parameters: key (str) – SHORT name of sample as column index (key) to collect counts for.

update_strand_count(val, strand, lenreadidx)¶

Forward strand count to Series counter.

Parameters:

val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(val, strand, lenreadidx)¶

save_to_tsv(tsvpath, bins=None)¶

Save length frames for separate strands and samples, then for all samples combined. Use lowercase filenames.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).

class coalispr.bedgraph_analyze.bam_counters.MultiMapCounter(label)¶

Bases: LengthCounter

Class for making data-frame counters cataloging multimappers.

cntlabl = 'multimapper'¶

idxnam = 'repeats'¶

class coalispr.bedgraph_analyze.bam_counters.RegionLengthCounter(label)¶

Bases: Counter

Class for making data-frame counters cataloging region counts without strand information.

keycntr¶

Series to hold counts with region info as index; named to SHORT name for counted sample; reset when merged, for counting another sample.

Type: pd.Series

idxnam¶

Name for index column when dataframes get saved to TSV.

Type: str

cntlabl¶

Label to mark type of counts in saved file name.

Type: str

dtype¶

rnd = 2¶

cntlabl = 'readlength_counts'¶

idxnam = 'start length'¶

set_bincounts(cntidx, key)¶

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).

merge_lencounters(key)¶

Merge key_length counters to count_frames then reset former.

Parameters: key (str) – SHORT name of sample as column index (key) to collect counts for.

update_strand_count(val, strand, lenreadidx)¶

Forward strand count to Series counter; ignore strand.

Parameters:

val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from; not used.
lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(val, strand, lenreadidx)¶

get_lencount_frame()¶

get_count_frame()¶: Generate total counts for region by summing length counts.
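Deriving region totals from length counts amounts to a column-wise sum over the length index. The sketch below shows this with made-up numbers; the index name, the region label and the one-row layout of the result are assumptions, not the documented output of get_count_frame.

```python
import pandas as pd

# Made-up length counts: rows are read lengths, columns are samples.
lencounts = pd.DataFrame(
    {"wt_1": [3.0, 5.5, 0.0], "mutant_1": [0.0, 2.0, 7.25]},
    index=pd.Index([20, 21, 22], name="length"),
)

# Sum over all lengths per sample and present the totals as one row per region.
totals = lencounts.sum().to_frame(name="chr1_100-200").T
print(totals)
```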

save_to_tsv(tsvpath, region, strand)¶

Save length frames for separate samples. Use lowercase filenames.

Parameters:

tsvpath (Path) – Path to folder to store TSV file with count information.
region (str) – Formatted descriptor of genome span counted.
strand (str) – One of COMBI, PLUS or MINUS.

exception coalispr.bedgraph_analyze.bam_counters.ZerosFrameException¶

Bases: Exception

Common base class for all non-exit exceptions.