coalispr.bedgraph_analyze.bam_counters

Module with counters for extracting information from bamfiles.

Attributes

Exceptions

ZerosFrameException

Common base class for all non-exit exceptions.

Classes

BamCounterController

Interface for counting; calls counter functions.

BamRegionCountController

Interface for counting particular regions; calls counter functions.

Counter

Interface class for making data-frame counters.

BinCounter

Class for making data-frame counters, with bin-regions in index.

SkipCounter

Class to keep track of skipped reads with imperfect alignments.

LengthCounter

Class for making data-frame counters cataloging lengths.

MultiMapCounter

Class for making data-frame counters cataloging multimappers.

RegionLengthCounter

Class for making data-frame counters cataloging region counts without strand information.

Functions

set_cols(colslist)

Define the list of samples that are counted.

get_cols()

Return the list of samples for which counts are gathered.

count_frame([dtype])

Base dataframe with multi-index and column-keys to store counts.

Module Contents

coalispr.bedgraph_analyze.bam_counters.logger
coalispr.bedgraph_analyze.bam_counters.COLS = []
coalispr.bedgraph_analyze.bam_counters.set_cols(colslist)

Define the list of samples that are counted.

Notes

The colslist is defined by the functions in coalispr.bedgraph_analyze.process_bamdata (process_bamfiles or process_reads_for_region) that create the count controllers (as beancounter).

Parameters:

colslist (list) – List of samples that form column index of dataframes storing the counts.

coalispr.bedgraph_analyze.bam_counters.get_cols()

Return the list of samples for which counts are gathered.
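The setter/getter pair around the module-level COLS list can be pictured with the minimal sketch below. It is a stand-in written for illustration, not the coalispr source; the sample names are made up.

```python
# Hypothetical stand-in for the module-level state described above.
COLS = []


def set_cols(colslist):
    """Store the sample names that become the column index of count frames."""
    global COLS
    COLS = list(colslist)


def get_cols():
    """Return the sample names for which counts are gathered."""
    return COLS


set_cols(["sample_a", "sample_b"])  # made-up sample names
print(get_cols())
```

Keeping the list at module level lets every counter created later read the same column keys without passing them through each constructor.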

coalispr.bedgraph_analyze.bam_counters.count_frame(dtype=float)

Base dataframe with multi-index and column-keys to store counts.

Parameters:

dtype (pandas.dtype) – Datatype for dataframe

Return type:

Empty dataframe to be filled when processing bam file.
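As a rough illustration of such a base frame, the sketch below builds an empty pandas dataframe with a three-level index and one column per sample. The function name count_frame_sketch and the index level names (taken from the cntidx tuple described for set_bincounts) are assumptions, not the actual implementation.

```python
import pandas as pd


def count_frame_sketch(cols, dtype=float):
    """Return an empty frame with a three-level index and one column per sample."""
    # Level names are assumed from the (region, binno, binregion) tuple.
    index = pd.MultiIndex.from_arrays(
        [[], [], []], names=["region", "binno", "binregion"]
    )
    return pd.DataFrame(index=index, columns=cols, dtype=dtype)


frame = count_frame_sketch(["sample_a", "sample_b"])  # made-up sample names
print(frame.empty, list(frame.columns), list(frame.index.names))
```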

class coalispr.bedgraph_analyze.bam_counters.BamCounterController

Interface for counting; calls counter functions.

Notes

Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here.

CNTREAD        = [LIBR, UNIQ, XTRA, UNSEL]  # MULMAP = LIBR - UNIQ
CNTCDNA        = [COLLR, UNIQ+COLLR]        # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
CNTGAP         = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
#CNTSKIP        = [SKIP]

CNTRS          = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]

LENREAD        = [LIBR, UNIQ, XTRA, UNSEL]
LENCDNA        = [COLLR, UNIQ+COLLR]
LENGAP         = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]

LENCNTRS       = [LENREAD, LENCDNA, LENGAP]

MMAPCNTRS      = [ [LIBR, INTR] ]
binners
sizers
mmappers
set_bincounts(cntidx, key)

Update bincounters with key-linked region counts.

Parameters:
  • cntidx (tuple) – (region, binno, binregion)

  • key (str) – Name of the Series (sample that is counted)

merge_lencounters(key)

Update length counter frames with key-linked info.

Parameters:

key (str) – Name of the Series (sample that is counted)

skip_count(val=1)

Add to SkipCounter only.

report_skipped(key)

Feedback on missed counts for each sample/key.

update_strand_count(label, val, strand, lenreadidx)

Set in-read count for given counter

Parameters:
  • label (str) – Name for the counter to update.

  • val (int or float) – Value to add to gathered counts.

  • strand (str) – Name for strand where reads come from (MUNR or CORB).

  • lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(label, val, strand, nhreadidx)

Set count for multimapper hit-number (NH) for given counter.

Parameters:
  • label (str) – Name for the counter to update.

  • val (int or float) – Value to add to gathered counts.

  • strand (str) – Name for strand where reads come from (MUNR or CORB).

  • nhreadidx (index) – Index describing hit/repeat-number to add counts to.

save_to_tsv(tsvpath, bins)

Save the counts to TSV files.

Parameters:
  • tsvpath (Path) – Path for folder to store files

  • bins (int) – Number of sections a counted region is split into.

class coalispr.bedgraph_analyze.bam_counters.BamRegionCountController(region, comparereads, strand)

Bases: BamCounterController

Interface for counting particular regions; calls counter functions.

Notes

Of the counter groups defined in 2_shared.txt (SHARED), the following are needed here:

REGCNTRS       = [ [LIBR, UNIQ],  [COLLR, UNIQ+COLLR] , CNTSKIP ]
region
comparereads
strand
binners
sizers
mmappers = None
merge_lencounters(key)

Update length counter frames with key-linked info.

Parameters:

key (str) – Name of the Series (sample that is counted)

save_to_tsv(tsvpath, region, strand)

Save the counts to TSV files.

Parameters:
  • tsvpath (Path) – Path for folder to store files

  • region (str) – Formatted descriptor for counted genome span; f"{chrnam}_{region[0]}-{region[1]}".

  • strand (str) – One of COMBI, PLUS or MINUS.
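The region descriptor format mentioned above can be reproduced with a one-line helper. The helper name region_descriptor and the example values are hypothetical; only the f-string pattern comes from the parameter description.

```python
def region_descriptor(chrnam, region):
    """Format a chromosome name and (start, end) pair as 'chrnam_start-end'."""
    return f"{chrnam}_{region[0]}-{region[1]}"


print(region_descriptor("chr1", (100, 250)))  # made-up chromosome and span
```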

get_lencount_frames()

Get dataframes with counts.

get_count_frames()

Get dataframe with counts. Add multimappers.

class coalispr.bedgraph_analyze.bam_counters.Counter(label)

Interface class for making data-frame counters.

Notes

Dataframes are used as counters to accommodate fractional counts for multimappers. Strand-specific counting can be facilitated.

label

Name for BinCounter, from the CNTRS list

Type:

str

dtype

Can be int, "Int64" (pd.Int64Dtype()) or float. Use int and float to let pandas choose the appropriate NumPy format. The nullable integer "Int64" uses pd.NA for missing values instead of NaN (dtype float) after merging frames without perfect index overlap (see https://pandas.pydata.org/docs/user_guide/integer_na.html). For rounding with float and Int64, calling astype(dtype) at the end is not needed for bin-counters, but it is needed for length-counters or, in the case of int, before saving to file. The float_format function (commented out) could be used instead. Pre-indexed frames do not lead to a speed-up.
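The dtype behaviour described for this attribute can be seen in a small pandas experiment. The sketch below uses made-up read lengths and sample names; it shows how plain integers turn into float with NaN when indices only partly overlap, while "Int64" keeps integers and marks gaps with pd.NA.

```python
import pandas as pd

# Two per-sample counters whose indices (read lengths) overlap only at 21.
a = pd.Series([3, 1], index=[20, 21], name="sample_a")
b = pd.Series([2, 5], index=[21, 22], name="sample_b")

# Plain integers: the missing cells force a cast to float with NaN.
as_int = pd.concat([a, b], axis=1)

# Nullable integers: the columns stay Int64 and gaps become pd.NA.
as_nullable = pd.concat([a.astype("Int64"), b.astype("Int64")], axis=1)

print(as_int.dtypes)
print(as_nullable.dtypes)
```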

Parameters:
  • munrfram (pd.DataFrame) – Frame to hold counts for reads on munro-strand, divided over regions and bins (as index) with SHORT names as column keys.

  • corbfram (pd.DataFrame) – As munrfram but holding counts for reads from opposite strand.

  • munrcount (int or float) – Tracks munro-strand associated counts linked to region iterated; reset when a new region is iterated over.

  • corbcount (int or float) – Tracks corbett-strand associated counts for region iterated

label
update_strand_count(val, strand, lenreadidx)

Set in-read count

set_bincounts(cntidx, key)

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:
  • cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

  • key (str) – Name of the Series (sample that is counted).
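The accumulate, commit and reset cycle described for set_bincounts might look like the sketch below. MiniBinCounter is a hypothetical, dictionary-based stand-in (the real counters use dataframes), and the strand strings "munr" and "corb" are assumed values for the MUNR and CORB constants.

```python
class MiniBinCounter:
    """Toy counter: gather per-strand counts, then file them under an index."""

    def __init__(self, label):
        self.label = label
        # Committed counts per strand, keyed by (cntidx, key).
        self.frames = {"munr": {}, "corb": {}}
        # Running counts for the region currently iterated over.
        self.counts = {"munr": 0.0, "corb": 0.0}

    def update_strand_count(self, val, strand):
        """Add val to the running count of the given strand."""
        self.counts[strand] += val

    def set_bincounts(self, cntidx, key):
        """Commit running counts under (cntidx, key), then reset them."""
        for strand in self.counts:
            self.frames[strand][(cntidx, key)] = self.counts[strand]
            self.counts[strand] = 0.0


counter = MiniBinCounter("library")       # made-up label
counter.update_strand_count(1, "munr")
counter.update_strand_count(0.5, "munr")  # fractional count of a multimapper
counter.update_strand_count(1, "corb")
counter.set_bincounts(("chr1_100-250", 1, "100-175"), "sample_a")
print(counter.frames["munr"], counter.counts)
```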

merge_lencounters(key)

Merge counters; pass here, only valid for LengthCounter.

save_to_tsv(tsvpath, bins)

Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • bins (int (default: None)) – Number of bins into which a segment with counts is split, to detect preferential accumulation of reads (near 5’ or 3’, for example).
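To illustrate why a segment is split into bins, the sketch below assigns read positions to equal-width bins and tallies them. The helper bin_for_position, the half-open interval and the 1-based bin numbering are assumptions made for this example and may differ from how coalispr defines its bins.

```python
def bin_for_position(pos, start, end, bins):
    """Return a 1-based bin number for pos within the half-open span [start, end)."""
    if not start <= pos < end:
        raise ValueError("position outside region")
    width = (end - start) / bins
    return min(bins, int((pos - start) / width) + 1)


# Made-up read start positions on a 100 nt segment split into 4 bins.
positions = [2, 5, 10, 60, 97]
per_bin = {}
for pos in positions:
    binno = bin_for_position(pos, 0, 100, 4)
    per_bin[binno] = per_bin.get(binno, 0) + 1

print(per_bin)
```

In this toy data most reads fall in the first bin, which is the kind of skew toward one end of a segment that per-bin counts are meant to reveal.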

class coalispr.bedgraph_analyze.bam_counters.BinCounter(label)

Bases: Counter

Class for making data-frame counters, with bin-regions in index.

dtype = 'float'
rnd = 2
update_strand_count(val, strand, lenreadidx)

Set in-read count; ignore lenreadidx.

set_bincounts(cntidx, key)

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:
  • cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

  • key (str) – Name of the Series (sample that is counted).

save_to_tsv(tsvpath, bins)

Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • bins (int (default: None)) – Number of bins into which a segment with counts is split, to detect preferential accumulation of reads (near 5’ or 3’, for example).

class coalispr.bedgraph_analyze.bam_counters.SkipCounter(label)

Bases: Counter

Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads (should be none).

notfram

As munrfram or corbfram, but not stranded.

Type:

pd.DataFrame

not_keycntr

As munr_keycntr or corb_keycntr, but not stranded.

Type:

pd.Series

notcount

Skipped count linked to key (SHORT name for sample).

Type:

int

dtype = 'Int64'
set_bincounts(cntidx, key)

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:
  • cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

  • key (str) – Name of the Series (sample that is counted).

update_strand_count(val, strand, lenreadidx)

Set in-read count

skip_count(val)

Use one counter, independent of strand.

Parameters:

val (int) – Value to add to gathered counts.

report_skipped(key)

Indicate how many reads have been skipped for sample key.

get_lencount_frame()
get_count_frame()

Organise basic counts into dataframe and return this.

save_to_tsv(tsvpath, arg, *args)

Save skipped read counts file; use lowercase filename.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • arg (int (when bins); or str (when region); default: None)

  • *args (could be strand for region counting)

class coalispr.bedgraph_analyze.bam_counters.LengthCounter(label)

Bases: Counter

Class for making data-frame counters cataloging lengths.

Notes

pd.DataFrame is used instead of collections.Counter in order to keep the float nature of counts for multimapped reads.
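A small example may help to picture fractional length counts. The sketch below assumes that each alignment of a read with NH hits contributes 1/NH to the count for its read length; that weighting and the alignment data are assumptions for illustration, not taken from the coalispr source.

```python
import pandas as pd

# (read length, NH) for a few hypothetical alignments.
alignments = [(21, 1), (21, 2), (22, 4), (21, 2)]

frame = pd.DataFrame(alignments, columns=["length", "nh"])
# Assumed weighting: a read with NH hits adds 1/NH per alignment.
frame["weight"] = 1 / frame["nh"]

# Sum the fractional weights per read length.
length_counts = frame.groupby("length")["weight"].sum()
print(length_counts)
```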

munr_keycntr

Series to hold munro-strand associated counts with region info as index; named to SHORT name for counted sample; reset when merged to munrfram and another sample is counted.

Type:

pd.Series

corb_keycntr

Series with all corbett-strand associated counts for each key.

Type:

pd.Series

idxnam

Name for index column when dataframes get saved to TSV.

Type:

str

cntlabl

Label to mark type of counts in saved file name.

Type:

str

dtype
rnd = 2
set_bincounts(cntidx, key)

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:
  • cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

  • key (str) – Name of the Series (sample that is counted).

merge_lencounters(key)

Merge key_length counters to count_frames then reset former.

Notes

Apply .astype(self.dtype) at the very end rather than here, so that the type is set once. Otherwise, the type change that results from merging series with non-overlapping indices causes a TypeError when summing row-values (cast to int64 (from count) and <NA> ("Int64"), or to NaN (float) and Float64 (from "Int64")).

Parameters:

key (str) – SHORT name of sample as column index (key) to collect counts for.
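The merge-then-cast order recommended in the note can be sketched as follows. The per-sample data are made up, and collecting the Series in a list before a single pd.concat is a simplification chosen for this example rather than the way the class merges its counters.

```python
import pandas as pd

# Hypothetical length counts per sample; indices overlap only at 21.
per_sample = {
    "sample_a": {20: 1.0, 21: 2.5},
    "sample_b": {21: 1.0, 23: 0.5},
}

merged = []
for key, counts in per_sample.items():
    keycntr = pd.Series(counts, name=key)  # counter for one sample
    merged.append(keycntr)                 # merge; a fresh Series follows

# Cast once, after all samples are merged, so every column has one dtype.
frame = pd.concat(merged, axis=1).sort_index().astype(float).round(2)
totals = frame.sum(axis=1)  # NaN cells are skipped when summing rows
print(frame)
print(totals)
```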

update_strand_count(val, strand, lenreadidx)

Forward strand count to Series counter.

Parameters:
  • val (int or float) – Value to add to gathered counts.

  • strand (str) – Name for strand where reads come from (MUNR or CORB).

  • lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(val, strand, lenreadidx)
save_to_tsv(tsvpath, bins=None)

Save length frames for separate strands and samples, then for all samples combined. Use lowercase filenames.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • bins (int (default: None)) – Number of bins into which a segment with counts is split, to detect preferential accumulation of reads (near 5’ or 3’, for example).

class coalispr.bedgraph_analyze.bam_counters.MultiMapCounter(label)

Bases: LengthCounter

Class for making data-frame counters cataloging multimappers.

cntlabl = 'multimapper'
idxnam = 'repeats'
class coalispr.bedgraph_analyze.bam_counters.RegionLengthCounter(label)

Bases: Counter

Class for making data-frame counters cataloging region counts without strand information.

keycntr

Series to hold counts with region info as index; named to SHORT name for counted sample; reset when merged, for counting another sample.

Type:

pd.Series

idxnam

Name for index column when dataframes get saved to TSV.

Type:

str

cntlabl

Label to mark type of counts in saved file name.

Type:

str

dtype
rnd = 2
cntlabl = 'readlength_counts'
idxnam = 'start length'
set_bincounts(cntidx, key)

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:
  • cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

  • key (str) – Name of the Series (sample that is counted).

merge_lencounters(key)

Merge key_length counters to count_frames then reset former.

Parameters:

key (str) – SHORT name of sample as column index (key) to collect counts for.

update_strand_count(val, strand, lenreadidx)

Forward strand count to Series counter; ignore strand.

Parameters:
  • val (int or float) – Value to add to gathered counts.

  • strand (str) – Name for strand where reads come from; not used.

  • lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(val, strand, lenreadidx)
get_lencount_frame()
get_count_frame()

Generate total counts for region by summing length counts.

save_to_tsv(tsvpath, region, strand)

Save length frames for separate samples. Use lowercase filenames.

Parameters:
  • tsvpath (Path) – Path to folder to store TSV file with count information.

  • region (str) – Formatted descriptor of genome span counted.

  • strand (str) – One of COMBI, PLUS or MINUS.

exception coalispr.bedgraph_analyze.bam_counters.ZerosFrameException

Bases: Exception

Common base class for all non-exit exceptions.