coalispr.bedgraph_analyze.bam_counters¶
Module with counters for extracting information from bamfiles.
Attributes¶
Exceptions¶
ZerosFrameException | Common base class for all non-exit exceptions.
Classes¶
BamCounterController | Interface for counting; calls counter functions.
BamRegionCountController | Interface for counting particular regions; calls counter functions.
Counter | Interface class for making data-frame counters.
BinCounter | Class for making data-frame counters, with bin-regions in index.
SkipCounter | Class to keep track of skipped reads with imperfect alignments
LengthCounter | Class for making data-frame counters cataloging lengths.
MultiMapCounter | Class for making data-frame counters cataloging multimappers.
RegionLengthCounter | Class for making data-frame counters cataloging region counts
Functions¶
set_cols(colslist) | Define the list of samples that are counted.
get_cols() | Return the list of samples for which counts are gathered.
count_frame(dtype=float) | Base dataframe with multi-index and column-keys to store counts.
Module Contents¶
- coalispr.bedgraph_analyze.bam_counters.logger¶
- coalispr.bedgraph_analyze.bam_counters.COLS = []¶
- coalispr.bedgraph_analyze.bam_counters.set_cols(colslist)¶
Define the list of samples that are counted.
Notes
The colslist is defined by functions (process_bamfiles or process_reads_for_region) creating the count controllers (as beancounter), in coalispr.bedgraph_analyze.process_bamdata.
- Parameters:
colslist (list) – List of samples that form column index of dataframes storing the counts.
- coalispr.bedgraph_analyze.bam_counters.get_cols()¶
Return the list of samples for which counts are gathered.
- coalispr.bedgraph_analyze.bam_counters.count_frame(dtype=float)¶
Base dataframe with multi-index and column-keys to store counts.
- Parameters:
dtype (pandas.dtype) – Datatype for dataframe
- Return type:
Empty dataframe to be filled when processing bam file.
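The three functions above can be pictured with a minimal sketch. This is not the coalispr source: the private `_COLS` name, the sample names and the index level names (taken from the `cntidx` tuple described further down) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the module-level list of sample names.
_COLS = []


def set_cols(colslist):
    """Remember the sample names that become the column index."""
    global _COLS
    _COLS = list(colslist)


def get_cols():
    """Return the sample names for which counts are gathered."""
    return _COLS


def count_frame(dtype=float):
    """Build an empty frame: three-level row index, one column per sample."""
    index = pd.MultiIndex.from_arrays(
        [[], [], []], names=["region", "binno", "binregion"]  # assumed names
    )
    return pd.DataFrame(index=index, columns=get_cols(), dtype=dtype)


set_cols(["wt_1", "wt_2", "mut_1"])  # made-up sample names
frame = count_frame()
```

The frame starts without rows; rows would be added while a bam file is processed.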
- class coalispr.bedgraph_analyze.bam_counters.BamCounterController¶
Interface for counting; calls counter functions.
Notes
Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here.

    CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]  # MULMAP = LIBR - UNIQ
    CNTCDNA = [COLLR, UNIQ+COLLR]  # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
    CNTGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    #CNTSKIP = [SKIP]
    CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]
    LENREAD = [LIBR, UNIQ, XTRA, UNSEL]
    LENCDNA = [COLLR, UNIQ+COLLR]
    LENGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    LENCNTRS = [LENREAD, LENCDNA, LENGAP]
    MMAPCNTRS = [ [LIBR, INTR] ]
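The comments in the counter-group listing state that multimapper totals are not counted directly but follow from other counters (MULMAP = LIBR - UNIQ). A small sketch of that arithmetic, assuming the labels are plain strings that concatenate with `+`; the label values and the numbers are placeholders, the real ones come from the configuration files.

```python
# Placeholder label strings (assumed); real values are set in the config files.
LIBR, UNIQ, XTRA, UNSEL = "library", "unique", "extra", "unselected"
COLLR = "collapsed"

CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
CNTCDNA = [COLLR, UNIQ + COLLR]


def multimapper_counts(counts):
    """Derive multimapper totals from the counters that are kept.

    Follows the comments in the listing:
    MULMAP = LIBR - UNIQ and COLLR+MULMAP = COLLR - (UNIQ+COLLR).
    """
    return {
        "mulmap": counts[LIBR] - counts[UNIQ],
        "collapsed_mulmap": counts[COLLR] - counts[UNIQ + COLLR],
    }


# Made-up totals for one sample.
counts = {LIBR: 1000, UNIQ: 750, COLLR: 400, UNIQ + COLLR: 310}
derived = multimapper_counts(counts)
```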
- binners¶
- sizers¶
- mmappers¶
- set_bincounts(cntidx, key)¶
Update bincounters with key-linked region counts.
- Parameters:
cntidx (tuple) – (region, binno, binregion)
key (str) – Name of the Series (sample that is counted)
- merge_lencounters(key)¶
Update length counter frames with key-linked info.
- Parameters:
key (str) – Name of the Series (sample that is counted)
- skip_count(val=1)¶
Add to SkipCounter only.
- report_skipped(key)¶
Feedback on missed counts for each sample/key.
- update_strand_count(label, val, strand, lenreadidx)¶
Set in-read count for given counter
- Parameters:
label (str) – Name for the counter to update.
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
lenreadidx (index) – Index describing readlength to add counts to.
- update_multimap_count(label, val, strand, nhreadidx)¶
Set count for multimapper hit-number (NH) for given counter.
- Parameters:
label (str) – Name for the counter to update.
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
nhreadidx (index) – Index describing hit/repeat-number to add counts to.
- save_to_tsv(tsvpath, bins)¶
Save the counts to TSV files.
- Parameters:
tsvpath (Path) – Path for folder to store files
bins (int) – Number of sections a counted region is split into.
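The controller is described as an interface that calls counter functions, with `label` naming the counter to update. One way to picture this is a mapping from label to counter object. The classes below are hypothetical stand-ins, not the coalispr classes, and the strand strings are placeholders for the MUNR and CORB constants.

```python
class TinyCounter:
    """Hypothetical counter: keeps one running total per strand."""

    def __init__(self, label):
        self.label = label
        self.totals = {"munro": 0.0, "corbett": 0.0}

    def update_strand_count(self, val, strand, lenreadidx=None):
        # lenreadidx is ignored here, as in a bin-type counter.
        self.totals[strand] += val


class TinyController:
    """Hypothetical controller: forwards an update to the labelled counter."""

    def __init__(self, labels):
        self.binners = {label: TinyCounter(label) for label in labels}

    def update_strand_count(self, label, val, strand, lenreadidx=None):
        self.binners[label].update_strand_count(val, strand, lenreadidx)


controller = TinyController(["library", "unique"])
controller.update_strand_count("library", 1, "munro")
controller.update_strand_count("library", 0.5, "corbett")
controller.update_strand_count("unique", 1, "munro")
```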
- class coalispr.bedgraph_analyze.bam_counters.BamRegionCountController(region, comparereads, strand)¶
Bases:
BamCounterController
Interface for counting particular regions; calls counter functions.
Notes
Of the counter groups defined in 2_shared.txt (SHARED), those needed here are:
REGCNTRS = [ [LIBR, UNIQ], [COLLR, UNIQ+COLLR] , CNTSKIP ]
- region¶
- comparereads¶
- strand¶
- binners¶
- sizers¶
- mmappers = None¶
- merge_lencounters(key)¶
Update length counter frames with key-linked info.
- Parameters:
key (str) – Name of the Series (sample that is counted)
- save_to_tsv(tsvpath, region, strand)¶
Save the counts to TSV files.
- Parameters:
tsvpath (Path) – Path for folder to store files
region (str) – Formatted descriptor for counted genome span; f"{chrnam}_{region[0]}-{region[1]}".
strand (str) – One of COMBI, PLUS or MINUS.
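The region descriptor format given for the `region` parameter can be tried out directly. The helper name and the coordinates below are made up for illustration.

```python
def region_descriptor(chrnam, region):
    """Format a genome span as '<chromosome>_<start>-<end>'."""
    return f"{chrnam}_{region[0]}-{region[1]}"


label = region_descriptor("chr2", (15000, 15750))  # made-up coordinates
```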
- get_lencount_frames()¶
Get dataframes with counts.
- get_count_frames()¶
Get dataframe with counts. Add multimappers.
- class coalispr.bedgraph_analyze.bam_counters.Counter(label)¶
Interface class for making data-frame counters.
Notes
Dataframes are used as counters because multimappers contribute fractional counts. Strand-specific counting is supported.
- label¶
Name for BinCounter, from the CNTRS list
- Type:
str
- dtype¶
Can be int, "Int64" (pd.Int64Dtype()) or float. To let pandas choose the appropriate NumPy format use int and float; the nullable integer "Int64" takes pd.NA for a missing value instead of NaN (dtype float) after merging frames without perfect index overlap (see https://pandas.pydata.org/docs/user_guide/integer_na.html). For rounding, with float and Int64, calling astype(dtype) is not needed at the end for bin-counters, but it is needed for length-counters or, in the case of int, before saving to file. The float_format function (commented out) could be used instead. Pre-indexed frames do not lead to a speed-up.
- Parameters:
munrfram (pd.DataFrame) – Frame to hold counts for reads on munro-strand, divided over regions and bins (as index) with SHORT names as column keys.
corbfram (pd.DataFrame) – As munrfram but holding counts for reads from opposite strand.
munrcount (int or float) – Tracks munro-strand associated counts linked to region iterated; reset when a new region is iterated over.
corbcount (int or float) – Tracks corbett-strand associated counts for region iterated
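The dtype remarks above concern what happens to missing values when columns with different indices are merged. The following sketch, using plain pandas and made-up sample names, shows the difference between the nullable "Int64" and NumPy-backed int; it is an illustration of the pandas behaviour, not code from the module.

```python
import pandas as pd


def merge_columns(dtype):
    """Join two per-sample series whose indices only partly overlap."""
    first = pd.Series([3, 1], index=[20, 21], name="sample_a", dtype=dtype)
    second = pd.Series([2, 5], index=[21, 22], name="sample_b", dtype=dtype)
    return pd.concat([first, second], axis=1)


nullable = merge_columns("Int64")  # gaps become pd.NA, columns stay Int64
plain = merge_columns(int)         # gaps become NaN, columns turn into float

# Casting once, at the very end, restores a definite integer type.
final = plain.fillna(0).astype(int)
```

With "Int64" the integer type survives the merge, which is why a final cast is only needed for the NumPy-backed variants.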
- label¶
- update_strand_count(val, strand, lenreadidx)¶
Set in-read count
- set_bincounts(cntidx, key)¶
Set region-index for key to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
- merge_lencounters(key)¶
Merge counters; pass here, only valid for LengthCounter.
- save_to_tsv(tsvpath, bins)¶
Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
- class coalispr.bedgraph_analyze.bam_counters.BinCounter(label)¶
Bases:
Counter
Class for making data-frame counters, with bin-regions in index.
- dtype = 'float'¶
- rnd = 2¶
- update_strand_count(val, strand, lenreadidx)¶
Set in-read count (ignore lenreadidx).
- set_bincounts(cntidx, key)¶
Set region-index for key to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
- save_to_tsv(tsvpath, bins)¶
Save stranded and combined counts to files; sum over bins and save summed counts to separate files; use lowercase filenames.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
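The description of save_to_tsv (stranded and combined counts, plus sums over bins) can be sketched with plain pandas. The frames, sample names and region labels are invented, and the in-memory buffer stands in for a file under `tsvpath`; the actual file layout written by coalispr is not shown here.

```python
import io

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("chr1_100-400", 1, "100-250"),
        ("chr1_100-400", 2, "250-400"),
        ("chr2_50-350", 1, "50-200"),
        ("chr2_50-350", 2, "200-350"),
    ],
    names=["region", "binno", "binregion"],  # assumed level names
)
munro = pd.DataFrame(
    {"sample_a": [4.0, 1.5, 0.0, 2.0], "sample_b": [3.0, 0.5, 1.0, 1.0]},
    index=index,
)
corbett = pd.DataFrame(
    {"sample_a": [1.0, 0.5, 2.0, 0.0], "sample_b": [0.0, 0.5, 0.0, 3.0]},
    index=index,
)

# Combined counts: add both strands, bin by bin.
combined = munro.add(corbett, fill_value=0)

# Summed counts: collapse the bins of each region into one row.
per_region = combined.groupby(level="region").sum().round(2)

# Write tab-separated text; a real run would write to a file instead.
buffer = io.StringIO()
per_region.to_csv(buffer, sep="\t")
tsv_text = buffer.getvalue()
```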
- class coalispr.bedgraph_analyze.bam_counters.SkipCounter(label)¶
Bases:
Counter
Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads (should be none).
- notfram¶
As munrfram or corbfram, but not stranded.
- Type:
pd.DataFrame
- not_keycntr¶
As munr_keycntr or corb_keycntr, but not stranded.
- Type:
pd.Series
- notcount¶
Skipped count linked to key (SHORT name for sample).
- Type:
int
- dtype = 'Int64'¶
- set_bincounts(cntidx, key)¶
Set region-index for key to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
- update_strand_count(val, strand, lenreadidx)¶
Set in-read count
- skip_count(val)¶
Use one counter, independent of strand.
- Parameters:
val (int) – Value to add to gathered counts.
- report_skipped(key)¶
Indicate how many reads have been skipped for sample key.
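A skip counter that ignores strand can be sketched as a single running total that is reported per sample. The class below is a hypothetical stand-in; in particular, resetting the total after reporting and the wording of the message are assumptions.

```python
class TinySkipCounter:
    """Hypothetical unstranded counter for skipped reads."""

    def __init__(self):
        self.notcount = 0   # running total for the sample being counted
        self.skipped = {}   # totals stored per sample key

    def skip_count(self, val=1):
        """Add to the single, strand-independent total."""
        self.notcount += val

    def report_skipped(self, key):
        """Store and describe the total for sample `key`, then reset."""
        self.skipped[key] = self.notcount
        message = f"{key}: {self.notcount} reads skipped"
        self.notcount = 0  # assumed reset before the next sample
        return message


counter = TinySkipCounter()
counter.skip_count()
counter.skip_count(2)
message = counter.report_skipped("sample_a")  # made-up sample name
```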
- get_lencount_frame()¶
- get_count_frame()¶
Organise basic counts into dataframe and return this.
- save_to_tsv(tsvpath, arg, *args)¶
Save skipped read counts file; use lowercase filename.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
arg (int or str; default: None) – Number of bins (int) or region descriptor (str), depending on the kind of counting.
*args – Further positional arguments; could be the strand for region counting.
- class coalispr.bedgraph_analyze.bam_counters.LengthCounter(label)¶
Bases:
Counter
Class for making data-frame counters cataloging lengths.
Notes
pd.DataFrame is used instead of collections.Counter in order to keep the float nature of counts for multimapped reads.
- munr_keycntr¶
Series to hold munro-strand associated counts with region info as index; named to SHORT name for counted sample; reset when merged to munrfram and another sample is counted.
- Type:
pd.Series
- corb_keycntr¶
Series with all corbett-strand associated counts for each key.
- Type:
pd.Series
- idxnam¶
Name for index column when dataframes get saved to TSV.
- Type:
str
- cntlabl¶
Label to mark type of counts in saved file name.
- Type:
str
- dtype¶
- rnd = 2¶
- set_bincounts(cntidx, key)¶
Set region-index for key to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
- merge_lencounters(key)¶
Merge key_length counters to count_frames then reset former.
Notes
Include .astype(self.dtype) at the very end, not here, to have a set type; otherwise a type change from merging series with non-overlapping indices causes a TypeError when summing row-values (cast to int64 (from count) and <NA> ("Int64"), or to NaN (float) and Float64 (from "Int64")).
- Parameters:
key (str) – SHORT name of sample as column index (key) to collect counts for.
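The merge step described above (a per-sample Series is merged into the count frame, and the type is only fixed at the very end) might look as follows. This is a sketch with plain pandas: the helper name, sample names and values are invented, and the outer join is one possible way of merging, not necessarily the one used in the module.

```python
import pandas as pd


def length_series(key, hits):
    """Sum (read length, value) pairs into a Series named after the sample."""
    totals = {}
    for length, val in hits:
        totals[length] = totals.get(length, 0.0) + val
    return pd.Series(totals, name=key, dtype=float)


# Made-up hits: (read length, count value) per sample.
samples = {
    "sample_a": [(21, 1.0), (22, 0.5), (22, 0.5)],
    "sample_b": [(22, 1.0), (23, 0.25)],
}

frame = None
for key, hits in samples.items():
    keycntr = length_series(key, hits)  # fresh counter for every sample
    if frame is None:
        frame = keycntr.to_frame()
    else:
        frame = frame.join(keycntr, how="outer")

# Lengths missing for a sample are NaN after merging;
# fill, round and cast only once, at the very end.
final = frame.fillna(0).round(2).astype(float)
```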
- update_strand_count(val, strand, lenreadidx)¶
Forward strand count to Series counter.
- Parameters:
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
lenreadidx (index) – Index describing readlength to add counts to.
- update_multimap_count(val, strand, lenreadidx)¶
- save_to_tsv(tsvpath, bins=None)¶
Save length frames for separate strands and samples, then for all samples combined. Use lowercase filenames.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
bins (int (default: None)) – Number of bins to split segment with counts into to detect preferential accumulation of reads (near 5’ or 3’ for example).
- class coalispr.bedgraph_analyze.bam_counters.MultiMapCounter(label)¶
Bases:
LengthCounter
Class for making data-frame counters cataloging multimappers.
- cntlabl = 'multimapper'¶
- idxnam = 'repeats'¶
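A multimapper counter indexed by 'repeats' can be pictured as a tally keyed by hit number (NH). The sketch below assumes that each alignment of a read with NH hits contributes 1/NH; that weighting is an assumption for illustration and is not stated in this reference. The NH values are made up.

```python
from collections import defaultdict

# Made-up NH values, one per alignment record.
hit_numbers = [1, 1, 2, 2, 4]

repeats = defaultdict(float)
for nh in hit_numbers:
    # Assumed weighting: an alignment of a read with NH hits counts as 1/NH.
    repeats[nh] += 1 / nh

tally = dict(repeats)
```

The fractional values in such a tally are the reason the counters are kept as float data.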
- class coalispr.bedgraph_analyze.bam_counters.RegionLengthCounter(label)¶
Bases:
Counter
Class for making data-frame counters cataloging region counts without strand information.
- keycntr¶
Series to hold counts with region info as index; named to SHORT name for counted sample; reset when merged, for counting another sample.
- Type:
pd.Series
- idxnam¶
Name for index column when dataframes get saved to TSV.
- Type:
str
- cntlabl¶
Label to mark type of counts in saved file name.
- Type:
str
- dtype¶
- rnd = 2¶
- cntlabl = 'readlength_counts'¶
- idxnam = 'start length'¶
- set_bincounts(cntidx, key)¶
Set region-index for key to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
- merge_lencounters(key)¶
Merge key_length counters to count_frames then reset former.
- Parameters:
key (str) – SHORT name of sample as column index (key) to collect counts for.
- update_strand_count(val, strand, lenreadidx)¶
Forward strand count to Series counter; ignore strand.
- Parameters:
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from; not used.
lenreadidx (index) – Index describing readlength to add counts to.
- update_multimap_count(val, strand, lenreadidx)¶
- get_lencount_frame()¶
- get_count_frame()¶
Generate total counts for region by summing length counts.
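Summing length counts into a total per sample is a one-line operation in pandas. The frame below is invented, and the row label `region_total` is a hypothetical name.

```python
import pandas as pd

# Made-up length counts for one region: rows are read lengths.
lengths = pd.DataFrame(
    {"sample_a": [2.0, 5.5, 1.0], "sample_b": [0.0, 3.0, 4.5]},
    index=pd.Index([20, 21, 22], name="length"),
)

# One row with the total for each sample (hypothetical row label).
totals = lengths.sum(axis=0).to_frame(name="region_total").T
```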
- save_to_tsv(tsvpath, region, strand)¶
Save length frames for separate samples. Use lowercase filenames.
- Parameters:
tsvpath (Path) – Path to folder to store TSV file with count information.
region (str) – Formatted descriptor of genome span counted.
strand (str) – One of COMBI, PLUS or MINUS.
- exception coalispr.bedgraph_analyze.bam_counters.ZerosFrameException¶
Bases:
Exception
Common base class for all non-exit exceptions.