coalispr.bedgraph_analyze.bam_sample_counter_counters¶
Module with counters for extracting information from bamfiles.
import pandas as pd
from collections import Counter
ctr = Counter()
ctr["a"] += 1
ctr["b"] += 1
ctr["a"] += 1
ctr["a"] += 1
ctr["b"] += 1
ctr["a"] += 1
ctr["c"] += 1
# Create dataframe df from Counter:
df = pd.DataFrame({"ctr": ctr})
df
#    ctr
# a    4
# b    2
# c    1
Use collections.Counter instead of pd.DataFrame when only one sample/column is relevant for each different counter.
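Attributes¶
logger
Classes¶
Bam_sample_base_controller | Interface for counting; calls counter functions.
Bam_sample_counter_controller | Interface for counting; calls counter functions.
Bam_sample_region_counter_controller | Interface for counting particular regions in a sample; calls counter functions.
Sample_counter | Interface class for making counters to build dataframes from.
Sample_bin_counter | Class for making dict counters, with bin-regions in keys for counts.
Sample_skip_counter | Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads (should be none).
Sample_length_counter | Class for making data-frame counters cataloging lengths.
Sample_multimap_counter | Class for making data-frame counters cataloging multimappers.
Sample_region_length_counter | Counters cataloging region counts without strand information.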
Module Contents¶
- coalispr.bedgraph_analyze.bam_sample_counter_counters.logger¶
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_base_controller(key)¶
Interface for counting; calls counter functions.
Notes
Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here:
    # bin counters
    CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
    # MULMAP = LIBR - UNIQ
    CNTCDNA = [COLLR, UNIQ+COLLR]
    # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
    CNTGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    #CNTSKIP = [SKIP]
    CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]
    # readlength counters
    LENREAD = [LIBR, UNIQ, XTRA, UNSEL]
    LENCDNA = [COLLR, UNIQ+COLLR]
    LENGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    LENCNTRS = [LENREAD, LENCDNA, LENGAP]
    # multimapper counters
    MMAPCNTRS = [ [LIBR, INTR] ]
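The commented lines in the listing suggest that multimapper counts are not gathered by a counter of their own but follow from a difference (MULMAP = LIBR - UNIQ). A minimal sketch of that derivation with collections.Counter is given here; the string values of the labels, the bin indices and the counts are made up for illustration and are not taken from coalispr's configuration files.

```python
from collections import Counter

# Hypothetical label values; the real ones are set in 2_shared.txt.
LIBR, UNIQ, MULMAP = "library", "unique", "multimapper"

# Made-up bin counts, keyed by (region, binno, binregion).
counts = {
    LIBR: Counter({("chr1", 0, "0-50"): 10, ("chr1", 1, "50-100"): 4}),
    UNIQ: Counter({("chr1", 0, "0-50"): 7, ("chr1", 1, "50-100"): 4}),
}

# MULMAP = LIBR - UNIQ; Counter subtraction drops bins whose result is 0.
counts[MULMAP] = counts[LIBR] - counts[UNIQ]
print(counts[MULMAP])
```

Because Counter subtraction keeps only positive results, bins without multimappers disappear from the derived counter; whether that is wanted depends on how the counts are tabulated afterwards.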
For multiprocessing, controllers with their own counters are forked out for separate bamfiles, that is, one per sample (key, i.e. its SHORT name, used as column header). All dataframes from the counters of each controller are therefore to be merged into overall dataframes for each count type, with a key (column) for each sample.
Approach: create sample controllers, each with its own set of counters, for use in parallel count processing (multiprocessing); these are handed over to the main controller, which should not need to be shared between any of the subprocesses.
The main controller, Bam_samples_collator, fuses the count series of each sample controller (this class) into one dataframe and saves it to TSV.
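The approach described in these notes, counting each sample in isolation and fusing the results afterwards, can be sketched as follows. The function names `count_sample` and `collate`, the sample names and the bin labels are hypothetical; this is not how Bam_samples_collator is implemented.

```python
from collections import Counter

def count_sample(item):
    """Count reads per bin for one sample; returns (key, Counter)."""
    key, binned_reads = item
    ctr = Counter()
    for binidx in binned_reads:
        ctr[binidx] += 1
    return key, ctr

def collate(results):
    """Fuse per-sample counters into one table: {binidx: {key: count}}."""
    table = {}
    for key, ctr in results:
        for binidx, val in ctr.items():
            table.setdefault(binidx, {})[key] = val
    return table

# Made-up input: sample SHORT name -> bin index of each mapped read.
samples = {
    "wt_1": ["bin0", "bin0", "bin1"],
    "mut_1": ["bin1", "bin2"],
}

# Built-in map() keeps the sketch simple; multiprocessing.Pool.map could
# take its place, because count_sample works on one sample in isolation.
table = collate(map(count_sample, samples.items()))
print(table)
```

The nested dict has one row per bin and one entry per sample key, so it could be handed to pandas (for instance `pd.DataFrame.from_dict(table, orient="index")`) to obtain a dataframe with a column for each sample.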
- key¶
SHORT name representing experiment/library/sample to count; will be column header in overall dataframes.
- Type:
str
- binners¶
- Type:
dict
- sizers¶
- Type:
dict
- mmappers¶
Dictionaries to organize counters according to count type
- Type:
dict
- key¶
- binners = None¶
- sizers = None¶
- mmappers = None¶
- get_sample_counters()¶
- set_bincounts(cntidx)¶
Update bincounters with key-linked region counts.
- Parameters:
cntidx (tuple) – (region, binno, binregion)
key (str) – Name of the Series (sample that is counted)
- skip_count(val=1)¶
Add to SkipCounter only.
- report_skipped()¶
Feedback on missed counts for each sample/key.
- update_strand_count(label, val, strand, lenreadidx)¶
Set in-read count for given counter
- Parameters:
label (str) – Name for the counter to update.
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
lenreadidx (index) – Index describing readlength to add counts to.
- update_multimap_count(label, val, strand, nhreadidx)¶
Set count for multimapper hit-number (NH) for given counter.
- Parameters:
label (str) – Name for the counter to update.
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
nhreadidx (index) – Index describing hit/repeat-number to add counts to.
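The controller methods take a `label` in addition to the arguments of the counter methods, which suggests that the controller looks up a counter by label and forwards the call. A minimal sketch of such a dispatch is given below; the classes `LengthCounter` and `SampleController`, the label strings and the strand names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

class LengthCounter:
    """Hypothetical stand-in for a per-label read-length counter."""
    def __init__(self, label, key):
        self.label = label
        self.key = key
        self.counts = defaultdict(Counter)  # strand -> {read length: count}

    def update_strand_count(self, val, strand, lenreadidx):
        self.counts[strand][lenreadidx] += val

class SampleController:
    """Hypothetical controller: one sample key, one counter per label."""
    def __init__(self, key, labels):
        self.key = key
        self.sizers = {label: LengthCounter(label, key) for label in labels}

    def update_strand_count(self, label, val, strand, lenreadidx):
        # Look up the counter by label and forward the remaining arguments.
        self.sizers[label].update_strand_count(val, strand, lenreadidx)

ctrl = SampleController("wt_1", ["library", "unique"])
ctrl.update_strand_count("library", 1, "munro", 21)
ctrl.update_strand_count("library", 0.5, "munro", 21)
ctrl.update_strand_count("unique", 1, "corbett", 22)
print(ctrl.sizers["library"].counts["munro"][21])
```

Keeping the counters in a dict keyed by label means the calling code only needs to know the label of the read category, not which counter object handles it.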
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_counter_controller(key)¶
Bases:
Bam_sample_base_controller
Interface for counting; calls counter functions.
Notes
Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here:
    # bin counters
    CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
    # MULMAP = LIBR - UNIQ
    CNTCDNA = [COLLR, UNIQ+COLLR]
    # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
    CNTGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    #CNTSKIP = [SKIP]
    CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]
    # readlength counters
    LENREAD = [LIBR, UNIQ, XTRA, UNSEL]
    LENCDNA = [COLLR, UNIQ+COLLR]
    LENGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    LENCNTRS = [LENREAD, LENCDNA, LENGAP]
    # multimapper counters
    MMAPCNTRS = [ [LIBR, INTR] ]
For multiprocessing, controllers with their own counters are forked out for separate bamfiles, that is, one per sample (key, i.e. its SHORT name, used as column header). All dataframes from the counters of each controller are therefore to be merged into overall dataframes for each count type, with a key (column) for each sample.
Approach: create sample controllers, each with its own set of counters, for use in parallel count processing (multiprocessing); these are handed over to the main controller, which should not need to be shared between any of the subprocesses.
The main controller, Bam_samples_collator, fuses the count series of each sample controller (this class) into one dataframe and saves it to TSV.
- key¶
SHORT name representing experiment/library/sample to count; will be column header in overall dataframes.
- Type:
str
- binners¶
- Type:
dict
- sizers¶
- Type:
dict
- mmappers¶
Dictionaries to organize counters according to count type
- Type:
dict
- binners¶
- sizers¶
- mmappers¶
- get_sample_counters()¶
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_region_counter_controller(key, region, comparereads)¶
Bases:
Bam_sample_base_controller
Interface for counting particular regions in a sample; calls counter functions.
- key¶
SHORT name for sample, for retrieval/archiving sequences.
- Type:
str
- region¶
Chromosome coordinates that define the region with mapped sequences to analyze.
- Type:
str
- comparereads¶
List of labels to indicate the kinds of reads to compare against each other: [LIBR, UNIQ, MULMAP+LIBR] or [COLLR, UNIQ+COLLR, MULMAP+COLLR].
- Type:
list
- binners¶
Dictionary to store counters, here only for skipped reads.
- Type:
dict
- sizers¶
Dictionary to store length counters, which also help to get overall counts.
- Type:
dict
Notes
Of the counter groups defined in 2_shared.txt (SHARED), needed here are:
REGCNTRS = [ [LIBR, UNIQ], [COLLR, UNIQ+COLLR] , CNTSKIP ]
- region¶
- comparereads¶
- binners¶
- sizers¶
- get_sample_counters()¶
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_counter(label, key)¶
Interface class for making counters to build dataframes from.
Notes
Is a collections.Counter faster than a pandas.DataFrame used as counter in view of fractional counts for multimappers? Strand-specific counting needs to be facilitated too.
- label¶
Name for BinCounter, from the CNTRS list
- Type:
str
- key¶
Short name for sample/experiment/label; used as column-header & Counter name.
- Type:
str
- label¶
- key¶
- update_strand_count(val, strand, lenreadidx)¶
Set in-read count
- set_bincounts(cntidx)¶
Set region-index for self.key (Name of the Series i.e. sample that is counted) to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_bin_counter(label, key)¶
Bases:
Sample_counter
Class for making dict counters, with bin-regions in keys for counts.
- munrcntr¶
Dict to hold counts for reads on munro-strand, divided over regions and bins (as key/index) with SHORT name as column key/counter name.
- Type:
collections.Counter
- corbcntr¶
Dict to hold counts for reads on corbett-strand; as munrcntr but holding counts for reads from opposite strand.
- Type:
collections.Counter
- munrcount¶
Tracks munro-strand associated counts linked to region iterated; reset when a new region is iterated over.
- Type:
int or float
- corbcount¶
Tracks corbett-strand associated counts for region iterated
- Type:
int or float
- munrcntr¶
- corbcntr¶
- update_strand_count(val, strand, lenreadidx)¶
Set in-read count (lenreadidx is ignored).
- set_bincounts(cntidx)¶
Set region-index for self.key to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
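How the attributes and the two methods of the bin counter could work together is sketched below: running strand counts are collected per read and stored under the bin index when the bin is completed, after which they are reset. The class name `BinCounterSketch` and the values of the strand names are assumptions; this illustrates the described behaviour and is not the coalispr implementation.

```python
from collections import Counter

MUNR, CORB = "munro", "corbett"  # hypothetical values for the strand names

class BinCounterSketch:
    """Illustrative bin counter with one Counter per strand."""
    def __init__(self, label, key):
        self.label = label
        self.key = key
        self.munrcntr = Counter()  # bin index -> counts, munro strand
        self.corbcntr = Counter()  # bin index -> counts, corbett strand
        self.munrcount = 0         # running count for the current bin
        self.corbcount = 0

    def update_strand_count(self, val, strand, lenreadidx=None):
        # lenreadidx is accepted for a uniform interface but not used.
        if strand == MUNR:
            self.munrcount += val
        elif strand == CORB:
            self.corbcount += val

    def set_bincounts(self, cntidx):
        # Store the running counts under the bin index, then reset them.
        self.munrcntr[cntidx] = self.munrcount
        self.corbcntr[cntidx] = self.corbcount
        self.munrcount = self.corbcount = 0

bc = BinCounterSketch("library", "wt_1")
bc.update_strand_count(1, MUNR)
bc.update_strand_count(0.5, MUNR)  # fractional count, as for a multimapper
bc.update_strand_count(1, CORB)
bc.set_bincounts(("chr1", 0, "0-50"))
print(bc.munrcntr, bc.corbcntr, bc.munrcount)
```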
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_skip_counter(label, key)¶
Bases:
Sample_counter
Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads (should be none).
- notfram¶
As munrfram or corbfram, but not stranded.
- Type:
pd.DataFrame
- not_keycntr¶
As munr_keycntr or corb_keycntr, but not stranded.
- Type:
pd.Series
- notcount¶
Skipped count linked to key (SHORT name for sample).
- Type:
int
- notcntr¶
- set_bincounts(cntidx)¶
Set region-index for key to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
- update_strand_count(val, strand, lenreadidx)¶
Set in-read count
- skip_count(val)¶
Use one counter, independent of strand.
- Parameters:
val (int) – Value to add to gathered counts.
- report_skipped()¶
Indicate how many reads have been skipped for this sample.
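A skip counter needs only one unstranded tally per sample. The sketch below assumes that skipped reads are tallied per bin in `notcntr` and summed for the report; the class name `SkipCounterSketch` and the wording of the report are hypothetical.

```python
from collections import Counter

class SkipCounterSketch:
    """Illustrative unstranded counter for skipped reads."""
    def __init__(self, label, key):
        self.label = label
        self.key = key
        self.notcntr = Counter()  # bin index -> skipped reads
        self.notcount = 0         # running skipped count for the current bin

    def skip_count(self, val=1):
        # One counter, independent of strand.
        self.notcount += val

    def set_bincounts(self, cntidx):
        # Store the running count under the bin index, then reset it.
        self.notcntr[cntidx] = self.notcount
        self.notcount = 0

    def report_skipped(self):
        total = sum(self.notcntr.values())
        return f"{self.key}: {total} reads skipped"

sc = SkipCounterSketch("skipped", "wt_1")
sc.skip_count()
sc.skip_count()
sc.set_bincounts(("chr1", 0, "0-50"))
sc.skip_count()
sc.set_bincounts(("chr1", 1, "50-100"))
print(sc.report_skipped())
```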
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_length_counter(label, key)¶
Bases:
Sample_counter
Class for making data-frame counters cataloging lengths.
Notes
pd.DataFrame is used instead of collections.Counter in order to keep the float nature of counts for multimapped reads.
- idxnam¶
Name for index column when dataframes get saved to TSV.
- Type:
str
- cntlabl¶
Label to mark type of counts in saved file name.
- Type:
str
- update_strand_count(val, strand, lenreadidx)¶
Forward strand count to Series counter.
- Parameters:
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
lenreadidx (index) – Index describing readlength to add counts to.
- update_multimap_count(val, strand, lenreadidx)¶
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_multimap_counter(label, key)¶
Bases:
Sample_length_counter
Class for making data-frame counters cataloging multimappers.
- cntlabl = 'multimapper'¶
- idxnam = 'repeats'¶
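A multimapper counter catalogs counts by the number of hits (NH) of a read instead of by read length. The sketch below uses plain dictionaries rather than a dataframe and assumes that each alignment of a read with NH hits is weighted 1/NH, which is one common way to arrive at fractional counts and is not confirmed by this page. The class name `MultimapCounterSketch`, the `rows` helper and the strand-name values are hypothetical.

```python
from collections import defaultdict

MUNR, CORB = "munro", "corbett"  # hypothetical strand-name values

class MultimapCounterSketch:
    cntlabl = "multimapper"  # marks the count type in a saved file name
    idxnam = "repeats"       # name of the index column in a saved TSV

    def __init__(self, label, key):
        self.label = label
        self.key = key
        # strand -> {number of hits (NH): summed, possibly fractional, count}
        self.counts = {MUNR: defaultdict(float), CORB: defaultdict(float)}

    def update_multimap_count(self, val, strand, nhreadidx):
        self.counts[strand][nhreadidx] += val

    def rows(self):
        """Yield (repeats, munro count, corbett count), sorted by NH."""
        nhs = sorted(set(self.counts[MUNR]) | set(self.counts[CORB]))
        for nh in nhs:
            yield (nh, self.counts[MUNR].get(nh, 0.0),
                   self.counts[CORB].get(nh, 0.0))

mm = MultimapCounterSketch("library", "wt_1")
# Assumption: each alignment of a read with NH hits is weighted 1/NH.
for nh, strand in [(1, MUNR), (4, MUNR), (4, MUNR), (2, CORB)]:
    mm.update_multimap_count(1 / nh, strand, nh)
print(list(mm.rows()))
```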
- class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_region_length_counter(label, key)¶
Bases:
Sample_counter
Counters cataloging region counts without strand information.
- keycntr¶
Dictionary to hold counts with region info as index; specific for SHORT (key) name for counted sample; reset for counting another sample.
- Type:
collections.Counter
- set_bincounts(cntidx)¶
Set region-index for self.key (Name of the Series i.e. sample that is counted) to stranded counts; prepare for next round by resetting strand count.
- Parameters:
cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
- update_strand_count(val, strand, lenreadidx)¶
Forward strand count to key-counter; ignore strand.
- Parameters:
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand (MUNR/CORB) where reads come from; not used.
lenreadidx (index) – Index describing readlength to add counts to.
- update_multimap_count(val, strand, lenreadidx)¶