coalispr.bedgraph_analyze.bam_sample_counter_counters¶

Module with counters for extracting information from bamfiles.

import pandas as pd
from collections import Counter

ctr = Counter()
ctr["a"] += 1
ctr["b"] += 1
ctr["a"] += 1
ctr["a"] += 1
ctr["b"] += 1
ctr["a"] += 1
ctr["c"] += 1

# Create dataframe df from Counter:
df = pd.DataFrame({"ctr": ctr})

df
   ctr
a    4
b    2
c    1

Use collections.Counter instead of pd.DataFrame when only one sample/column is relevant for each different counter.

Attributes¶

logger

Classes¶

`Bam_sample_base_controller`	Interface for counting; calls counter functions.
`Bam_sample_counter_controller`	Interface for counting; calls counter functions.
`Bam_sample_region_counter_controller`	Interface for counting particular regions in a sample; calls counter functions.
`Sample_counter`	Interface class for making counters to build dataframes from.
`Sample_bin_counter`	Class for making dict counters, with bin-regions in keys for counts.
`Sample_skip_counter`	Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads.
`Sample_length_counter`	Class for making data-frame counters cataloging lengths.
`Sample_multimap_counter`	Class for making data-frame counters cataloging multimappers.
`Sample_region_length_counter`	Counters cataloging region counts without strand information.

Module Contents¶

coalispr.bedgraph_analyze.bam_sample_counter_counters.logger¶

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_base_controller(key)¶

Interface for counting; calls counter functions.

Notes

Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here.

::

    # bin counters
    CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
    # MULMAP = LIBR - UNIQ
    CNTCDNA = [COLLR, UNIQ+COLLR]  # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
    CNTGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    #CNTSKIP = [SKIP]
    CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]

    # readlength counters
    LENREAD = [LIBR, UNIQ, XTRA, UNSEL]
    LENCDNA = [COLLR, UNIQ+COLLR]
    LENGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    LENCNTRS = [LENREAD, LENCDNA, LENGAP]

    # multimapper counters
    MMAPCNTRS = [ [LIBR, INTR] ]
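The groups above suggest one counter per label, organized by count type. The following is a minimal sketch of that idea, not the module's actual code. It assumes the label constants are plain strings (the values below are placeholders, not those from 2_shared.txt or 3_EXP.txt), and `make_counters` is a hypothetical helper.

```python
from collections import Counter

# Placeholder label values; the real ones are defined in the configuration files.
LIBR, UNIQ, XTRA, UNSEL = "library", "uniq", "extra", "unselected"
COLLR, INTR, SKIP = "collapsed", "intron", "skipped"

CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
CNTCDNA = [COLLR, UNIQ + COLLR]
CNTGAP = [INTR, INTR + COLLR, UNIQ + INTR, UNIQ + INTR + COLLR]
CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP]]


def make_counters(groups):
    """Return {label: Counter()} for every label in the nested groups."""
    return {label: Counter() for group in groups for label in group}


# One empty counter per label, ready to be filled while reading a bamfile.
binners = make_counters(CNTRS)
print(sorted(binners))
```

Flattening the nested groups into a single dictionary keeps lookups by label simple, while the group lists remain available for deciding which counters belong to which output.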

For multiprocessing, controllers with their own counters are forked out per bamfile, that is, per sample (identified by key, its SHORT name, which becomes the column header). The dataframes from the counters of each controller are then merged into overall dataframes for each count type, with one column per sample key.

Approach: create sample controllers, each with its own set of counters, for use in parallel count processing (multiprocessing). The results are handed over to the main controller, which should not need to be shared between any of the subprocesses.

The main controller, Bam_samples_collator, fuses counts series of each Bam_sample_controller (this class) into one dataframe, and does the saving to tsv.
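The merging step described here can be illustrated with pandas. This is a sketch under stated assumptions, not the behaviour of Bam_samples_collator: the sample names, bin labels and counts are invented, the index is simplified to strings, and the merge runs sequentially rather than after multiprocessing.

```python
from collections import Counter

import pandas as pd

# Hypothetical per-sample counters, one per controller, keyed by a bin label.
sample_counts = {
    "wt_1": Counter({"chr1:0": 3, "chr1:1": 1}),
    "mut_1": Counter({"chr1:1": 2, "chr2:0": 5}),
}

# One named Series per sample; the sample key becomes the column header.
series = [pd.Series(dict(ctr), name=key) for key, ctr in sample_counts.items()]

# Align on the union of bins; bins absent from a sample get a count of 0.
merged = pd.concat(series, axis=1).fillna(0).sort_index()

# Tab-separated output, produced as a string here instead of a file.
tsv = merged.to_csv(sep="\t")
print(tsv)
```

Because each sample contributes an independent Series, the per-sample counting can run in separate processes and only the finished counters need to be passed back for merging.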

key¶

SHORT name representing experiment/library/sample to count, will be column header in overall dataframes.

Type:: str

binners¶

Type:: dict

sizers¶

Type:: dict

mmappers¶

Dictionaries to organize counters according to count type

Type:: dict

key¶

binners = None¶

sizers = None¶

mmappers = None¶

get_sample_counters()¶

set_bincounts(cntidx)¶

Update bincounters with key-linked region counts.

Parameters:

cntidx (tuple) – (region, binno, binregion)
key (str) – Name of the Series (sample that is counted)

skip_count(val=1)¶: Add to SkipCounter only.

report_skipped()¶: Feedback on missed counts for each sample/key.

update_strand_count(label, val, strand, lenreadidx)¶

Set in-read count for given counter

Parameters:

label (str) – Name for the counter to update.
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(label, val, strand, nhreadidx)¶

Set count for multimapper hit-number (NH) for given counter.

Parameters:

label (str) – Name for the counter to update.
val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
nhreadidx (index) – Index describing hit/repeat-number to add counts to.
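The controller methods above take a label and forward the update to the matching counter. A minimal sketch of that dispatch pattern follows; `LengthCounterStub` and `ControllerSketch` are hypothetical names, and the strand and label strings are placeholders for the configured constants.

```python
from collections import defaultdict


class LengthCounterStub:
    """Minimal stand-in for a per-label length counter."""

    def __init__(self, label, key):
        self.label, self.key = label, key
        self.counts = defaultdict(float)  # {(strand, index): count}

    def update_strand_count(self, val, strand, lenreadidx):
        self.counts[(strand, lenreadidx)] += val


class ControllerSketch:
    """Hypothetical controller that forwards updates to counters by label."""

    def __init__(self, key, labels):
        self.key = key
        self.sizers = {lab: LengthCounterStub(lab, key) for lab in labels}

    def update_strand_count(self, label, val, strand, lenreadidx):
        # Look up the counter for this label and delegate the update.
        self.sizers[label].update_strand_count(val, strand, lenreadidx)


ctrl = ControllerSketch("wt_1", ["library", "uniq"])
ctrl.update_strand_count("library", 1, "munro", 21)
ctrl.update_strand_count("library", 0.5, "munro", 21)
ctrl.update_strand_count("uniq", 1, "corbett", 22)
```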

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_counter_controller(key)¶

Bases: Bam_sample_base_controller

Interface for counting; calls counter functions.

Notes

Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here.

::

    # bin counters
    CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
    # MULMAP = LIBR - UNIQ
    CNTCDNA = [COLLR, UNIQ+COLLR]  # COLLR+MULMAP = COLLR - (UNIQ+COLLR)
    CNTGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    #CNTSKIP = [SKIP]
    CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]

    # readlength counters
    LENREAD = [LIBR, UNIQ, XTRA, UNSEL]
    LENCDNA = [COLLR, UNIQ+COLLR]
    LENGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
    LENCNTRS = [LENREAD, LENCDNA, LENGAP]

    # multimapper counters
    MMAPCNTRS = [ [LIBR, INTR] ]

The main controller, Bam_samples_collator, fuses counts series of each Bam_sample_controller (this class) into one dataframe, and does the saving to tsv.

key¶

SHORT name representing experiment/library/sample to count, will be column header in overall dataframes.

Type:: str

binners¶

Type:: dict

sizers¶

Type:: dict

mmappers¶

Dictionaries to organize counters according to count type

Type:: dict

binners¶

sizers¶

mmappers¶

get_sample_counters()¶

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_region_counter_controller(key, region, comparereads)¶

Bases: Bam_sample_base_controller

Interface for counting particular regions in a sample; calls counter functions.

key¶

SHORT name for sample, for retrieval/archiving sequences.

Type:: str

region¶

Chromosome coordinates that define the region with mapped sequences to analyze.

Type:: str

comparereads¶

List of labels to indicate the kinds of reads to compare against each other: [LIBR, UNIQ, MULMAP+LIBR] or [COLLR, UNIQ+COLLR, MULMAP+COLLR].

Type:: list

binners¶

Dictionary to store counters, here only for skipped reads.

Type:: dict

sizers¶

Dictionary to store length counters, which also help to get overall counts.

Type:: dict

Notes

Of the counter groups defined in 2_shared.txt (SHARED), those needed here are:

REGCNTRS       = [ [LIBR, UNIQ],  [COLLR, UNIQ+COLLR] , CNTSKIP ]

region¶

comparereads¶

binners¶

sizers¶

get_sample_counters()¶

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_counter(label, key)¶

Interface class for making counters to build dataframes from.

Notes

Is a collections.Counter faster than a pandas.DataFrame used as counter in view of fractional counts for multimappers? Strand-specific counting needs to be facilitated too.

label¶

Name for BinCounter, from the CNTRS list

Type:: str

key¶

Short name for sample/experiment/label; used as column-header & Counter name.

Type:: str

label¶

key¶

update_strand_count(val, strand, lenreadidx)¶: Set in-read count

set_bincounts(cntidx)¶

Set region-index for self.key (Name of the Series i.e. sample that is counted) to stranded counts; prepare for next round by resetting strand count.

Parameters:: cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_bin_counter(label, key)¶

Bases: Sample_counter

Class for making dict counters, with bin-regions in keys for counts.

munrcntr¶

Dict to hold counts for reads on munro-strand, divided over regions and bins (as key/index) with SHORT name as column key/counter name.

Type:: collections.Counter

corbcntr¶

Dict to hold counts for reads on corbett-strand; as munrcntr but holding counts for reads from opposite strand.

Type:: collections.Counter

munrcount¶

Tracks munro-strand associated counts linked to region iterated; reset when a new region is iterated over.

Type:: int or float

corbcount¶

Tracks corbett-strand associated counts for region iterated

Type:: int or float

munrcntr¶

corbcntr¶

update_strand_count(val, strand, lenreadidx)¶: Set in-read count (lenreadidx is ignored).

set_bincounts(cntidx)¶

Set region-index for self.key to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).
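The interplay of the running strand counts and the per-bin counters can be sketched as follows. This is an illustration of the described behaviour, not the class itself: `BinCounterSketch` is a hypothetical name, the strand strings are placeholders for MUNR and CORB, and the index tuple is invented.

```python
from collections import Counter

MUNR, CORB = "munro", "corbett"  # placeholder strand names


class BinCounterSketch:
    """Hypothetical stand-in for a stranded bin counter."""

    def __init__(self, label, key):
        self.label = label
        self.key = key
        self.munrcntr = Counter()  # per-bin totals, munro strand
        self.corbcntr = Counter()  # per-bin totals, corbett strand
        self.munrcount = 0         # running count for the current bin
        self.corbcount = 0

    def update_strand_count(self, val, strand, lenreadidx=None):
        # lenreadidx is accepted for interface compatibility but ignored.
        if strand == MUNR:
            self.munrcount += val
        elif strand == CORB:
            self.corbcount += val

    def set_bincounts(self, cntidx):
        # Store the running totals under the bin index, then reset them.
        self.munrcntr[cntidx] += self.munrcount
        self.corbcntr[cntidx] += self.corbcount
        self.munrcount = 0
        self.corbcount = 0


counter = BinCounterSketch("library", "wt_1")
counter.update_strand_count(1, MUNR)
counter.update_strand_count(0.5, MUNR)
counter.update_strand_count(1, CORB)
counter.set_bincounts(("chr1", 0, "0-50"))
```

Note that collections.Counter accepts float increments, so fractional values pass through this sketch unchanged.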

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_skip_counter(label, key)¶

Bases: Sample_counter

Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads (should be none).

notfram¶

As munrfram or corbfram, but not stranded.

Type:: pd.DataFrame

not_keycntr¶

As munr_keycntr or corb_keycntr, but not stranded.

Type:: pd.Series

notcount¶

Skipped count linked to key (SHORT name for sample).

Type:: int

notcntr¶

set_bincounts(cntidx)¶

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).
key (str) – Name of the Series (sample that is counted).

update_strand_count(val, strand, lenreadidx)¶: Set in-read count

skip_count(val)¶

Use one counter, independent of strand.

Parameters:: val (int) – Value to add to gathered counts.

report_skipped()¶: Indicate how many reads have been skipped for this sample.
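A skip counter of this kind needs only a single, unstranded tally per sample. The sketch below assumes that `report_skipped` logs a message and also returns it; both the message format and the return value are illustrative choices, and `SkipCounterSketch` is a hypothetical name.

```python
import logging

logger = logging.getLogger(__name__)


class SkipCounterSketch:
    """Hypothetical stand-in for an unstranded skipped-read counter."""

    def __init__(self, label, key):
        self.label = label
        self.key = key
        self.notcount = 0  # skipped reads for this sample

    def skip_count(self, val=1):
        # One counter only, independent of strand.
        self.notcount += val

    def report_skipped(self):
        msg = f"{self.key}: {self.notcount} reads skipped"
        logger.info(msg)
        return msg


skipper = SkipCounterSketch("skipped", "wt_1")
skipper.skip_count()
skipper.skip_count(2)
report = skipper.report_skipped()
```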

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_length_counter(label, key)¶

Bases: Sample_counter

Class for making data-frame counters cataloging lengths.

Notes

pd.DataFrame is used instead of collections.Counter in order to keep the float nature of counts for multimapped reads.

idxnam¶

Name for index column when dataframes get saved to TSV.

Type:: str

cntlabl¶

Label to mark type of counts in saved file name.

Type:: str

update_strand_count(val, strand, lenreadidx)¶

Forward strand count to Series counter.

Parameters:

val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand where reads come from (MUNR or CORB).
lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(val, strand, lenreadidx)¶

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_multimap_counter(label, key)¶

Bases: Sample_length_counter

Class for making data-frame counters cataloging multimappers.

cntlabl = 'multimapper'¶

idxnam = 'repeats'¶
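The multimapper counter differs from the length counter only in these two class attributes. A sketch of that inheritance pattern follows, with several assumptions: the base-class values for idxnam and cntlabl are placeholders (the real ones are not shown here), a nested dict stands in for the pd.DataFrame the notes mention, `filename` is a hypothetical helper, and the 1/NH weighting is an assumed convention for fractional multimapper counts.

```python
from collections import defaultdict


class LengthCounterSketch:
    idxnam = "length"       # placeholder; the real base value is not shown
    cntlabl = "readlength"  # placeholder; the real base value is not shown

    def __init__(self, label, key):
        self.label, self.key = label, key
        # {strand: {index: float count}}; floats keep fractional counts.
        self.counts = defaultdict(lambda: defaultdict(float))

    def update_strand_count(self, val, strand, idx):
        self.counts[strand][idx] += val

    def filename(self):
        # Hypothetical naming scheme for a saved TSV.
        return f"{self.key}_{self.label}_{self.cntlabl}.tsv"


class MultimapCounterSketch(LengthCounterSketch):
    # Only the labels change; the counting logic is inherited.
    idxnam = "repeats"
    cntlabl = "multimapper"


mm = MultimapCounterSketch("library", "wt_1")
for nh in (1, 4, 4):
    # Assumption: each alignment of a read with NH hits adds 1/NH.
    mm.update_strand_count(1 / nh, "munro", nh)
```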

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_region_length_counter(label, key)¶

Bases: Sample_counter

Counters cataloging region counts without strand information.

keycntr¶

Dictionary to hold counts with region info as index; specific for: SHORT (key) name for counted sample; reset for counting another sample.

Type:: collections.Counter

set_bincounts(cntidx)¶

Set region-index for self.key (Name of the Series i.e. sample that is counted) to stranded counts; prepare for next round by resetting strand count.

Parameters:: cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

update_strand_count(val, strand, lenreadidx)¶

Forward strand count to key-counter; ignore strand.

Parameters:

val (int or float) – Value to add to gathered counts.
strand (str) – Name for strand (MUNR/CORB) where reads come from; not used.
lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(val, strand, lenreadidx)¶