coalispr.bedgraph_analyze.bam_sample_counter_counters

Module with counters for extracting information from bamfiles.

import pandas as pd
from collections import Counter

ctr = Counter()
ctr["a"] += 1
ctr["b"] += 1
ctr["a"] += 1
ctr["a"] += 1
ctr["b"] += 1
ctr["a"] += 1
ctr["c"] += 1

# Create dataframe df from Counter:
df = pd.DataFrame({"ctr": ctr})

df
   ctr
a    4
b    2
c    1

Use collections.Counter instead of pd.DataFrame when only one sample/column is relevant for each different counter.

Attributes

Classes

Bam_sample_base_controller

Interface for counting; calls counter functions.

Bam_sample_counter_controller

Interface for counting; calls counter functions.

Bam_sample_region_counter_controller

Interface for counting particular regions in a sample; calls counter functions.

Sample_counter

Interface class for making counters to build dataframes from.

Sample_bin_counter

Class for making dict counters, with bin-regions in keys for counts.

Sample_skip_counter

Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads (should be none).

Sample_length_counter

Class for making data-frame counters cataloging lengths.

Sample_multimap_counter

Class for making data-frame counters cataloging multimappers.

Sample_region_length_counter

Counters cataloging region counts without strand information.

Module Contents

coalispr.bedgraph_analyze.bam_sample_counter_counters.logger
class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_base_controller(key)

Interface for counting; calls counter functions.

Notes

Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here.

# bin counters
CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
# MULMAP = LIBR - UNIQ
CNTCDNA = [COLLR, UNIQ+COLLR]
# COLLR+MULMAP = COLLR - (UNIQ+COLLR)
CNTGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
#CNTSKIP = [SKIP]
CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]

# readlength counters
LENREAD = [LIBR, UNIQ, XTRA, UNSEL]
LENCDNA = [COLLR, UNIQ+COLLR]
LENGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
LENCNTRS = [LENREAD, LENCDNA, LENGAP]

# multimapper counters
MMAPCNTRS = [ [LIBR, INTR] ]
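
The commented lines in the counter groups (for example MULMAP = LIBR - UNIQ) describe counts that follow from other counters rather than being gathered directly. A minimal sketch of that arithmetic is given here; the label strings, bin names, and numbers are made up for illustration, and the real constants come from the configuration files.

```python
from collections import Counter

# Hypothetical label strings; the real values are set in the configuration files.
LIBR, UNIQ = "library", "unique"

# Made-up per-bin counts for one sample.
counts = {
    LIBR: Counter({"bin1": 10, "bin2": 4}),
    UNIQ: Counter({"bin1": 7, "bin2": 4}),
}

# MULMAP = LIBR - UNIQ, computed per bin.
# An explicit difference is used because `Counter - Counter` silently
# drops bins whose result is zero or negative.
mulmap = {b: counts[LIBR][b] - counts[UNIQ][b] for b in counts[LIBR]}
```

Here `bin2` keeps an explicit value of 0, which preserves the bin when the result is later aligned with other counts.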

For multiprocessing, controllers with their own counters are forked out per bamfile, that is, per sample (identified by key, i.e. its SHORT name, which serves as column header). All dataframes built from the counters of each controller are therefore merged into overall dataframes for each count type, with one key (column) per sample.

Approach: create sample controllers, each with its own set of counters, for use in parallel count processing (multiprocessing); these are handed over to the main controller, which should not need to be shared between the subprocesses.

The main controller, Bam_samples_collator, fuses the count series of each Bam_sample_controller (this class) into one dataframe and saves it to TSV.
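
The merge step described above can be sketched as follows. This is an illustration only, not the Bam_samples_collator implementation: the sample names, bin tuples, and counts are invented, and the per-sample counters are created directly instead of being returned from subprocesses.

```python
from collections import Counter

import pandas as pd

# Hypothetical per-sample counters, as each sample controller might hand them
# over; index tuples follow the (region, binno, binregion) pattern.
sample_counts = {
    "wt1": Counter({("chr1", 0, "chr1:0-50"): 3, ("chr1", 1, "chr1:50-100"): 1}),
    "mut1": Counter({("chr1", 0, "chr1:0-50"): 2, ("chr2", 0, "chr2:0-50"): 5}),
}

# One Series per sample, named after its key so that the key becomes
# the column header in the merged dataframe.
series = [
    pd.Series(cnt, name=key, dtype=float) for key, cnt in sample_counts.items()
]

# Align on the union of all bins; bins absent from a sample count as 0.
merged = pd.concat(series, axis=1).fillna(0)

# Tab-separated text, as would be written to a .tsv file.
tsv_text = merged.to_csv(sep="\t")
```

Because each sample only contributes its own column, the subprocesses never need to share a counter; alignment happens once, in the collating step.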

key

SHORT name representing experiment/library/sample to count, will be column header in overall dataframes.

Type:

str

binners
Type:

dict

sizers
Type:

dict

mmappers

Dictionaries to organize counters according to count type

Type:

dict

key
binners = None
sizers = None
mmappers = None
get_sample_counters()
set_bincounts(cntidx)

Update bincounters with key-linked region counts.

Parameters:
  • cntidx (tuple) – (region, binno, binregion)

  • key (str) – Name of the Series (sample that is counted)

skip_count(val=1)

Add to SkipCounter only.

report_skipped()

Feedback on missed counts for each sample/key.

update_strand_count(label, val, strand, lenreadidx)

Set in-read count for given counter

Parameters:
  • label (str) – Name for the counter to update.

  • val (int or float) – Value to add to gathered counts.

  • strand (str) – Name for strand where reads come from (MUNR or CORB).

  • lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(label, val, strand, nhreadidx)

Set count for multimapper hit-number (NH) for given counter.

Parameters:
  • label (str) – Name for the counter to update.

  • val (int or float) – Value to add to gathered counts.

  • strand (str) – Name for strand where reads come from (MUNR or CORB).

  • nhreadidx (index) – Index describing hit/repeat-number to add counts to.

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_counter_controller(key)

Bases: Bam_sample_base_controller

Interface for counting; calls counter functions.

Notes

Counter groups defined in 2_shared.txt (SHARED) and 3_EXP.txt (EXPTXT) are used here.

# bin counters
CNTREAD = [LIBR, UNIQ, XTRA, UNSEL]
# MULMAP = LIBR - UNIQ
CNTCDNA = [COLLR, UNIQ+COLLR]
# COLLR+MULMAP = COLLR - (UNIQ+COLLR)
CNTGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
#CNTSKIP = [SKIP]
CNTRS = [CNTREAD, CNTCDNA, CNTGAP, [SKIP] ]

# readlength counters
LENREAD = [LIBR, UNIQ, XTRA, UNSEL]
LENCDNA = [COLLR, UNIQ+COLLR]
LENGAP = [INTR, INTR+COLLR, UNIQ+INTR, UNIQ+INTR+COLLR]
LENCNTRS = [LENREAD, LENCDNA, LENGAP]

# multimapper counters
MMAPCNTRS = [ [LIBR, INTR] ]

For multiprocessing, controllers with their own counters are forked out per bamfile, that is, per sample (identified by key, i.e. its SHORT name, which serves as column header). All dataframes built from the counters of each controller are therefore merged into overall dataframes for each count type, with one key (column) per sample.

Approach: create sample controllers, each with its own set of counters, for use in parallel count processing (multiprocessing); these are handed over to the main controller, which should not need to be shared between the subprocesses.

The main controller, Bam_samples_collator, fuses the count series of each Bam_sample_controller (this class) into one dataframe and saves it to TSV.

key

SHORT name representing experiment/library/sample to count, will be column header in overall dataframes.

Type:

str

binners
Type:

dict

sizers
Type:

dict

mmappers

Dictionaries to organize counters according to count type

Type:

dict

binners
sizers
mmappers
get_sample_counters()
class coalispr.bedgraph_analyze.bam_sample_counter_counters.Bam_sample_region_counter_controller(key, region, comparereads)

Bases: Bam_sample_base_controller

Interface for counting particular regions in a sample; calls counter functions.

key

SHORT name for sample, for retrieval/archiving sequences.

Type:

str

region

Chromosome coordinates that define the region with mapped sequences to analyze.

Type:

str

comparereads

List of labels to indicate the kind of reads to compare against each other, for example [LIBR, UNIQ, MULMAP+LIBR] or [COLLR, UNIQ+COLLR, MULMAP+COLLR].

Type:

list

binners

Dictionary to store counters, here only for skipped reads.

Type:

dict

sizers

Dictionary to store length counters, which also help to get overall counts.

Type:

dict

Notes

Of the counter groups defined in 2_shared.txt (SHARED), the following are needed here:

REGCNTRS       = [ [LIBR, UNIQ],  [COLLR, UNIQ+COLLR] , CNTSKIP ]
region
comparereads
binners
sizers
get_sample_counters()
class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_counter(label, key)

Interface class for making counters to build dataframes from.

Notes

Is a collections.Counter faster than a pandas.DataFrame used as counter in view of fractional counts for multimappers? Strand-specific counting needs to be facilitated too.
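
On the fractional-count side of this question: collections.Counter does not restrict its values to integers, so it can accumulate fractions. The sketch below assumes that each alignment of a multimapper is weighted by 1/NH, which is a common convention but is an assumption here and not confirmed as the weighting this module uses; the reads are invented.

```python
from collections import Counter

# Hypothetical reads as (read_length, NH) pairs; NH is the number of
# reported alignments for the read.
reads = [(21, 1), (21, 2), (21, 2), (22, 4)]

length_counts = Counter()
for length, nh in reads:
    # Assumed weighting: every alignment contributes 1/NH of a read.
    length_counts[length] += 1 / nh
```

Whether this is faster than a dataframe-based counter is not settled by the sketch; it only shows that float-valued counts are possible with a Counter.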

label

Name for BinCounter, from the CNTRS list

Type:

str

key

Short name for sample/experiment/label; used as column-header & Counter name.

Type:

str

label
key
update_strand_count(val, strand, lenreadidx)

Set in-read count

set_bincounts(cntidx)

Set region-index for self.key (Name of the Series i.e. sample that is counted) to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_bin_counter(label, key)

Bases: Sample_counter

Class for making dict counters, with bin-regions in keys for counts.

munrcntr

Dict to hold counts for reads on munro-strand, divided over regions and bins (as key/index) with SHORT name as column key/counter name.

Type:

collections.Counter

corbcntr

Dict to hold counts for reads on corbett-strand; as munrcntr but holding counts for reads from opposite strand.

Type:

collections.Counter

munrcount

Tracks munro-strand associated counts linked to region iterated; reset when a new region is iterated over.

Type:

int or float

corbcount

Tracks corbett-strand associated counts for region iterated

Type:

int or float

munrcntr
corbcntr
update_strand_count(val, strand, lenreadidx)

Set in-read count (lenreadidx is ignored).

set_bincounts(cntidx)

Set region-index for self.key to stranded counts; prepare for next round by resetting strand count.

Parameters:
  • cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

  • key (str) – Name of the Series (sample that is counted).
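
The interplay of update_strand_count and set_bincounts, as described by the attributes above, can be sketched like this. The class is an illustrative stand-in and not the coalispr implementation; the strand labels, sample name, and bin tuple are hypothetical.

```python
from collections import Counter

# Hypothetical strand labels; the real MUNR/CORB constants are configured elsewhere.
MUNR, CORB = "munr", "corb"


class BinCounterSketch:
    """Illustrative stand-in for a stranded bin counter."""

    def __init__(self, label, key):
        self.label = label          # counter name, e.g. one from CNTRS
        self.key = key              # SHORT sample name
        self.munrcntr = Counter()   # per-bin totals, munro strand
        self.corbcntr = Counter()   # per-bin totals, corbett strand
        self.munrcount = 0          # running count for the bin being iterated
        self.corbcount = 0

    def update_strand_count(self, val, strand, lenreadidx=None):
        # Read length is irrelevant for bin counts, so lenreadidx is ignored.
        if strand == MUNR:
            self.munrcount += val
        elif strand == CORB:
            self.corbcount += val
        else:
            raise ValueError(f"unknown strand: {strand!r}")

    def set_bincounts(self, cntidx):
        # Store the running counts under the bin index, then reset them
        # so that the next bin starts from zero.
        self.munrcntr[cntidx] += self.munrcount
        self.corbcntr[cntidx] += self.corbcount
        self.munrcount = 0
        self.corbcount = 0


counter = BinCounterSketch("library", "wt1")
counter.update_strand_count(1, MUNR)
counter.update_strand_count(0.5, MUNR)
counter.update_strand_count(1, CORB)
counter.set_bincounts(("chr1", 0, "chr1:0-50"))
```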

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_skip_counter(label, key)

Bases: Sample_counter

Class to keep track of skipped reads with imperfect alignments or more than one hit-index for uncollapsed reads (should be none).

notfram

As munrfram or corbfram, but not stranded.

Type:

pd.DataFrame

not_keycntr

As munr_keycntr or corb_keycntr, but not stranded.

Type:

pd.Series

notcount

Skipped count linked to key (SHORT name for sample).

Type:

int

notcntr
set_bincounts(cntidx)

Set region-index for key to stranded counts; prepare for next round by resetting strand count.

Parameters:
  • cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

  • key (str) – Name of the Series (sample that is counted).

update_strand_count(val, strand, lenreadidx)

Set in-read count

skip_count(val)

Use one counter, independent of strand.

Parameters:

val (int) – Value to add to gathered counts.

report_skipped()

Indicate how many reads have been skipped for this sample.

class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_length_counter(label, key)

Bases: Sample_counter

Class for making data-frame counters cataloging lengths.

Notes

pd.DataFrame is used instead of collections.Counter in order to keep the float nature of counts for multimapped reads.

idxnam

Name for index column when dataframes get saved to TSV.

Type:

str

cntlabl

Label to mark type of counts in saved file name.

Type:

str

update_strand_count(val, strand, lenreadidx)

Forward strand count to Series counter.

Parameters:
  • val (int or float) – Value to add to gathered counts.

  • strand (str) – Name for strand where reads come from (MUNR or CORB).

  • lenreadidx (index) – Index describing readlength to add counts to.
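
A dataframe-based length counter of this kind could look roughly as follows. This is a sketch under assumptions, not the class itself: the strand labels, the range of read lengths, and the index name are hypothetical, and a free function stands in for the method.

```python
import pandas as pd

# Hypothetical strand labels and read-length range.
MUNR, CORB = "munr", "corb"
lengths = range(18, 26)

# Float-valued frame: one row per read length, one column per strand.
# The index name plays the role of idxnam when the frame is saved to TSV.
frame = pd.DataFrame(
    0.0, index=pd.Index(lengths, name="length"), columns=[MUNR, CORB]
)


def update_strand_count(frame, val, strand, lenreadidx):
    """Add val to the cell for the given read length and strand."""
    frame.at[lenreadidx, strand] += val


update_strand_count(frame, 1, MUNR, 21)
update_strand_count(frame, 0.5, MUNR, 21)   # fractional multimapper count
update_strand_count(frame, 0.25, CORB, 22)
```

Preallocating the index keeps every update a plain cell assignment; lengths outside the assumed range would need the index to be extended first.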

update_multimap_count(val, strand, lenreadidx)
class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_multimap_counter(label, key)

Bases: Sample_length_counter

Class for making data-frame counters cataloging multimappers.

cntlabl = 'multimapper'
idxnam = 'repeats'
class coalispr.bedgraph_analyze.bam_sample_counter_counters.Sample_region_length_counter(label, key)

Bases: Sample_counter

Counters cataloging region counts without strand information.

keycntr

Dictionary to hold counts with region info as index; specific for SHORT (key) name for counted sample; reset for counting another sample.

Type:

collections.Counter

set_bincounts(cntidx)

Set region-index for self.key (Name of the Series i.e. sample that is counted) to stranded counts; prepare for next round by resetting strand count.

Parameters:

cntidx (tuple) – Tuple to form index with: (region, binno, binregion).

update_strand_count(val, strand, lenreadidx)

Forward strand count to key-counter; ignore strand.

Parameters:
  • val (int or float) – Value to add to gathered counts.

  • strand (str) – Name for strand (MUNR/CORB) where reads come from; not used.

  • lenreadidx (index) – Index describing readlength to add counts to.

update_multimap_count(val, strand, lenreadidx)