coalispr.bedgraph_analyze.process_bamdata

Module to count bam files based on specification of aligned reads.

Attributes

Classes

Segments

Class to retrieve reusable iterators for segments with reads to count.

Bam_collector

Class to keep track of bam files.

Total_counter

Read_checker

Class for doing cigar and other read checks.

Bam_countprocessor

Functions

total_raw_counts(tagBam, stranded, force, suffix)

Obtain total mapped reads and unmapped reads from alignments.

has_been_counted(typeofcount, kind, suffix[, folder])

Check whether count files have been created.

has_been_done(apath, force[, unspec])

Check whether counting or copying of bamfiles has been done before.

get_bam_selection()

count_bamfiles(bins, kind, writebam, force, cigchk, ...)

Obtain count data from available bam-alignment files. When possible, multiprocessing is used; otherwise files are counted sequentially.

count_region(samples, chrnam, region, strand, ...)

Obtain read-length data for a particular region on chromosome chrnam for given samples.

Module Contents

coalispr.bedgraph_analyze.process_bamdata.logger
class coalispr.bedgraph_analyze.process_bamdata.Segments(chrnam, kind)

Class to retrieve reusable iterators for segments with reads to count. From: Brett Slatkin, Effective Python, Item 17; also on: https://dev.to/v_it_aly/python-tips-how-to-reuse-a-generator-within-one-function-a5o

Needs to be top level and not an inner class in order to be picklable, which is required for multiprocessing when counting reads that map to particular segments.

df1, df2

Pair of dataframes for PLUS and MINUS strands

Type:

DataFrame, DataFrame

Parameters:

chrnam (str) – Chromosome to yield segments for.
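The reusable-iterator idea behind Segments can be sketched as follows. This is a minimal illustration, not the coalispr implementation: the class name ReusableSegments and the plain list of (start, end) tuples are assumptions made for the example, whereas the real class works from a pair of dataframes.

```python
class ReusableSegments:
    """Hypothetical stand-in for a container of (start, end) segments.

    Defined at module level (not nested in a function or class) so that
    instances can be pickled and handed to worker processes.
    """

    def __init__(self, chrnam, segments):
        self.chrnam = chrnam
        self._segments = list(segments)

    def __iter__(self):
        # A fresh generator is created on every call, so the object can be
        # looped over repeatedly, unlike a bare generator that is exhausted
        # after one pass.
        for start, end in self._segments:
            yield self.chrnam, start, end


segs = ReusableSegments("chr1", [(0, 50), (120, 180)])
first_pass = list(segs)
second_pass = list(segs)  # still yields all segments
```

A bare generator function would return an empty list on the second pass; wrapping the data in a class with `__iter__` avoids that.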

class coalispr.bedgraph_analyze.process_bamdata.Bam_collector

Class to keep track of bam files.

bamkeys
nodiscards_keys
bamfiles
tag
bamkeys: list
nodiscards_keys: list
bamfiles: dict
tag: str
classmethod collect_bamfiles(tag=TAGBAM, src_dir=SRCFLDR, ndirlevels=SRCNDIRLEVEL)

Retrieve all bam-file names for counting aligned reads.

These are marked by BAM.

Parameters:
  • tag (str (default: TAGBAM)) – Type of aligned-reads (collapsed or uncollapsed).

  • src_dir (Path (default: SRCFLDR)) – Path to folder with sequencing data incl. bamfiles.

  • ndirlevels (int (default: SRCNDIRLEVEL)) – Number of subdirectories to traverse from SRC directory to get to bamfiles.

Returns:

Dictionary of sample (SHORT) names and paths to associated BAM-files.

Return type:

dict

classmethod get_bamfiles(tag)
classmethod num_counted_libs(plusdiscards=True)

Retrieve the number of counted libraries from the number of counted bam-files.

classmethod keys_counted_libs(plusdiscards=True)

Retrieve keys linked to counted bam-files.
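The folder traversal described for collect_bamfiles (a fixed number of subdirectory levels below the source folder, with files selected by a tag) could look roughly like the sketch below. The function name collect_bam_paths, the folder layout, the "_collapsed" tag and the rule that the first subfolder names the sample are all assumptions for illustration; the actual method derives short sample names from the coalispr configuration.

```python
import tempfile
from pathlib import Path


def collect_bam_paths(src_dir, tag, ndirlevels=2):
    """Map sample folders to tagged bam files `ndirlevels` folders down."""
    # Build a glob such as "*/*/*_collapsed*.bam" for ndirlevels=2.
    pattern = "/".join(["*"] * ndirlevels + [f"*{tag}*.bam"])
    found = {}
    for bam in sorted(Path(src_dir).glob(pattern)):
        # Assumption: the first folder below src_dir names the sample.
        sample = bam.relative_to(src_dir).parts[0]
        found[sample] = bam
    return found


# Small demonstration on a throw-away folder tree.
with tempfile.TemporaryDirectory() as tmp:
    layout = [
        ("a1", "reads_collapsed.bam"),
        ("b2", "reads_collapsed.bam"),
        ("b2", "reads_uncollapsed.bam"),
    ]
    for sample, name in layout:
        folder = Path(tmp) / sample / "aligned"
        folder.mkdir(parents=True, exist_ok=True)
        (folder / name).touch()
    bamfiles = collect_bam_paths(tmp, tag="_collapsed")
    sample_names = sorted(bamfiles)
```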

coalispr.bedgraph_analyze.process_bamdata.total_raw_counts(tagBam, stranded, force, suffix)

Obtain total mapped reads and unmapped reads from alignments.

Parameters:
  • tagBam (str) – Determines type of reads counted, TAGUNCOLL, or TAGCOLL.

  • stranded (bool) – Flag to output total counts for each strand.

  • force (bool) – Flag to redo counting (while backing up previous results).

  • suffix (str) – Suffix, BNY or TSV, to define the output format of count files as binary or tabbed csv.

Returns:

A text file with tab-separated columns giving total input numbers for all experiments.

Return type:

A TSV file

class coalispr.bedgraph_analyze.process_bamdata.Total_counter
samheader_counters: list
unmapped_counters: list
bams: dict
tagBam: str
suffixpath: pathlib.Path
filepath: pathlib.Path
keys: list
stranded: bool
suffix: str
suffuncs
totalfram: pandas.DataFrame
unmap_in_sam: bool
classmethod init_raw_count(tagBam, stranded, force, suffix)

Set up common variables.

init_samheader_counts()

Initiator function, run by each spawned process

get_samheader_counts()

Worker function; get counts from SAM header or from line counting in BAM file.

classmethod process_samheader_counters()

Create dataframe from obtained counts.

init_unmapped_counts()

Initiator function run by each spawned process.

readhere()

Helper function

count_unmapped_lines()

Worker function; count lines in file with unmapped reads (that has been created during alignment by an aligner like STAR).

classmethod external_unmapped_counts()

Obtain counts for unmapped reads put aside in UNMAPPEDFIL, another file than the BAM alignment file.

classmethod process_external_unmapped()

Process counts obtained for UNMAPPEDFIL reads.

classmethod process_totalfram()

Save dataframe with all counts to file.

classmethod get_raw_counts(tagBam, stranded, force, suffix)
coalispr.bedgraph_analyze.process_bamdata.has_been_counted(typeofcount, kind, suffix, folder=None)

Check whether count files have been created.

Parameters:
  • typeofcount (str) – Pattern to find counted files

  • kind (str) – Select kind of reads that have been counted, either SPECIFIC or UNSPECIFIC

  • suffix (str) – Suffix, BNY or TSV, to define the output format of count files as binary or tabbed csv.

  • folder (str) – Default folder name based on kind.

Return type:

boolean to indicate count file is present (True) or not (False)

coalispr.bedgraph_analyze.process_bamdata.has_been_done(apath, force, unspec=False)

Check whether counting or copying of bamfiles has been done before. This function initially helped to protect the user from accidentally redoing a very lengthy count step. When counting unspecific reads, most time now goes into writing bam files, processing these to bedgraphs and then binning and merging the information for use with ‘showgraphs’.

Parameters:
  • apath (Path) – Path to location where count or bam files would be saved.

  • force (bool) – Flag to continue even if the step has been done before.

  • unspec (bool) – Flag to determine whether unspecific unselected reads need to be done.

Returns:

True when count-files are present; False when not.

Return type:

bool
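A guard of this kind can be sketched with pathlib. The function name already_done and its exact behaviour are assumptions for illustration: here a non-empty output folder counts as "done", and a set force flag lets the caller proceed. The real function may differ, for example in how it handles the unspec flag or backs up earlier results.

```python
import tempfile
from pathlib import Path


def already_done(apath, force):
    """Return True when output exists and a redo was not requested."""
    apath = Path(apath)
    has_output = apath.is_dir() and any(apath.iterdir())
    if has_output and not force:
        print(f"Output found in '{apath}'; use force to redo.")
        return True
    return False


# Demonstration on a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = already_done(tmp, force=False)   # nothing there yet
    (Path(tmp) / "counts.tsv").touch()
    skip_result = already_done(tmp, force=False)    # output present
    redo_result = already_done(tmp, force=True)     # redo requested
```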

coalispr.bedgraph_analyze.process_bamdata.get_bam_selection()
class coalispr.bedgraph_analyze.process_bamdata.Read_checker(cigchk, nomis=NRMISM)

Class for doing cigar and other read checks. Cigar items are listed as ‘cigartuples; cigarstring; <meaning>’. The standard items are:

0;M <match>,
1;I <insertion>,
2;D <deletion>,
3;N <skipped>.

The other accepted cigar items,

4;S <soft clip>,
5;H <hard clip>,
6;P <padding>,
7;= <sequence match>,
8;X <sequence mismatch, substitution>,

are indirectly used here. Reads are skipped if the alignment is dubious: for short SE sequences only matches (0;M) and gaps (3;N) are accepted; for reads from UV-treated samples a point-deletion (2;D) is accepted as well.

cigchk

Defines function to use for checking a read.

Type:

str (from CIGARCHK, either CIGPD or CIGFM)

nomis

Indicates number of mismatches (default: NRMISM)

Type:

int

ok_introns
Type:

list

nomis = 0
ok_introns = []
fullmatch(cigtuples)
pointdel(cigtuples)
ok_read(read, intronchk=True)

Check cigar and number of tolerated mismatches for each read.

Parameters:
  • read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.

  • intronchk (bool) – Check for introns and gather their lengths (or not).

Return type:

True or False and sets list of intron-lengths for valid introns

hitidx(hit_idx, beancounter)

Check hit-index for read as defined in alignment file.

ok_strand(read, strand)

Check strand of read for inclusion in counts

Parameters:
  • read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.

  • strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).

Return type:

True or False
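The cigar rules described for Read_checker can be illustrated on plain (operation, length) tuples, the format pysam exposes as AlignedSegment.cigartuples. The sketch below is an assumption-laden illustration rather than the class's code: the bodies of fullmatch and pointdel are guesses based on the description above, "point-deletion" is read as a single deletion of one nucleotide, and intron_lengths is a hypothetical helper.

```python
MATCH, INS, DEL, SKIP = 0, 1, 2, 3  # BAM cigar operation codes


def fullmatch(cigtuples):
    """Accept reads made up of matches (M) and gaps (N) only."""
    if not cigtuples:  # e.g. None for reads without a cigar
        return False
    return all(op in (MATCH, SKIP) for op, _ in cigtuples)


def pointdel(cigtuples):
    """As fullmatch, but tolerate one single-nucleotide deletion (D)."""
    if not cigtuples:
        return False
    deletions = [length for op, length in cigtuples if op == DEL]
    rest = [(op, length) for op, length in cigtuples if op != DEL]
    return fullmatch(rest) and deletions in ([], [1])


def intron_lengths(cigtuples):
    """Collect lengths of skipped regions (N), i.e. candidate introns."""
    return [length for op, length in cigtuples if op == SKIP]


spliced = [(MATCH, 10), (SKIP, 60), (MATCH, 12)]
with_insertion = [(MATCH, 10), (INS, 1), (MATCH, 11)]
with_point_deletion = [(MATCH, 10), (DEL, 1), (MATCH, 11)]
with_long_deletion = [(MATCH, 10), (DEL, 3), (MATCH, 9)]
```

With these definitions a spliced read passes both checks, a read with an insertion fails both, and only pointdel accepts the read with a one-nucleotide deletion.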

class coalispr.bedgraph_analyze.process_bamdata.Bam_countprocessor
bampath: pathlib.Path
bampeak: tuple
bams: dict
bamstart: int
beancollector: coalispr.bedgraph_analyze.bam_keys_collators.Bam_samples_collator
collected_counters: list
comparereads: list
countunselect: bool
bins: int
cols: list
force: bool
gap: int
keys: list
kind: str
region: str
segs: dict
selectbam: pysam.AlignmentFile
smallest2chroms: list
strand: str
suffix: str
suffixpath: pathlib.Path
TEST: bool
writebam: bool
cigchk = 'fullmatch'
maincut = 0.78
nomis = 0
tresh = 5
classmethod run_all_bamcount_processes(bins, kind, writebam, force, cigchk, nomis, suffix)

Steps to obtain counts from all bam files, by sequential processing or multiprocessing. Inner function needed to get run-time counting.

initiate_counting_for_sample()

Initiation function for multiprocessing for defining globals in each spawned process.

process_bamfile()

Worker function in multiprocessing.

classmethod check_bam_file(inbam)
classmethod check_mulmap()
classmethod init_counting(bins, kind, writebam, force, cigchk, nomis, suffix)
classmethod run_region_count_processes(samples, chrnam, region, strand, comparereads, cigchk, nomis, suffix)
coalispr.bedgraph_analyze.process_bamdata.count_bamfiles(bins, kind, writebam, force, cigchk, nomis, suffix)

Obtain count data from available bam-alignment files. When possible, multiprocessing is used; otherwise files are counted sequentially.

Parameters:
  • bins (int) – Number of bins counts are split over (default BINS)

  • kind (str) – Kind of read, SPECIFIC or UNSPECIFIC

  • writebam (bool) – Save UNSELECTED UNSPECIFIC reads to separate bam-alignment files for inclusion as a separate group of reads to display by showgraphs.

  • force (bool) – Allow recounting samples, backup existing count files.

  • cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.

  • nomis (int) – Number of tolerated substitutions, default: NRMISM.

  • suffix (str) – Sets output file extension, BNY or TSV.
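The choice between multiprocessing and sequential counting can be sketched with the standard library. The names count_all and count_lines are hypothetical, and the toy worker counts text lines instead of reads in a bam file; how coalispr decides whether multiprocessing is possible is not shown here and the processes argument is an assumption of this sketch.

```python
import multiprocessing as mp


def count_lines(text):
    """Toy worker: stands in for counting the reads of one bam file."""
    return len(text.splitlines())


def count_all(items, worker, processes=1):
    """Apply worker to every item, in parallel when asked for."""
    if processes > 1 and len(items) > 1:
        # The worker must be a top-level function so it can be pickled.
        with mp.Pool(processes=processes) as pool:
            return pool.map(worker, items)
    # Fallback: plain sequential counting.
    return [worker(item) for item in items]


samples = ["r1\nr2\nr3", "r1", "r1\nr2"]
counts = count_all(samples, count_lines)  # sequential path
```

Calling `count_all(samples, count_lines, processes=2)` would take the pool branch; in a script that call belongs under an `if __name__ == "__main__":` guard so that spawned processes do not re-run it on import.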

coalispr.bedgraph_analyze.process_bamdata.count_region(samples, chrnam, region, strand, comparereads, cigchk, nomis, suffix)

Obtain read-length data for a particular region on chromosome chrnam for given samples.

Parameters:
  • samples (list) – List of short names to retrieve bamfiles with alignment data for.

  • chrnam (str) – Name of chromosome to retrieve region from.

  • region (tuple) – Tuple with coordinates for chromosomal region to retrieve counts for.

  • strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).

  • comparereads (list) – List of reads to count for comparison.

  • cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.

  • nomis (int) – Number of tolerated substitutions, default: NRMISM.

  • suffix (str) – Sets output file extension, BNY or TSV.

Returns:

output – For making graphs, a pandas.DataFrame with read counts and a list of dataframes with counts for read lengths.

Return type:

tuple