coalispr.bedgraph_analyze.process_bamdata

Module to count bam files based on specification of aligned reads.

Attributes

Classes

Bam_collector

Class to keep track of bam files.

Read_checker

Class to store outcome of cigar and other read checks.

Bam_countprocessor

Functions

total_raw_counts([tagBam, stranded, force])

Obtain total mapped reads and unmapped reads from alignments.

count_folder(kind, bam, segments, overmax, maincut, ...)

Return the folder with stored count files.

has_been_counted([typeofcount, kind])

Check whether count files have been created.

process_bam_files(bins, kind, writebam, force, cigchk, ...)

Obtain count data from available bam-alignment files. When possible, uses multi-processing, otherwise sequential counting of files.

process_reads_for_region(samples, chrnam, region, ...)

Obtain read-length data for a particular region on chromosome chrnam for given samples.

bedgraphs_from_xtra_bamdata(bampath[, force])

Create bedgraph files from selected bamdata.

Module Contents

coalispr.bedgraph_analyze.process_bamdata.logger
class coalispr.bedgraph_analyze.process_bamdata.Bam_collector

Class to keep track of bam files.

bamkeys: list
nodiscards_keys: list
bamfiles: dict
tag: str
classmethod collect_bamfiles(tag=TAGBAM, src_dir=SRCDIR, ndirlevels=SRCNDIRLEVEL)

Retrieve all bam-file names for counting aligned reads.

These are marked by SAMBAM.

Parameters:
  • tag (str (default: TAGBAM)) – Type of aligned reads (collapsed or uncollapsed).

  • src_dir (Path (default: SRCDIR)) – Path to folder with sequencing data incl. bamfiles.

  • ndirlevels (int (default: SRCNDIRLEVEL)) – Number of subdirectories to traverse from SRC directory to get to bamfiles.

Returns:

Dictionary of sample (SHORT) names and paths to associated SAMBAM-files.

Return type:

dict
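As an illustration of the directory traversal described above, here is a minimal sketch. It is not the module's implementation: the helper name `find_bamfiles`, the use of the top-level subfolder as sample name, and the convention that the bam file name contains the tag are all assumptions made for the example.

```python
from pathlib import Path


def find_bamfiles(src_dir, tag, ndirlevels=1):
    """Map sample folder names to bam files found ndirlevels below src_dir.

    Hypothetical helper: assumes each sample has its own subfolder of
    src_dir (so ndirlevels >= 1) and that the wanted bam file name
    contains `tag`.
    """
    src_dir = Path(src_dir)
    # One "*" per directory level, then the bam file itself.
    pattern = "/".join(["*"] * ndirlevels + [f"*{tag}*.bam"])
    found = {}
    for bam in sorted(src_dir.glob(pattern)):
        # The first path component below src_dir serves as the sample name.
        sample = bam.relative_to(src_dir).parts[0]
        found[sample] = bam
    return found
```

With bam files stored two folders below the source directory, `find_bamfiles(src, "collapsed", ndirlevels=2)` would return a dictionary keyed by the sample folder names.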

classmethod get_bamfiles(tag=TAGBAM)
classmethod num_counted_libs(plusdiscards=True)

Retrieve the number of counted libraries from the number of counted bam files.

classmethod keys_counted_libs(plusdiscards=True)

Retrieve keys linked to counted bam-files.

coalispr.bedgraph_analyze.process_bamdata.total_raw_counts(tagBam=None, stranded=False, force=False)

Obtain total mapped reads and unmapped reads from alignments.

Returns:

A text file with tab-separated columns giving total input numbers for all experiments.

Return type:

A TSV file

coalispr.bedgraph_analyze.process_bamdata.count_folder(kind, bam, segments, overmax, maincut, usegaps)

Return the folder with stored count files.

coalispr.bedgraph_analyze.process_bamdata.has_been_counted(typeofcount='', kind=SPECIFIC)

Check whether count files have been created.

Parameters:
  • typeofcount (str) – Pattern to find specific files

  • kind (str) – Select kind of reads that have been counted, either SPECIFIC or UNSPECIFIC

Return type:

boolean to indicate count file is present (True) or not (False)
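A check of this kind can be sketched as follows. The function name `counts_present`, the explicit `folder` argument and the `.tsv` extension are assumptions for illustration; the actual function derives the folder from `kind` and the program settings.

```python
from pathlib import Path


def counts_present(folder, typeofcount=""):
    """Return True if folder holds a .tsv file whose name contains typeofcount.

    Hypothetical stand-in; an empty typeofcount matches any .tsv file.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return False
    return any(
        path.suffix == ".tsv" and typeofcount in path.name
        for path in folder.iterdir()
    )
```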

class coalispr.bedgraph_analyze.process_bamdata.Read_checker

Class to store outcome of cigar and other read checks.

unfit_read
Type:

function

nomis

Indicates number of mismatches (default: NRMISM)

Type:

int

okintrons
Type:

list

unfit_read = None
nomis: int
okintrons: list
classmethod make_cigarcheck(cigchk, nomis=NRMISM)

Take cigar items as marked by ‘cigartuples; cigarstring; <meaning>’:

0;M <match>,
1;I <insertion>,
2;D <deletion>,
3;N <skipped>,

which are standard. The other accepted cigar items:

4;S <soft clip>,
5;H <hard clip>,
6;P <padding>,
7;= <sequence match>,
8;X <sequence mismatch, substitution>

are indirectly used here. Skip a read if its alignment is dubious: for short single-end (SE) sequences accept only matches (0;M) and gaps (3;N); for reads from UV-treated samples also accept a point deletion (2;D).

Parameters:
  • cigchk (str (from CIGARCHK, either CIGPD or CIGFM)) – Defines function to use for checking a read.

  • nomis (int (default: NRMISM)) – Number of tolerated mismatches.
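The selection of a checking function can be sketched on plain cigar tuples, i.e. lists of `(operation, length)` pairs as provided by pysam. This is an illustrative sketch only: the function names, the key `"pointdel"` and the rule of tolerating at most one 1-nt deletion are assumptions; only `'fullmatch'` appears as a value in this module.

```python
MATCH, INSERTION, DELETION, SKIPPED = 0, 1, 2, 3  # BAM cigar operation codes


def unfit_fullmatch(cigartuples):
    """True if the read has anything other than matches (M) and gaps (N)."""
    return any(op not in (MATCH, SKIPPED) for op, _ in cigartuples)


def unfit_pointdel(cigartuples):
    """True unless the read has only M and N plus at most one 1-nt deletion."""
    deletions = [length for op, length in cigartuples if op == DELETION]
    others_ok = all(op in (MATCH, SKIPPED, DELETION) for op, _ in cigartuples)
    one_point_del = len(deletions) <= 1 and all(n == 1 for n in deletions)
    return not (others_ok and one_point_del)


def make_check(cigchk):
    """Pick the checking function by name (the names are hypothetical)."""
    return {"fullmatch": unfit_fullmatch, "pointdel": unfit_pointdel}[cigchk]
```

Returning a function once, rather than testing the setting for every read, keeps the per-read check cheap when millions of alignments are scanned.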

classmethod ok_read(read, intronchk=True)

Check cigar and number of tolerated mismatches for each read.

Parameters:
  • read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.

  • intronchk (bool) – Check for introns and gather their lengths (or not).

Return type:

True or False and sets list of intron-lengths for valid introns
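The combination of a mismatch limit and intron gathering might look like the sketch below. It is a simplified stand-in, not the class method: it takes the cigar tuples and a mismatch count directly instead of a pysam.AlignedSegment, and it does not attempt to judge which introns are valid.

```python
def check_read(cigartuples, mismatches, nomis=0, intronchk=True):
    """Return (ok, intron_lengths) for one aligned read.

    Hypothetical helper: `mismatches` is assumed to have been read from
    the alignment beforehand.
    """
    if mismatches > nomis:
        return False, []
    introns = []
    if intronchk:
        # Cigar operation 3 (N) marks a skipped region, i.e. a gap or intron.
        introns = [length for op, length in cigartuples if op == 3]
    return True, introns
```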

classmethod ok_strand(read, strand)

Check strand of read for inclusion in counts.

Parameters:
  • read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.

  • strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).

Return type:

True or False
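A strand filter of this kind reduces to a comparison with the read's orientation, which pysam exposes as `read.is_reverse`. In the sketch below the constant values and the function name are placeholders; the real COMBI, PLUS and MINUS values come from the program's configuration.

```python
COMBI, PLUS, MINUS = "combined", "plus", "minus"  # placeholder values


def strand_ok(is_reverse, strand):
    """Decide whether a read with the given orientation should be counted."""
    if strand == COMBI:
        return True  # both strands are accepted
    if strand == PLUS:
        return not is_reverse
    if strand == MINUS:
        return is_reverse
    raise ValueError(f"unknown strand: {strand!r}")
```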

class coalispr.bedgraph_analyze.process_bamdata.Bam_countprocessor
bampath: pathlib.Path
bampeak: tuple
bams: dict
bamstart: int
beancollector: coalispr.bedgraph_analyze.bam_keys_collators.Bam_samples_collator
collected_counters: list
comparereads: list
bins: int
cols: list
force: bool
gap: int
keys: list
kind: str
region: str
segs: dict
selectbam: pysam.AlignmentFile
strand: str
tagBam: str
tagSeg: str
TEST: bool
tsvpath: pathlib.Path
writebam: bool
cigchk = 'fullmatch'
maincut = 0.78
nomis = 0
tresh = 5
classmethod run_all_bamcount_processes(bins, kind, writebam, force, cigchk, nomis)

Steps to obtain counts from all bam files, by sequential or multi-processing.

classmethod process_bamfile(key)
classmethod run_region_count_processes(samples, chrnam, region, strand, comparereads, cigchk, nomis)
classmethod check_bam_file(inbam)
classmethod init_counting(bins, kind, writebam, force, cigchk, nomis)
coalispr.bedgraph_analyze.process_bamdata.process_bam_files(bins, kind, writebam, force, cigchk, nomis)

Obtain count data from available bam-alignment files. When possible, uses multi-processing, otherwise sequential counting of files.

Parameters:
  • bins (int) – Number of bins that counts are split over (default: BINS)

  • kind (str) – Kind of read, SPECIFIC or UNSPECIFIC

  • writebam (bool) – Save UNSELECTED UNSPECIFIC reads in bam-alignment files for inclusion in showgraphs as a separate group of reads.

  • force (bool) – Allow recounting samples, backup existing count files.

  • cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.

  • nomis (int) – Number of tolerated substitutions, default: NRMISM.
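The choice between multi-processing and sequential counting can be sketched with the standard library. This is a generic pattern under stated assumptions, not the module's code: `count_one` is a placeholder for the per-file work and `workers` is a hypothetical parameter.

```python
from concurrent.futures import ProcessPoolExecutor


def count_one(key):
    """Placeholder for counting one bam file; returns (key, count)."""
    return key, len(key)


def count_all(keys, workers=1):
    """Count every key, in parallel when more than one worker is allowed."""
    keys = list(keys)
    if workers > 1 and len(keys) > 1:
        # Each key is handled in a separate process; results keep input order.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(count_one, keys))
    # Fallback: count the files one after another in this process.
    return dict(count_one(key) for key in keys)
```

On platforms that start worker processes by re-importing the main script, a call with `workers > 1` should sit under an `if __name__ == "__main__":` guard.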

coalispr.bedgraph_analyze.process_bamdata.process_reads_for_region(samples, chrnam, region, strand, comparereads, cigchk, nomis)

Obtain read-length data for a particular region on chromosome chrnam for given samples.

Parameters:
  • samples (list) – List of short names to retrieve bamfiles with alignment data for.

  • chrnam (str) – Name of chromosome to retrieve region from.

  • region (tuple) – Tuple with coordinates for chromosomal region to retrieve counts for.

  • strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).

  • comparereads (list) – List of reads to count for comparison.

  • cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.

  • nomis (int) – Number of tolerated substitutions, default: NRMISM.

Returns:

output – For making graphs, a pandas.DataFrame with read counts and a list of dataframes with counts for read lengths.

Return type:

tuple
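The read-length tally for a region can be illustrated without alignment files. In this sketch reads are plain `(start, end)` tuples with half-open coordinates, and only reads lying fully inside the region are counted; both choices, like the function name, are assumptions for the example.

```python
from collections import Counter


def length_counts(reads, region):
    """Count read lengths for reads lying fully inside region.

    reads: iterable of (start, end) tuples (hypothetical stand-in for
    aligned reads); region: (start, end) tuple.
    """
    lo, hi = region
    return Counter(
        end - start for start, end in reads if lo <= start and end <= hi
    )
```

A `Counter` like this converts directly into one column of a table with read lengths as index.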

coalispr.bedgraph_analyze.process_bamdata.bedgraphs_from_xtra_bamdata(bampath, force=False)

Create bedgraph files from selected bamdata.

During specification of reads, genuine siRNAs can be thrown out due to overlap with unspecific reads even if these would not be siRNAs. Thus, based on start-nucleotide and length range, siRNAs can be retrieved during counting of unspecified reads and copied to new bam files. Here, extract and process these reads.

Bam files need to be sorted and indexed before they can be converted to bedgraphs.
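Sorting and indexing can be done with the samtools command line, as in the sketch below. Whether this module calls samtools or uses another route is not stated here, so treat the approach, the helper name and the `.sorted.bam` naming as assumptions.

```python
from pathlib import Path


def sort_index_commands(bam):
    """Build the samtools commands to sort a bam file and index the result."""
    bam = Path(bam)
    sorted_bam = bam.with_name(bam.stem + ".sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],
        ["samtools", "index", str(sorted_bam)],
    ]
```

Each command list can be passed to `subprocess.run(cmd, check=True)`, provided samtools is installed.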