coalispr.bedgraph_analyze.process_bamdata

Module to count reads in bam files based on specification of aligned reads.

Attributes

Classes

Bam_collector

Class to keep track of bam files.

Total_counter

Read_checker

Class to store outcome of cigar and other read checks.

Bam_countprocessor

Bedgraphs_from_unselected

Class to create bedgraphs from new bam-files with unselected reads.

Functions

total_raw_counts(tagBam, stranded[, force])

Obtain total mapped reads and unmapped reads from alignments.

count_folder(kind, bam, segments, overmax, maincut, ...)

Return folder with stored count files.

has_been_counted(typeofcount, kind[, folder])

Check whether count files have been created.

process_bam_files(bins, kind, writebam, force, cigchk, ...)

Obtain count data from available bam-alignment files. When possible, uses multi-processing, otherwise sequential counting of files.

process_reads_for_region(samples, chrnam, region, ...)

Obtain read-length data for a particular region on chromosome chrnam for given samples.

Module Contents

coalispr.bedgraph_analyze.process_bamdata.logger
class coalispr.bedgraph_analyze.process_bamdata.Bam_collector

Class to keep track of bam files.

bamkeys
nodiscards_keys
bamfiles
tag
bamkeys: list
nodiscards_keys: list
bamfiles: dict
tag: str
classmethod collect_bamfiles(tag=TAGBAM, src_dir=SRCDIR, ndirlevels=SRCNDIRLEVEL)

Retrieve all bam-file names for counting aligned reads.

These are marked by SAMBAM.

Parameters:
  • tag (str (default: TAGBAM)) – Type of aligned-reads (collapsed or uncollapsed).

  • src_dir (Path (default: SRCDIR)) – Path to folder with sequencing data incl. bamfiles.

  • ndirlevels (int (default: SRCNDIRLEVEL)) – Number of subdirectories to traverse from SRC directory to get to bamfiles.

Returns:

Dictionary of sample (SHORT) names and paths to associated SAMBAM-files.

Return type:

dict
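
As an illustration of the lookup described above, here is a minimal sketch; it is not coalispr's implementation. It assumes that each sample has its own folder below the source directory, that bam files sit `ndirlevels` folders deep, and that a file qualifies when its name contains the tag and ends in `.bam`. The helper name `find_bamfiles`, the `suffix` parameter and this naming rule are hypothetical.

```python
from pathlib import Path


def find_bamfiles(src_dir, tag, ndirlevels=1, suffix=".bam"):
    """Map sample-folder names to bam files found `ndirlevels` below src_dir.

    Hypothetical helper: the folder layout and the naming rule
    (file name contains `tag` and ends with `suffix`) are assumptions.
    """
    src_dir = Path(src_dir)
    # Build a glob pattern such as "*/*/*<tag>*.bam" for ndirlevels == 2.
    pattern = "/".join(["*"] * ndirlevels + [f"*{tag}*{suffix}"])
    found = {}
    for path in sorted(src_dir.glob(pattern)):
        # Use the first folder below src_dir as the sample name.
        sample = path.relative_to(src_dir).parts[0]
        # Keep the first match per sample.
        found.setdefault(sample, path)
    return found
```

The result has the same shape as the documented return value: a dictionary from short sample names to paths.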

classmethod get_bamfiles(tag)
classmethod num_counted_libs(plusdiscards=True)

Retrieve the number of counted libraries from the number of counted bam-files.

classmethod keys_counted_libs(plusdiscards=True)

Retrieve keys linked to counted bam-files.

coalispr.bedgraph_analyze.process_bamdata.total_raw_counts(tagBam, stranded, force=False)

Obtain total mapped reads and unmapped reads from alignments.

Parameters:
  • tagBam (str) – Determines type of reads counted, either TAGUNCOLL or TAGCOLL.

  • stranded (bool) – Flag to output total counts for each strand.

  • force (bool) – Flag to redo counting (while backing up previous results).

Returns:

A text file with tab-separated columns giving total input numbers for all experiments.

Return type:

A TSV file

class coalispr.bedgraph_analyze.process_bamdata.Total_counter
samheader_counters: list
unmapped_counters: list
bams: dict
tagBam: str
tsvpath: pathlib.Path
filepath: pathlib.Path
keys: list
unmap_in_sam: bool
suffuncs
totalfram: pandas.DataFrame
classmethod init_raw_count(tagBam, stranded, force)

Set up common variables.

classmethod get_samheader_counts(key)

Get counts from SAM header or from line counting in BAM file.

classmethod process_samheader_counts()

Create dataframe from obtained counts.

classmethod count_unmapped_lines(key)

Count lines in the file with unmapped reads that were set aside during alignment.

classmethod external_unmapped_counts()

Obtain counts for unmapped reads set aside in UNMAPPEDFIL, a file separate from the BAM alignment file.

classmethod process_external_unmapped()

Process counts obtained for UNMAPPEDFIL reads.

classmethod process_totalfram()

Save dataframe with all counts to file.

classmethod get_raw_counts(tagBam=None, stranded=False, force=False)
coalispr.bedgraph_analyze.process_bamdata.count_folder(kind, bam, segments, overmax, maincut, usegaps)

Return folder with stored count files.

coalispr.bedgraph_analyze.process_bamdata.has_been_counted(typeofcount, kind, folder=None)

Check whether count files have been created.

Parameters:
  • typeofcount (str) – Pattern to find specific files

  • kind (str) – Select kind of reads that have been counted, either SPECIFIC or UNSPECIFIC

Return type:

boolean to indicate count file is present (True) or not (False)
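
A check of this kind could be sketched as follows. This is an illustration only: the helper name `counts_present`, the `.tsv` suffix and the rule that file names consist of underscore-separated parts including both `typeofcount` and `kind` are assumptions, not taken from coalispr.

```python
from pathlib import Path


def counts_present(folder, typeofcount, kind, suffix=".tsv"):
    """Return True if a count file for `kind` matching `typeofcount` exists.

    Hypothetical naming rule: names look like "<typeofcount>_<kind>.tsv".
    """
    folder = Path(folder)
    if not folder.is_dir():
        return False
    for path in folder.glob(f"*{typeofcount}*{suffix}"):
        # Compare whole name parts, so "specific" does not match "unspecific".
        if kind in path.stem.split("_"):
            return True
    return False
```

Comparing whole name parts rather than substrings matters here because one kind of read ("specific") is a substring of the other ("unspecific").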

class coalispr.bedgraph_analyze.process_bamdata.Read_checker

Class to store outcome of cigar and other read checks.

unfit_read
Type:

function

nomis

Indicates number of mismatches (default: NRMISM)

Type:

int

okintrons
Type:

list

unfit_read = None
nomis: int
okintrons: list
classmethod make_cigarcheck(cigchk, nomis=NRMISM)

Take cigar items as marked by ‘cigartuples; cigarstring; <meaning>’:

0;M <match>,
1;I <insertion>,
2;D <deletion>,
3;N <skipped>,

which are standard. The other accepted cigar items:

4;S <soft clip>,
5;H <hard clip>,
6;P <padding>,
7;= <sequence match>,
8;X <sequence mismatch, substitution>

are indirectly used here. Skip a read if its alignment is dubious: for short single-end (SE) sequences, accept only matches (0;M) and gaps (3;N); for reads from UV-treated samples, accept a point-deletion (2;D).

Parameters:
  • cigchk (str (from CIGARCHK, either CIGPD or CIGFM)) – Defines function to use for checking a read.

  • nomis (int (default: NRMISM)) – Number of tolerated mismatches.
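
The cigar rules above can be illustrated with a small sketch that works on `(operation, length)` pairs as found in pysam's `cigartuples`. It is not the function coalispr builds: the name `make_unfit_check` and the `allow_point_deletion` flag are hypothetical stand-ins for the choice made through `cigchk`, and treating a point-deletion as a deletion of exactly one nucleotide is an assumption.

```python
# Operation codes as used in pysam's `cigartuples`.
MATCH, INSERTION, DELETION, SKIPPED = 0, 1, 2, 3


def make_unfit_check(allow_point_deletion=False):
    """Return a function that flags reads with an unacceptable cigar.

    Hypothetical stand-in for the check selected by `cigchk`.
    """
    allowed = {MATCH, SKIPPED}

    def unfit(cigartuples):
        for op, length in cigartuples:
            if op in allowed:
                continue
            # Assumption: a point-deletion spans a single nucleotide.
            if allow_point_deletion and op == DELETION and length == 1:
                continue
            return True
        return False

    return unfit


fullmatch = make_unfit_check()
pointdel = make_unfit_check(allow_point_deletion=True)
```

Here `fullmatch([(0, 10), (3, 55), (0, 11)])` returns `False` (a spliced read made of matches and a gap is kept), while `fullmatch([(0, 10), (2, 1), (0, 11)])` returns `True` and the same tuples pass `pointdel`.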

classmethod ok_read(read, intronchk=True)

Check cigar and number of tolerated mismatches for each read.

Parameters:
  • read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.

  • intronchk (bool) – Check for introns and gather their lengths (or not).

Return type:

True or False and sets list of intron-lengths for valid introns

classmethod hitidx(hit_idx, beancounter)

Check hit-index for read as defined in alignment file.

classmethod ok_strand(read, strand)

Check strand of read for inclusion in counts.

Parameters:
  • read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.

  • strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).

Return type:

True or False
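
A strand filter of this kind might look like the sketch below. The constant values are placeholders (coalispr defines its own COMBI, PLUS and MINUS), the function name `strand_matches` is hypothetical, and the mapping of forward reads to PLUS is an assumption that depends on the library protocol. The only attribute used is `is_reverse`, which `pysam.AlignedSegment` provides; simple stand-in objects are used here so the example runs without pysam.

```python
from types import SimpleNamespace

# Placeholder values; coalispr defines its own COMBI, PLUS and MINUS constants.
COMBI, PLUS, MINUS = "combined", "plus", "minus"


def strand_matches(read, strand):
    """Return True if the read lies on the requested strand."""
    if strand == COMBI:
        return True  # reads on both strands are accepted
    if strand == PLUS:
        return not read.is_reverse
    if strand == MINUS:
        return read.is_reverse
    raise ValueError(f"unknown strand: {strand!r}")


# Stand-ins that mimic the `is_reverse` attribute of pysam.AlignedSegment.
forward_read = SimpleNamespace(is_reverse=False)
reverse_read = SimpleNamespace(is_reverse=True)
```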

class coalispr.bedgraph_analyze.process_bamdata.Bam_countprocessor
bampath: pathlib.Path
bampeak: tuple
bams: dict
bamstart: int
beancollector: coalispr.bedgraph_analyze.bam_keys_collators.Bam_samples_collator
collected_counters: list
comparereads: list
countselect: bool
bins: int
cols: list
force: bool
gap: int
has_chrxtra: bool
keys: list
kind: str
region: str
segs: dict
selectbam: pysam.AlignmentFile
strand: str
TEST: bool
tsvpath: pathlib.Path
writebam: bool
cigchk = 'fullmatch'
maincut = 1
nomis = 0
tresh = 5
classmethod run_all_bamcount_processes(bins, kind, writebam, force, cigchk, nomis)

Steps to obtain counts from all bam files, by sequential or multi-processing. Inner function needed to get run-time counting.

classmethod process_bamfile(key)

Multiprocess this part.

classmethod check_bam_file(inbam)
classmethod check_mulmap()
classmethod init_counting(bins, kind, writebam, force, cigchk, nomis)
classmethod run_region_count_processes(samples, chrnam, region, strand, comparereads, cigchk, nomis)
coalispr.bedgraph_analyze.process_bamdata.process_bam_files(bins, kind, writebam, force, cigchk, nomis)

Obtain count data from available bam-alignment files. When possible, uses multi-processing, otherwise sequential counting of files.

Parameters:
  • bins (int) – Number of bins over which counts are split (default: BINS).

  • kind (str) – Kind of read, SPECIFIC or UNSPECIFIC

  • writebam (bool) – Save UNSELECTED UNSPECIFIC reads in bam-alignment files for inclusion in showgraphs as a separate group of reads.

  • force (bool) – Allow recounting samples, backup existing count files.

  • cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.

  • nomis (int) – Number of tolerated substitutions, default: NRMISM.
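
The choice between multi-processing and sequential counting can be sketched with the standard library. This is a generic pattern, not coalispr's code: the function name `count_all`, the `worker` callable and the rule "use a pool when more than one worker and more than one file are available" are assumptions made for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def count_all(keys, worker, max_workers=None):
    """Apply `worker` to every key, in parallel when that is possible.

    Hypothetical helper; `worker` stands for a per-bam-file counting step.
    """
    workers = max_workers or os.cpu_count() or 1
    if workers > 1 and len(keys) > 1:
        # One task per key; results come back in the order of `keys`.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return dict(zip(keys, pool.map(worker, keys)))
    # Fall back to counting the files one after another.
    return {key: worker(key) for key in keys}
```

For the parallel branch, `worker` has to be a picklable, module-level function, and on platforms that spawn new processes the call should sit under an `if __name__ == "__main__":` guard.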

coalispr.bedgraph_analyze.process_bamdata.process_reads_for_region(samples, chrnam, region, strand, comparereads, cigchk, nomis)

Obtain read-length data for a particular region on chromosome chrnam for given samples.

Parameters:
  • samples (list) – List of short names to retrieve bamfiles with alignment data for.

  • chrnam (str) – Name of chromosome to retrieve region from.

  • region (tuple) – Tuple with coordinates for chromosomal region to retrieve counts for.

  • strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).

  • comparereads (list) – List of reads to count for comparison.

  • cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.

  • nomis (int) – Number of tolerated substitutions, default: NRMISM.

Returns:

output – For making graphs, a pandas.DataFrame with read counts and a list of dataframes with counts for read lengths.

Return type:

tuple

class coalispr.bedgraph_analyze.process_bamdata.Bedgraphs_from_unselected

Class to create bedgraphs from new bam-files with unselected reads.

Attributes:

bampath: Path

Path to input bam-file.

classmethod sortindex(bampath)
classmethod bedgraphs_from_xtra_bamdata(bampath)

Create bedgraph files from selected bamdata.

During specification of reads, target-like RNAs (like siRNAs) may be thrown out because of overlap with more abundant reads in the negative controls. Based on a telling determinant (like start-nucleotide and length range for siRNA) such target-like reads can be retrieved during counting of unspecified reads and copied to new bam files. Here, these reads are extracted and processed. The fraction of such target-like reads may come along by chance, and therefore could represent false positives, especially when these do not stand out for positive controls or change by interfering mutations/conditions.

Bam files need to be sorted and indexed before they can be converted to bedgraphs.
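
As an illustration of the sort-and-index step, the sketch below builds the two `samtools` commands and runs them with `subprocess`. Whether coalispr calls samtools this way is not stated here; the helper names and the `.sorted.bam` output name are hypothetical, and `samtools` is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def sort_index_commands(bampath):
    """Return the output path and the samtools commands to sort and index."""
    bampath = Path(bampath)
    # Hypothetical naming: write the sorted file next to the input.
    sorted_bam = bampath.with_name(f"{bampath.stem}.sorted.bam")
    commands = [
        ["samtools", "sort", "-o", str(sorted_bam), str(bampath)],
        ["samtools", "index", str(sorted_bam)],
    ]
    return sorted_bam, commands


def sort_and_index(bampath):
    """Run both commands in order; raises CalledProcessError on failure."""
    sorted_bam, commands = sort_index_commands(bampath)
    for command in commands:
        subprocess.run(command, check=True)
    return sorted_bam
```

Sorting has to come first, because `samtools index` expects a coordinate-sorted file.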