coalispr.bedgraph_analyze.process_bamdata¶

Module to count aligned reads in bam files, based on the specification of those reads.

Attributes¶

logger

Functions¶

`collect_bamfiles`([tag, src_dir, ndirlevels])	Retrieve all bam-file names for counting aligned reads.
`num_counted_libs`([plusdiscards])	Retrieve the number of counted libraries from the number of counted bam-files.
`keys_counted_libs`([plusdiscards])	Retrieve keys linked to counted bam-files.
`total_raw_counts`([tagBam, stranded, force])	Obtain total mapped reads and unmapped reads from alignments.
`count_folder`(kind, bam, segments, overmax, maincut, ...)	Return folder with stored count files.
`has_been_counted`([typeofcount, kind])	Check whether count files have been created.
`process_bamfiles`([tagSeg, tagBam, bins, tresh, ...])	Extract reads from a bamfile using selected_regions and count them.
`process_reads_for_region`(samples, chrnam, region, ...)	Obtain read-length data for a particular region on chromosome chrnam for given samples.
`bedgraphs_from_xtra_bamdata`(bampath[, force])	Create bedgraph files from selected bamdata.

Module Contents¶

coalispr.bedgraph_analyze.process_bamdata.logger¶

coalispr.bedgraph_analyze.process_bamdata.collect_bamfiles(tag=TAGBAM, src_dir=SRCDIR, ndirlevels=SRCNDIRLEVEL)¶

Retrieve all bam-file names for counting aligned reads.

These are marked by SAMBAM.

Parameters:

tag (str (default: TAGBAM)) – Sort of aligned-reads (collapsed or uncollapsed).
src_dir (Path (default: SRCDIR)) – Path to folder with sequencing data incl. bamfiles.
ndirlevels (int (default: SRCNDIRLEVEL)) – Number of subdirectories to traverse from SRC directory to get to bamfiles.

Returns:

Dictionary of sample (SHORT) names and paths to associated SAMBAM-files.

Return type:

dict
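As an illustration of how a fixed number of directory levels can lead from a source folder to the bam files, here is a minimal sketch. It is not coalispr's implementation: the function name `find_bamfiles`, the `.bam` suffix check, and the use of the top-level folder name as the sample's short name are all assumptions made for this example.

```python
from pathlib import Path


def find_bamfiles(src_dir, ndirlevels=2, suffix=".bam"):
    """Map folder names to bam files found exactly `ndirlevels` below `src_dir`.

    Hypothetical helper, not part of coalispr.
    """
    src_dir = Path(src_dir)
    # Build a glob pattern such as "*/*/*.bam" for ndirlevels=2.
    pattern = "/".join(["*"] * ndirlevels + [f"*{suffix}"])
    found = {}
    for bam in sorted(src_dir.glob(pattern)):
        # Assumption: the first folder below src_dir stands in for the
        # sample's short name. A later file for the same folder replaces
        # an earlier one.
        sample = bam.relative_to(src_dir).parts[0]
        found[sample] = bam
    return found
```

Calling `find_bamfiles("/path/to/data", ndirlevels=2)` would return a dictionary of folder names and `Path` objects, comparable in shape to the dictionary described above.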

coalispr.bedgraph_analyze.process_bamdata.num_counted_libs(plusdiscards=True)¶

Retrieve the number of counted libraries from the number of counted bam-files.

coalispr.bedgraph_analyze.process_bamdata.keys_counted_libs(plusdiscards=True)¶

Retrieve keys linked to counted bam-files.

coalispr.bedgraph_analyze.process_bamdata.total_raw_counts(tagBam=None, stranded=False, force=False)¶

Obtain total mapped reads and unmapped reads from alignments.

Returns:

A text file with tab-separated columns giving total input numbers for all experiments.

Return type:

A TSV file

coalispr.bedgraph_analyze.process_bamdata.count_folder(kind, bam, segments, overmax, maincut, usegaps)¶

Return folder with stored count files.

coalispr.bedgraph_analyze.process_bamdata.has_been_counted(typeofcount='', kind=SPECIFIC)¶

Check whether count files have been created.

Parameters:

typeofcount (str) – Pattern to find specific files.
kind (str) – Select the kind of reads that have been counted, either SPECIFIC or UNSPECIFIC.

Return type:

Boolean indicating whether a count file is present (True) or not (False).
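A check of this kind can be sketched with the standard library alone. The helper below is hypothetical and only illustrates the idea of matching a pattern against stored count files; the name `count_files_present`, the `.tsv` extension, and the flat folder layout are assumptions, and how coalispr derives the folder from `kind` is not shown.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def count_files_present(folder, typeofcount=""):
    """Return True when a TSV file in `folder` matches `typeofcount`.

    Hypothetical helper, not part of coalispr.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return False
    # An empty pattern matches any TSV file in the folder.
    pattern = f"*{typeofcount}*.tsv"
    return any(fnmatchcase(path.name, pattern) for path in folder.iterdir())
```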

coalispr.bedgraph_analyze.process_bamdata.process_bamfiles(tagSeg=TAGSEG, tagBam=TAGBAM, bins=BINS, tresh=LOG2BG, maincut=UNSPECLOG10, kind=SPECIFIC, writebam=False, force=False, test=False, cigchk=CIGARCHK, nomis=NRMISM)¶

Extract reads from a bamfile using selected_regions and count them.

Allows counting of bam-alignment files obtained for TAGCOLL reads with segments found for TAGUNCOLL reads.

Parameters:

tagSeg (str (default: TAGSEG)) – Sort of aligned-reads (TAGCOLL or TAGUNCOLL) used for generating segment files.
tagBam (str (default: TAGBAM)) – Sort of aligned-reads (TAGCOLL or TAGUNCOLL) used for generating alignment files.
bins (int (default: BINS)) – The number of sub-segments of equal length into which a contiguous segment with reads is partitioned for counting. This is done to assess the coverage/density of reads depending on their location in the main segment.
tresh (int (default: LOG2BG)) – Threshold level (2^tresh) of accepted background.
maincut (float (default: UNSPECLOG10)) – Fold difference (10^maincut) between bedgraph values of reads called ‘specific’ vs. ‘unspecific’ when these overlap.
kind (str (default: SPECIFIC)) – Kind of specified aligned reads to be counted.
writebam (bool (default: False)) – Whether to copy siRNA-like alignments to separate bamfiles. Can be set to True when counting unspecific reads, to see how many reads that fit the criteria of genuine siRNAs have been omitted due to the set thresholds; most will be fragments of abundant transcripts, though.
force (bool (default: False)) – Ignore previous counting; go ahead after backing old counts up.
test (bool (default: False)) – Count a subset of samples for testing or profiling.
cigchk (str (default: CIGARCHK)) – Label to mark function for checking cigar string of a read alignment.
nomis (int (default: NRMISM))

Returns:

The TSV files are saved to the configured STOREPATH.

Return type:

A series of TSV files with count data
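The `bins`, `tresh` and `maincut` parameters describe simple arithmetic that a short sketch can make concrete. The functions below are illustrative only and do not reproduce coalispr's code: the names are hypothetical, half-open integer coordinates are an assumption, and the example values stand in for the configured defaults, which are not given here.

```python
def partition_segment(start, end, bins):
    """Split the half-open interval [start, end) into `bins` contiguous parts.

    Hypothetical helper; parts are equal in length when the segment length
    is divisible by `bins`, otherwise their lengths differ by at most 1.
    """
    if bins < 1 or end <= start:
        raise ValueError("need bins >= 1 and end > start")
    length = end - start
    # Integer arithmetic keeps the parts contiguous and non-overlapping.
    edges = [start + (length * i) // bins for i in range(bins + 1)]
    return list(zip(edges[:-1], edges[1:]))


def background_level(tresh):
    """Accepted background level, 2^tresh, as described for `tresh`."""
    return 2 ** tresh


def fold_difference(maincut):
    """Fold difference, 10^maincut, as described for `maincut`."""
    return 10 ** maincut


# Example values are placeholders, not coalispr's defaults.
print(partition_segment(100, 200, 4))
print(background_level(5))
print(fold_difference(0.5))
```

Counting reads per sub-segment rather than per segment is what makes it possible to see whether reads are spread evenly or concentrated at one end of the main segment.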

coalispr.bedgraph_analyze.process_bamdata.process_reads_for_region(samples, chrnam, region, strand, comparereads, cigchk=CIGARCHK, nomis=NRMISM, tagBam=TAGBAM)¶

Obtain read-length data for a particular region on chromosome chrnam for given samples.

Parameters:

samples (list) – List of short names to retrieve bamfiles with alignment data for.
chrnam (str) – Name of chromosome to retrieve region from.
region (tuple) – Tuple with coordinates for chromosomal region to retrieve counts for.
strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).
comparereads (list) – List of reads to count for comparison.
tagBam (str (default: TAGBAM)) – Sort of aligned-reads (TAGCOLL or TAGUNCOLL) used for generating alignment files.
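To show what read-length data for a region and strand could look like, here is a minimal sketch that works on plain tuples instead of bam files. It is not the function documented above: the name `read_lengths_in_region`, the `(start, end, strand)` tuples, the lowercase strand labels standing in for COMBI, PLUS and MINUS, and the rule that a read must lie fully inside the region are all assumptions.

```python
from collections import Counter


def read_lengths_in_region(reads, region, strand="combi"):
    """Count read lengths for reads lying fully inside `region`.

    Hypothetical helper. `reads` is an iterable of (start, end, strand)
    tuples with half-open coordinates and strand '+' or '-'.
    """
    lo, hi = region
    # Stand-in labels; the values of coalispr's COMBI, PLUS and MINUS
    # constants are not known here.
    wanted = {"combi": {"+", "-"}, "plus": {"+"}, "minus": {"-"}}[strand]
    lengths = Counter()
    for start, end, read_strand in reads:
        if read_strand in wanted and lo <= start and end <= hi:
            lengths[end - start] += 1
    return lengths
```

A `Counter` keyed by read length is convenient here because length distributions from several samples can be added together or compared directly.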

coalispr.bedgraph_analyze.process_bamdata.bedgraphs_from_xtra_bamdata(bampath, force=False)¶

Create bedgraph files from selected bamdata.

During specification of reads, genuine siRNAs can be thrown out due to overlap with unspecific reads, even if these would not be siRNAs. Thus, based on start-nucleotide and length range, siRNAs can be retrieved during counting of unspecified reads and copied to new bam files. This function extracts and processes these reads.

Bam files need to be sorted and indexed before they can be converted to bedgraphs.
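The sort-and-index step can be sketched as a pair of `samtools` command lines. This is an illustration under assumptions, not what coalispr runs internally: the helper name `sort_and_index_commands` and the `.sorted.bam` naming are hypothetical, and `samtools` is assumed to be installed and on the PATH when the commands are executed.

```python
from pathlib import Path


def sort_and_index_commands(bam):
    """Build samtools command lines that sort and then index a bam file.

    Hypothetical helper; it only builds the argument lists and does not
    run anything.
    """
    bam = Path(bam)
    # Assumed naming scheme for the sorted output file.
    sorted_bam = bam.with_name(f"{bam.stem}.sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],
        ["samtools", "index", str(sorted_bam)],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; indexing has to come second because `samtools index` expects a coordinate-sorted file.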