coalispr.bedgraph_analyze.process_bamdata¶
Module to count bam files based on specification of aligned reads.
Attributes¶
Classes¶
Segments | Class to retrieve reusable iterators for segments with reads to count.
Bam_collector | Class to keep track of bam files.
Read_checker | Class for doing cigar and other read checks.
Functions¶
total_raw_counts | Obtain total mapped reads and unmapped reads from alignments.
has_been_counted | Check whether count files have been created.
has_been_done | Check whether counting or copying of bamfiles has been done before.
count_bamfiles | Obtain count data from available bam-alignment files. When possible, uses multiprocessing, otherwise sequential counting of files.
count_region | Obtain read-length data for a particular region on chromosome chrnam for given samples.
Module Contents¶
- coalispr.bedgraph_analyze.process_bamdata.logger¶
- class coalispr.bedgraph_analyze.process_bamdata.Segments(chrnam, kind)¶
Class to retrieve reusable iterators for segments with reads to count. From: Brett Slatkin, Effective Python, Item 17; also on: https://dev.to/v_it_aly/python-tips-how-to-reuse-a-generator-within-one-function-a5o
Needs to be a top-level class, not an inner class, in order to be picklable, which is required for multiprocessing when counting reads that map to particular segments.
- df1, df2
Pair of dataframes for PLUS and MINUS strands
- Type:
DataFrame, DataFrame
- Parameters:
chrnam (str) – Chromosome to yield segments for.
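The reusable-iterator idea referred to above can be sketched as follows. This is a minimal illustration, not the Segments implementation: the class name `ReusableSegments` and its `segments` argument are hypothetical, and plain tuples stand in for the strand dataframes.

```python
class ReusableSegments:
    """Hypothetical stand-in for a container of (start, end) segments."""

    def __init__(self, chrnam, segments):
        self.chrnam = chrnam
        self.segments = list(segments)

    def __iter__(self):
        # A fresh generator is created on every call, so the object can be
        # looped over repeatedly, unlike a bare generator.
        for start, end in self.segments:
            yield self.chrnam, start, end


segs = ReusableSegments("chr1", [(10, 50), (200, 260)])
first_pass = list(segs)
second_pass = list(segs)  # not exhausted by the first pass
```

Because such a class is defined at module top level, the pickle module can locate it by name, which is what allows instances to be handed to worker processes.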
- class coalispr.bedgraph_analyze.process_bamdata.Bam_collector¶
Class to keep track of bam files.
- bamkeys: list¶
- nodiscards_keys: list¶
- bamfiles: dict¶
- tag: str¶
- classmethod collect_bamfiles(tag=TAGBAM, src_dir=SRCFLDR, ndirlevels=SRCNDIRLEVEL)¶
Retrieve all bam-file names for counting aligned reads.
These are marked by BAM.
- Parameters:
tag (str (default: TAGBAM)) – Type of aligned-reads (collapsed or uncollapsed).
src_dir (Path (default: SRCFLDR)) – Path to folder with sequencing data incl. bamfiles.
ndirlevels (int (default: SRCNDIRLEVEL)) – Number of subdirectories to traverse from SRC directory to get to bamfiles.
- Returns:
Dictionary of sample (SHORT) names and paths to associated BAM-files.
- Return type:
dict
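A rough sketch of how such a lookup could work with `pathlib` is given here. The function name `find_bamfiles`, the folder layout, and the use of the first subfolder as the short sample name are all assumptions made for illustration; the actual method may organise this differently.

```python
from pathlib import Path
import tempfile


def find_bamfiles(src_dir, tag, ndirlevels=1):
    """Map a short sample name to the path of its tagged bam file.

    Hypothetical layout: the bam file sits exactly `ndirlevels` folders
    below src_dir and has `tag` somewhere in its file name.
    """
    pattern = "/".join(["*"] * ndirlevels + [f"*{tag}*.bam"])
    found = {}
    for path in sorted(Path(src_dir).glob(pattern)):
        # The first folder below src_dir serves as the sample name here.
        sample = path.relative_to(src_dir).parts[0]
        found[sample] = path
    return found


# Build a throwaway folder tree to try the function on.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sample in ("wt_1", "mut_1"):
        (root / sample).mkdir()
        (root / sample / f"{sample}_collapsed.bam").touch()
        (root / sample / f"{sample}_uncollapsed.bam").touch()
    bams = find_bamfiles(root, tag="_collapsed", ndirlevels=1)
    names = sorted(bams)
```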
- classmethod get_bamfiles(tag)¶
- classmethod num_counted_libs(plusdiscards=True)¶
Retrieve the number of counted libraries from the number of counted bam-files.
- classmethod keys_counted_libs(plusdiscards=True)¶
Retrieve keys linked to counted bam-files.
- coalispr.bedgraph_analyze.process_bamdata.total_raw_counts(tagBam, stranded, force, suffix)¶
Obtain total mapped reads and unmapped reads from alignments.
- Parameters:
tagBam (str) – Determines type of reads counted, TAGUNCOLL, or TAGCOLL.
stranded (bool) – Flag to output total counts for each strand.
force (bool) – Flag to redo counting (while backing up previous results).
suffix (str) – Suffix, BNY or TSV, that defines the output format of count files as binary or tab-separated text.
- Returns:
A text file with tab-separated columns giving total input numbers for all experiments.
- Return type:
A TSV file
- class coalispr.bedgraph_analyze.process_bamdata.Total_counter¶
- samheader_counters: list¶
- unmapped_counters: list¶
- bams: dict¶
- tagBam: str¶
- suffixpath: pathlib.Path¶
- filepath: pathlib.Path¶
- keys: list¶
- stranded: bool¶
- suffix: str¶
- suffuncs¶
- totalfram: pandas.DataFrame¶
- unmap_in_sam: bool¶
- classmethod init_raw_count(tagBam, stranded, force, suffix)¶
Set up common variables.
- init_samheader_counts()¶
Initiator function, run by each spawned process
- get_samheader_counts()¶
Worker function; get counts from SAM header or from line counting in BAM file.
- classmethod process_samheader_counters()¶
Create dataframe from obtained counts.
- init_unmapped_counts()¶
Initiator function run by each spawned process.
- readhere()¶
Helper function
- count_unmapped_lines()¶
Worker function; count lines in file with unmapped reads (that has been created during alignment by an aligner like STAR).
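Counting reads by counting lines can be sketched as below. The sketch assumes a FASTQ-style layout with four lines per read; the actual layout of the file with unmapped reads depends on the aligner settings, and the function name `count_fastq_reads` is hypothetical.

```python
import io


def count_fastq_reads(handle, lines_per_read=4):
    """Count reads in a text stream with a fixed number of lines per read."""
    nlines = sum(1 for _ in handle)
    return nlines // lines_per_read


# Two toy FASTQ records stand in for a file with unmapped reads.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n")
n_unmapped = count_fastq_reads(demo)
```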
- classmethod external_unmapped_counts()¶
Obtain counts for unmapped reads put aside in UNMAPPEDFIL, another file than the BAM alignment file.
- classmethod process_external_unmapped()¶
Process counts obtained for UNMAPPEDFIL reads.
- classmethod process_totalfram()¶
Save dataframe with all counts to file.
- classmethod get_raw_counts(tagBam, stranded, force, suffix)¶
- coalispr.bedgraph_analyze.process_bamdata.has_been_counted(typeofcount, kind, suffix, folder=None)¶
Check whether count files have been created.
- Parameters:
typeofcount (str) – Pattern to find counted files
kind (str) – Select kind of reads that have been counted, either SPECIFIC or UNSPECIFIC
suffix (str) – Suffix, BNY or TSV, that defines the output format of count files as binary or tab-separated text.
folder (str) – Default folder name based on kind.
- Return type:
boolean to indicate count file is present (True) or not (False)
- coalispr.bedgraph_analyze.process_bamdata.has_been_done(apath, force, unspec=False)¶
Check whether counting or copying of bamfiles has been done before. This function initially helped to protect the user from accidentally redoing a very lengthy count step. When counting unspecific reads, most of the time now goes into writing bam files, processing these to bedgraphs, and then binning and merging the information for use with ‘showgraphs’.
- Parameters:
apath (Path) – Path to location where count or bam files would be saved.
force (bool) – Flag to continue even when the step has been done before.
unspec (bool) – Flag to determine whether unspecific unselected reads need to be done.
- Returns:
True when count-files are present; False when not.
- Return type:
bool
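A guard of this kind might look like the following sketch. The function name `done_before`, the `pattern` argument, and the choice to return False when `force` is set are assumptions for illustration; backing up earlier results is left out.

```python
from pathlib import Path
import tempfile


def done_before(apath, force=False, pattern="*.tsv"):
    """Return True when matching files exist and a redo was not forced."""
    apath = Path(apath)
    present = apath.is_dir() and any(apath.glob(pattern))
    if present and not force:
        print(f"Found earlier results in {apath}; use force to redo.")
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    empty = done_before(tmp)                 # nothing there yet
    (Path(tmp) / "counts.tsv").touch()
    skipped = done_before(tmp)               # results found, step is skipped
    redo = done_before(tmp, force=True)      # results found, redo requested
```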
- coalispr.bedgraph_analyze.process_bamdata.get_bam_selection()¶
- class coalispr.bedgraph_analyze.process_bamdata.Read_checker(cigchk, nomis=NRMISM)¶
Class for doing cigar and other read checks. Take cigar items as marked by ‘cigartuples; cigarstring; <meaning>’:
0;M <match>, 1;I <insertion>, 2;D <deletion>, 3;N <skipped>,
which are standard (the other accepted cigar items:
4;S <soft clip>, 5;H <hard clip>, 6;P <padding>, 7;= <sequence match>, 8;X <sequence mismatch, substitution>
are indirectly used here). Skip the read if the alignment is dubious: for short single-end (SE) sequences, accept only matches (0;M) and gaps (3;N); for reads from UV-treated samples, also accept a point deletion (2;D).
- cigchk¶
Defines function to use for checking a read.
- Type:
str (from CIGARCHK, either CIGPD or CIGFM)
- nomis¶
Indicates number of mismatches (default: NRMISM)
- Type:
int
- ok_introns¶
- Type:
list
- nomis = 0¶
- ok_introns = []¶
- fullmatch(cigtuples)¶
- pointdel(cigtuples)¶
- ok_read(read, intronchk=True)¶
Check cigar and number of tolerated mismatches for each read.
- Parameters:
read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.
intronchk (bool) – Check for introns and gather their lengths (or not).
- Return type:
True or False and sets list of intron-lengths for valid introns
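The two kinds of cigar check described for this class can be illustrated on plain lists of (operation, length) tuples, the form in which pysam exposes `cigartuples`. This is a sketch under assumptions, not the class's own code: in particular, tolerating at most one single-nucleotide deletion in `pointdel` is a guess at what "point-deletion" means here, and `intron_lengths` is a hypothetical helper.

```python
# Operation codes as used in cigartuples: (operation, length) pairs.
MATCH, INS, DEL, SKIP = 0, 1, 2, 3


def fullmatch(cigtuples):
    """Accept only matches (0;M) and skipped regions or gaps (3;N)."""
    return all(op in (MATCH, SKIP) for op, _ in cigtuples)


def pointdel(cigtuples):
    """As fullmatch, but tolerate one single-nucleotide deletion (2;D)."""
    deletions = [length for op, length in cigtuples if op == DEL]
    rest = [(op, length) for op, length in cigtuples if op != DEL]
    return fullmatch(rest) and deletions in ([], [1])


def intron_lengths(cigtuples):
    """Collect lengths of skipped regions (3;N), i.e. candidate introns."""
    return [length for op, length in cigtuples if op == SKIP]


spliced = [(0, 12), (3, 85), (0, 10)]   # 12M85N10M
one_del = [(0, 9), (2, 1), (0, 11)]     # 9M1D11M
clipped = [(4, 2), (0, 20)]             # 2S20M
```

With these inputs, `spliced` passes both checks, `one_del` passes only `pointdel`, and `clipped` fails both because soft clipping is not among the accepted operations.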
- hitidx(hit_idx, beancounter)¶
Check hit-index for read as defined in alignment file.
- ok_strand(read, strand)¶
Check strand of read for inclusion in counts
- Parameters:
read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.
strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).
- Return type:
True or False
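A strand filter along these lines could be sketched as follows. The string values given to COMBI, PLUS and MINUS are placeholders (the real constants come from the configuration), `strand_ok` is a hypothetical name, and a simple object with an `is_reverse` attribute stands in for a `pysam.AlignedSegment`.

```python
from types import SimpleNamespace

# Placeholder values; the actual constants are defined elsewhere.
COMBI, PLUS, MINUS = "combined", "plus", "minus"


def strand_ok(read, strand):
    """Return True when the read lies on the requested strand."""
    if strand == COMBI:
        return True  # both strands are accepted
    if strand == MINUS:
        return read.is_reverse
    return not read.is_reverse


forward = SimpleNamespace(is_reverse=False)
reverse = SimpleNamespace(is_reverse=True)
```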
- class coalispr.bedgraph_analyze.process_bamdata.Bam_countprocessor¶
- bampath: pathlib.Path¶
- bampeak: tuple¶
- bams: dict¶
- bamstart: int¶
- collected_counters: list¶
- comparereads: list¶
- countunselect: bool¶
- bins: int¶
- cols: list¶
- force: bool¶
- gap: int¶
- keys: list¶
- kind: str¶
- region: str¶
- segs: dict¶
- selectbam: pysam.AlignmentFile¶
- smallest2chroms: list¶
- strand: str¶
- suffix: str¶
- suffixpath: pathlib.Path¶
- TEST: bool¶
- writebam: bool¶
- cigchk = 'fullmatch'¶
- maincut = 0.78¶
- nomis = 0¶
- tresh = 5¶
- classmethod run_all_bamcount_processes(bins, kind, writebam, force, cigchk, nomis, suffix)¶
Steps to obtain counts from all bam files, by sequential processing or multiprocessing. Inner function needed to get run-time counting.
- initiate_counting_for_sample()¶
Initiation function for multiprocessing for defining globals in each spawned process.
- process_bamfile()¶
Worker function in multiprocessing.
- classmethod check_bam_file(inbam)¶
- classmethod check_mulmap()¶
- classmethod init_counting(bins, kind, writebam, force, cigchk, nomis, suffix)¶
- classmethod run_region_count_processes(samples, chrnam, region, strand, comparereads, cigchk, nomis, suffix)¶
- coalispr.bedgraph_analyze.process_bamdata.count_bamfiles(bins, kind, writebam, force, cigchk, nomis, suffix)¶
Obtain count data from available bam-alignment files. When possible, uses multiprocessing, otherwise sequential counting of files.
- Parameters:
bins (int) – Number of bins counts are split over (default BINS)
kind (str) – Kind of read, SPECIFIC or UNSPECIFIC
writebam (bool) – Save UNSELECTED UNSPECIFIC reads to separate bam-alignment files for inclusion as a separate group of reads to display by showgraphs.
force (bool) – Allow recounting samples, backup existing count files.
cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.
nomis (int) – Number of tolerated substitutions, default: NRMISM.
suffix (str) – Sets output file extension, BNY or TSV.
- coalispr.bedgraph_analyze.process_bamdata.count_region(samples, chrnam, region, strand, comparereads, cigchk, nomis, suffix)¶
Obtain read-length data for a particular region on chromosome chrnam for given samples.
- Parameters:
samples (list) – List of short names to retrieve bamfiles with alignment data for.
chrnam (str) – Name of chromosome to retrieve region from.
region (tuple) – Tuple with coordinates for chromosomal region to retrieve counts for.
strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).
comparereads (list) – List of reads to count for comparison.
cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.
nomis (int) – Number of tolerated substitutions, default: NRMISM.
suffix (str) – Sets output file extension, BNY or TSV.
- Returns:
output – For making graphs, a pandas.DataFrame with read counts and a list of dataframes with counts for read lengths.
- Return type:
tuple
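The shape of the returned pair, a total count plus a tally per read length, can be illustrated without an alignment file. In this sketch plain (start, end) tuples with half-open coordinates stand in for aligned reads, a dict replaces the dataframes, and read length is taken as end minus start, which would not hold for spliced reads; `tally_region` is a hypothetical name.

```python
from collections import Counter


def tally_region(reads, region):
    """Count reads overlapping a region and tally their lengths."""
    lo, hi = region
    lengths = Counter()
    for start, end in reads:
        if start < hi and end > lo:  # any overlap with the region
            lengths[end - start] += 1
    return sum(lengths.values()), dict(lengths)


reads = [(100, 121), (110, 132), (118, 139), (400, 421)]
total, by_length = tally_region(reads, (105, 150))
```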