coalispr.bedgraph_analyze.process_bamdata¶
Module to count reads in bam files based on specification of aligned reads.
Attributes¶
logger
Classes¶
Bam_collector | Class to keep track of bam files.
Read_checker | Class to store outcome of cigar and other read checks.
Bam_countprocessor
Functions¶
total_raw_counts | Obtain total mapped reads and unmapped reads from alignments.
count_folder | Return folder with stored count files.
has_been_counted | Check whether count files have been created.
process_bam_files | Obtain count data from available bam-alignment files. When possible, uses multi-processing, otherwise sequential counting of files.
process_reads_for_region | Obtain read-length data for a particular region on chromosome chrnam for given samples.
bedgraphs_from_xtra_bamdata | Create bedgraph files from selected bamdata.
Module Contents¶
- coalispr.bedgraph_analyze.process_bamdata.logger¶
- class coalispr.bedgraph_analyze.process_bamdata.Bam_collector¶
Class to keep track of bam files.
- bamkeys: list¶
- nodiscards_keys: list¶
- bamfiles: dict¶
- tag: str¶
- classmethod collect_bamfiles(tag=TAGBAM, src_dir=SRCDIR, ndirlevels=SRCNDIRLEVEL)¶
Retrieve all bam-file names for counting aligned reads.
These are marked by SAMBAM.
- Parameters:
tag (str (default: TAGBAM)) – Kind of aligned reads (collapsed or uncollapsed).
src_dir (Path (default: SRCDIR)) – Path to folder with sequencing data, including bam files.
ndirlevels (int (default: SRCNDIRLEVEL)) – Number of subdirectory levels to traverse from the SRC directory to reach the bam files.
- Returns:
Dictionary of sample (SHORT) names and paths to associated SAMBAM-files.
- Return type:
dict
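The traversal described above can be illustrated with a minimal sketch. It is not coalispr's implementation: the function name `collect_bamfiles_sketch`, the `.bam` suffix filter, and the choice to use the first subfolder under `src_dir` as the sample name are all assumptions made for illustration.

```python
from pathlib import Path


def collect_bamfiles_sketch(src_dir, ndirlevels, suffix=".bam"):
    """Map a sample name to an alignment file found ndirlevels below src_dir.

    Assumptions (hypothetical, for illustration only): ndirlevels >= 1 and
    the sample name is the first subfolder under src_dir.
    """
    src_dir = Path(src_dir)
    # Build a pattern such as "*/*/*.bam" for ndirlevels == 2.
    pattern = "/".join(["*"] * ndirlevels + [f"*{suffix}"])
    bamfiles = {}
    for path in sorted(src_dir.glob(pattern)):
        sample = path.relative_to(src_dir).parts[0]
        # If a sample holds several matching files, the last one wins.
        bamfiles[sample] = path
    return bamfiles
```

With `ndirlevels=2`, a file at `src_dir/sampleA/aligned/reads.bam` would be stored under the key `sampleA`.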
- classmethod get_bamfiles(tag=TAGBAM)¶
- classmethod num_counted_libs(plusdiscards=True)¶
Retrieve the number of counted libraries from the number of counted bam files.
- classmethod keys_counted_libs(plusdiscards=True)¶
Retrieve keys linked to counted bam-files.
- coalispr.bedgraph_analyze.process_bamdata.total_raw_counts(tagBam=None, stranded=False, force=False)¶
Obtain total mapped reads and unmapped reads from alignments.
- Returns:
A text file with tab-separated columns giving total input numbers for all experiments.
- Return type:
A TSV file
- coalispr.bedgraph_analyze.process_bamdata.count_folder(kind, bam, segments, overmax, maincut, usegaps)¶
Return folder with stored count files
- coalispr.bedgraph_analyze.process_bamdata.has_been_counted(typeofcount='', kind=SPECIFIC)¶
Check whether count files have been created.
- Parameters:
typeofcount (str) – Pattern to find specific files
kind (str) – Select the kind of reads that have been counted, either SPECIFIC or UNSPECIFIC
- Return type:
boolean to indicate count file is present (True) or not (False)
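A check of this kind could look roughly like the sketch below. The function name `count_files_present`, the explicit `count_dir` argument and the `.tsv` extension are assumptions for illustration; the real function derives the folder from `kind` and other settings.

```python
from pathlib import Path


def count_files_present(count_dir, typeofcount=""):
    """Return True if count_dir holds a .tsv file whose name contains typeofcount.

    Hypothetical helper: folder layout and file extension are assumed.
    """
    count_dir = Path(count_dir)
    if not count_dir.is_dir():
        return False
    # An empty typeofcount matches every .tsv file in the folder.
    return any(typeofcount in path.name for path in count_dir.glob("*.tsv"))
```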
- class coalispr.bedgraph_analyze.process_bamdata.Read_checker¶
Class to store outcome of cigar and other read checks.
- nomis¶
Indicates number of mismatches (default: NRMISM)
- Type:
int
- okintrons¶
- Type:
list
- unfit_read = None¶
- classmethod make_cigarcheck(cigchk, nomis=NRMISM)¶
Take cigar items as marked by ‘cigartuples; cigarstring; <meaning>’:
0;M <match>, 1;I <insertion>, 2;D <deletion>, 3;N <skipped>,
which are standard. The other accepted cigar items,
4;S <soft clip>, 5;H <hard clip>, 6;P <padding>, 7;= <sequence match>, 8;X <sequence mismatch, substitution>,
are indirectly used here. Skip if alignment is dubious: for short SE sequences only accept matches (0;M) and gaps (3;N); for reads from UV-treated samples also accept a point-deletion (2;D).
- Parameters:
cigchk (str (from CIGARCHK, either CIGPD or CIGFM)) – Defines function to use for checking a read.
nomis (int (default: NRMISM))
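The two kinds of cigar check described above can be sketched on plain lists of `(operation, length)` tuples, the shape pysam exposes as `cigartuples`. The function names, the `"pointdel"` label and the reading of "a point-deletion" as at most one deletion of length 1 are assumptions; only `'fullmatch'` appears elsewhere on this page as a `cigchk` value.

```python
# Numeric CIGAR operation codes as listed above (0;M, 1;I, 2;D, 3;N).
MATCH, INSERTION, DELETION, SKIPPED = 0, 1, 2, 3


def fullmatch_ok(cigartuples):
    """Accept only matches (0;M) and gaps (3;N)."""
    return all(op in (MATCH, SKIPPED) for op, _ in cigartuples)


def point_deletion_ok(cigartuples):
    """Accept matches, gaps and, by assumption, at most one 1-nt deletion (2;D)."""
    allowed = all(op in (MATCH, SKIPPED, DELETION) for op, _ in cigartuples)
    deletions = [length for op, length in cigartuples if op == DELETION]
    return allowed and deletions in ([], [1])


def make_check(cigchk):
    """Pick the checking function by name; 'pointdel' is a hypothetical label."""
    checks = {"fullmatch": fullmatch_ok, "pointdel": point_deletion_ok}
    return checks[cigchk]


check = make_check("fullmatch")
spliced_read = [(MATCH, 10), (SKIPPED, 50), (MATCH, 12)]  # 10M50N12M
deleted_read = [(MATCH, 10), (DELETION, 1), (MATCH, 11)]  # 10M1D11M
```

Here `check(spliced_read)` is accepted, while `check(deleted_read)` is rejected unless the point-deletion variant is selected.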
- classmethod ok_read(read, intronchk=True)¶
Check cigar and number of tolerated mismatches for each read.
- Parameters:
read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.
intronchk (bool) – Check for introns and gather their lengths (or not).
- Return type:
True or False; also sets the list of intron lengths for valid introns
- classmethod ok_strand(read, strand)¶
Check strand of read for inclusion in counts
- Parameters:
read (pysam.AlignedSegment) – Aligned read whose strand is checked.
strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).
- Return type:
True or False
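A strand filter along these lines might be sketched as follows. The string values given to COMBI, PLUS and MINUS are placeholders (the real constants come from coalispr's configuration), and the stand-in read objects only mimic the `is_reverse` attribute of `pysam.AlignedSegment`.

```python
from types import SimpleNamespace

# Hypothetical stand-in values; the real COMBI, PLUS and MINUS may differ.
COMBI, PLUS, MINUS = "combined", "plus", "minus"


def strand_ok(read, strand):
    """Return True if the read maps to the requested strand."""
    if strand == COMBI:
        return True  # both strands are accepted
    if strand == PLUS:
        return not read.is_reverse
    if strand == MINUS:
        return read.is_reverse
    raise ValueError(f"unknown strand: {strand!r}")


# Minimal stand-ins for aligned reads, for demonstration only.
forward = SimpleNamespace(is_reverse=False)
reverse = SimpleNamespace(is_reverse=True)
```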
- class coalispr.bedgraph_analyze.process_bamdata.Bam_countprocessor¶
- bampath: pathlib.Path¶
- bampeak: tuple¶
- bams: dict¶
- bamstart: int¶
- collected_counters: list¶
- comparereads: list¶
- bins: int¶
- cols: list¶
- force: bool¶
- gap: int¶
- keys: list¶
- kind: str¶
- region: str¶
- segs: dict¶
- selectbam: pysam.AlignmentFile¶
- strand: str¶
- tagBam: str¶
- tagSeg: str¶
- TEST: bool¶
- tsvpath: pathlib.Path¶
- writebam: bool¶
- cigchk = 'fullmatch'¶
- maincut = 0.78¶
- nomis = 0¶
- tresh = 5¶
- classmethod run_all_bamcount_processes(bins, kind, writebam, force, cigchk, nomis)¶
Steps to obtain counts from all bam files, by sequential or multi-processing.
- classmethod process_bamfile(key)¶
- classmethod run_region_count_processes(samples, chrnam, region, strand, comparereads, cigchk, nomis)¶
- classmethod check_bam_file(inbam)¶
- classmethod init_counting(bins, kind, writebam, force, cigchk, nomis)¶
- coalispr.bedgraph_analyze.process_bamdata.process_bam_files(bins, kind, writebam, force, cigchk, nomis)¶
Obtain count data from available bam-alignment files. When possible, uses multi-processing, otherwise sequential counting of files.
- Parameters:
bins (int) – Number of bins counts are split over (default BINS)
kind (str) – Kind of read, SPECIFIC or UNSPECIFIC
writebam (bool) – Save UNSELECTED UNSPECIFIC reads in bam-alignment files for inclusion in showgraphs as a separate group of reads.
force (bool) – Allow recounting of samples; existing count files are backed up.
cigchk (str) – Cigar check (from CIGARCHK, either CIGPD or CIGFM) that defines the function to use for checking a read.
nomis (int) – Number of tolerated substitutions, default: NRMISM.
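The "multi-processing when possible, otherwise sequential" pattern can be sketched as below. This is an illustration of the general technique, not the module's code: `count_one` is a placeholder for the per-file work, and the fallback condition (an `OSError` or `ImportError` when creating the pool) is an assumption.

```python
from multiprocessing import Pool, cpu_count


def count_one(key):
    """Placeholder for counting the reads of one bam file (hypothetical)."""
    return key, len(key)


def count_all(keys, processes=None):
    """Count every key, in parallel when more than one process is usable."""
    processes = processes or cpu_count()
    if processes > 1 and len(keys) > 1:
        try:
            with Pool(processes=min(processes, len(keys))) as pool:
                return dict(pool.map(count_one, keys))
        except (OSError, ImportError):
            # Multiprocessing is unavailable here; fall through to the loop.
            pass
    # Sequential counting of files.
    return dict(count_one(key) for key in keys)
```

On platforms that start worker processes by spawning, a parallel call to `count_all` should be made from under an `if __name__ == "__main__":` guard.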
- coalispr.bedgraph_analyze.process_bamdata.process_reads_for_region(samples, chrnam, region, strand, comparereads, cigchk, nomis)¶
Obtain read-length data for a particular region on chromosome chrnam for given samples.
- Parameters:
samples (list) – List of short names to retrieve bamfiles with alignment data for.
chrnam (str) – Name of chromosome to retrieve region from.
region (tuple) – Tuple with coordinates for chromosomal region to retrieve counts for.
strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).
comparereads (list) – List of reads to count for comparison.
cigchk (str) – Cigar check (from CIGARCHK, either CIGPD or CIGFM) that defines the function to use for checking a read.
nomis (int) – Number of tolerated substitutions, default: NRMISM.
- Returns:
output – For making graphs, a pandas.DataFrame with read counts and a list of DataFrames with counts for read lengths.
- Return type:
tuple
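How read lengths for a region might be tallied is sketched below with the standard library only. The real function returns pandas objects; here a `Counter` stands in, the reads are simple objects mimicking the `reference_start`, `reference_end` and `query_length` attributes of `pysam.AlignedSegment`, and requiring a read to lie fully inside the region is an assumption.

```python
from collections import Counter
from types import SimpleNamespace


def read_lengths_in_region(reads, region):
    """Count read lengths for reads lying fully inside region (start, end)."""
    start, end = region
    lengths = Counter()
    for read in reads:
        # reference_start is 0-based; reference_end is one past the last base.
        if read.reference_start >= start and read.reference_end <= end:
            lengths[read.query_length] += 1
    return lengths


# Stand-in reads for demonstration; the third one starts before the region.
reads = [
    SimpleNamespace(reference_start=100, reference_end=122, query_length=22),
    SimpleNamespace(reference_start=105, reference_end=126, query_length=21),
    SimpleNamespace(reference_start=90, reference_end=112, query_length=22),
]
lengths = read_lengths_in_region(reads, (100, 200))
```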
- coalispr.bedgraph_analyze.process_bamdata.bedgraphs_from_xtra_bamdata(bampath, force=False)¶
Create bedgraph files from selected bamdata.
During specification of reads, genuine siRNAs can be thrown out due to overlap with unspecific reads even if these would not be siRNAs. Thus, based on start-nucleotide and length range, siRNAs can be retrieved during counting of unspecific reads and copied to new bam files. Here, these reads are extracted and processed.
Bam files need to be sorted and indexed before they can be converted to bedgraphs.