coalispr.bedgraph_analyze.process_bamdata
Module to count bam files based on specification of aligned reads.
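Attributes
logger |
Classes
Bam_collector | Class to keep track of bam files.
Total_counter |
Read_checker | Class to store outcome of cigar and other read checks.
Bam_countprocessor |
Bedgraphs_from_unselected | Class to create bedgraphs from new bam-files with unselected reads.
Functions
total_raw_counts(tagBam, stranded[, force]) | Obtain total mapped reads and unmapped reads from alignments.
count_folder(kind, bam, segments, overmax, maincut, usegaps) | Return folder with stored count files.
has_been_counted(typeofcount, kind[, folder]) | Check whether count files have been created.
process_bam_files(bins, kind, writebam, force, cigchk, nomis) | Obtain count data from available bam-alignment files.
process_reads_for_region(samples, chrnam, region, strand, comparereads, cigchk, nomis) | Obtain read-length data for a particular region on chromosome chrnam for given samples.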
Module Contents
- coalispr.bedgraph_analyze.process_bamdata.logger¶
- class coalispr.bedgraph_analyze.process_bamdata.Bam_collector¶
Class to keep track of bam files.
- bamkeys: list¶
- nodiscards_keys: list¶
- bamfiles: dict¶
- tag: str¶
- classmethod collect_bamfiles(tag=TAGBAM, src_dir=SRCDIR, ndirlevels=SRCNDIRLEVEL)¶
Retrieve all bam-file names for counting aligned reads.
These are marked by SAMBAM.
- Parameters:
tag (str (default: TAGBAM)) – Type of aligned-reads (collapsed or uncollapsed).
src_dir (Path (default: SRCDIR)) – Path to folder with sequencing data incl. bamfiles.
ndirlevels (int (default: SRCNDIRLEVEL)) – Number of subdirectories to traverse from SRC directory to get to bamfiles.
- Returns:
Dictionary of sample (SHORT) names and paths to associated SAMBAM-files.
- Return type:
dict
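The directory traversal described for collect_bamfiles can be pictured with a minimal sketch. It is not the coalispr implementation: the function name `collect_bam_paths`, the `.bam` suffix, and the choice to name each sample after its first folder below the source directory are assumptions made for illustration.

```python
from pathlib import Path


def collect_bam_paths(src_dir, suffix=".bam", ndirlevels=1):
    """Map sample names to BAM paths found `ndirlevels` folders below src_dir.

    Hypothetical helper; the sample name is assumed to be the first
    folder below src_dir (or the file name when ndirlevels is 0).
    """
    src_dir = Path(src_dir)
    # Build a glob pattern such as "*/*.bam" for ndirlevels=1.
    pattern = "/".join(["*"] * ndirlevels + [f"*{suffix}"])
    found = {}
    for path in sorted(src_dir.glob(pattern)):
        sample = path.relative_to(src_dir).parts[0]
        found[sample] = path
    return found
```

Calling `collect_bam_paths("/path/to/data", ndirlevels=1)` would return a dictionary like the one described under Returns, with one entry per sample folder that contains a matching file.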
- classmethod get_bamfiles(tag)¶
- classmethod num_counted_libs(plusdiscards=True)¶
Retrieve the number of counted libraries from the number of counted bam-files.
- classmethod keys_counted_libs(plusdiscards=True)¶
Retrieve keys linked to counted bam-files.
- coalispr.bedgraph_analyze.process_bamdata.total_raw_counts(tagBam, stranded, force=False)¶
Obtain total mapped reads and unmapped reads from alignments.
- Parameters:
tagBam (str) – Determines type of reads counted, TAGUNCOLL, or TAGCOLL.
stranded (bool) – Flag to output total counts for each strand.
force (bool) – Flag to redo counting (while backing up previous results).
- Returns:
A text file with tab-separated columns giving total input numbers for all experiments.
- Return type:
A TSV file
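To illustrate the kind of totals this function reports, the sketch below sums mapped and unmapped reads from text in the tab-separated format printed by `samtools idxstats` (reference name, length, mapped, unmapped) and writes them as TSV rows. This is a swapped-in technique for illustration only: Total_counter is documented as reading the SAM header or counting lines, and the helper names and column headers here are hypothetical.

```python
def totals_from_idxstats(text):
    """Sum mapped and unmapped read counts over all idxstats rows."""
    mapped = unmapped = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _name, _length, n_mapped, n_unmapped = line.split("\t")
        mapped += int(n_mapped)
        unmapped += int(n_unmapped)
    return mapped, unmapped


def totals_to_tsv(totals):
    """Format {sample: (mapped, unmapped)} as tab-separated text.

    The column names are assumptions, not those used by coalispr.
    """
    rows = ["sample\tmapped\tunmapped"]
    for sample, (mapped, unmapped) in totals.items():
        rows.append(f"{sample}\t{mapped}\t{unmapped}")
    return "\n".join(rows) + "\n"
```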
- class coalispr.bedgraph_analyze.process_bamdata.Total_counter¶
- samheader_counters: list¶
- unmapped_counters: list¶
- bams: dict¶
- tagBam: str¶
- tsvpath: pathlib.Path¶
- filepath: pathlib.Path¶
- keys: list¶
- unmap_in_sam: bool¶
- suffuncs¶
- totalfram: pandas.DataFrame¶
- classmethod init_raw_count(tagBam, stranded, force)¶
Set up common variables.
- classmethod get_samheader_counts(key)¶
Get counts from SAM header or from line counting in BAM file.
- classmethod process_samheader_counts()¶
Create dataframe from obtained counts.
- classmethod count_unmapped_lines(key)¶
Count lines in file with unmapped reads, put apart during alignment.
- classmethod external_unmapped_counts()¶
Obtain counts for unmapped reads put aside in UNMAPPEDFIL, another file than the BAM alignment file.
- classmethod process_external_unmapped()¶
Process counts obtained for UNMAPPEDFIL reads.
- classmethod process_totalfram()¶
Save dataframe with all counts to file.
- classmethod get_raw_counts(tagBam=None, stranded=False, force=False)¶
- coalispr.bedgraph_analyze.process_bamdata.count_folder(kind, bam, segments, overmax, maincut, usegaps)¶
Return folder with stored count files.
- coalispr.bedgraph_analyze.process_bamdata.has_been_counted(typeofcount, kind, folder=None)¶
Check whether count files have been created.
- Parameters:
typeofcount (str) – Pattern to find specific files
kind (str) – Select kind of reads that have been counted, either SPECIFIC or UNSPECIFIC
- Return type:
boolean to indicate count file is present (True) or not (False)
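A check of this kind can be sketched with pathlib. The helper name `count_files_present` and the way the pattern is wrapped in wildcards are assumptions; how coalispr derives the folder from `kind` is not shown here.

```python
from pathlib import Path


def count_files_present(folder, pattern):
    """Return True if folder exists and holds a file whose name contains pattern."""
    folder = Path(folder)
    # any() stops at the first match, so large folders are not fully listed.
    return folder.is_dir() and any(folder.glob(f"*{pattern}*"))
```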
- class coalispr.bedgraph_analyze.process_bamdata.Read_checker¶
Class to store outcome of cigar and other read checks.
- nomis¶
Indicates number of mismatches (default: NRMISM)
- Type:
int
- okintrons¶
- Type:
list
- unfit_read = None¶
- nomis: int¶
- okintrons: list¶
- classmethod make_cigarcheck(cigchk, nomis=NRMISM)¶
Take cigar items as marked by ‘cigartuples; cigarstring; <meaning>’:
0;M <match>, 1;I <insertion>, 2;D <deletion>, 3;N <skipped>,
which are standard. The other accepted cigar items,
4;S <soft clip>, 5;H <hard clip>, 6;P <padding>, 7;= <sequence match>, 8;X <sequence mismatch, substitution>,
are used only indirectly here. Skip the read if the alignment is dubious: for short SE sequences, accept only matches (0;M) and gaps (3;N); for reads from UV-treated samples, also accept a point-deletion (2;D).
- Parameters:
cigchk (str (from CIGARCHK, either CIGPD or CIGFM)) – Defines function to use for checking a read.
nomis (int (default: NRMISM)) – Number of tolerated mismatches.
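The two acceptance rules described above can be sketched on plain cigar tuples of the form (operation, length), as exposed by pysam's `cigartuples`. This is a minimal illustration, not the coalispr code: the function names, the `"pointdel"` key, and the reading of "a point-deletion" as at most one deletion of length 1 are assumptions; only `'fullmatch'` appears in this module as a documented default.

```python
MATCH, INSERTION, DELETION, SKIPPED = 0, 1, 2, 3  # BAM cigar operation codes


def full_match(cigartuples):
    """Accept only matches (0;M) and gaps (3;N)."""
    return all(op in (MATCH, SKIPPED) for op, _length in cigartuples)


def point_deletion(cigartuples):
    """Accept matches, gaps and (assumed) at most one 1-nt deletion (2;D)."""
    deletions = [length for op, length in cigartuples if op == DELETION]
    others_ok = all(op in (MATCH, SKIPPED, DELETION) for op, _length in cigartuples)
    return others_ok and len(deletions) <= 1 and all(n == 1 for n in deletions)


def make_check(name):
    """Return the checking function selected by name ("pointdel" is hypothetical)."""
    checks = {"fullmatch": full_match, "pointdel": point_deletion}
    return checks[name]
```

Selecting the function once, as in `check = make_check("fullmatch")`, and then calling `check(read.cigartuples)` for every read avoids repeating the choice inside the counting loop.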
- classmethod ok_read(read, intronchk=True)¶
Check cigar and number of tolerated mismatches for each read.
- Parameters:
read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.
intronchk (bool) – Check for introns and gather their lengths (or not).
- Return type:
True or False; also sets the list of intron lengths for valid introns
- classmethod hitidx(hit_idx, beancounter)¶
Check hit-index for read as defined in alignment file.
- classmethod ok_strand(read, strand)¶
Check strand of read for inclusion in counts
- Parameters:
read (pysam.AlignedSegment) – Input for obtaining tuples describing the cigar string of an aligned read, number of mismatches and introns.
strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).
- Return type:
True or False
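A strand check of this kind could look like the sketch below. It relies on the `is_reverse` attribute that pysam.AlignedSegment provides; the string values given to PLUS, MINUS and COMBI are placeholders (the real constants are defined elsewhere in coalispr), and the MUNR and CORB handling mentioned above is left out.

```python
from types import SimpleNamespace

# Placeholder values; the actual coalispr constants may differ.
PLUS, MINUS, COMBI = "plus", "minus", "combined"


def strand_matches(read, strand):
    """Return True if the read lies on the requested strand."""
    if strand == COMBI:
        return True  # both strands are counted together
    read_strand = MINUS if read.is_reverse else PLUS
    return read_strand == strand


# Stand-in for a pysam.AlignedSegment, used only for this example.
reverse_read = SimpleNamespace(is_reverse=True)
```

With the stand-in above, `strand_matches(reverse_read, MINUS)` and `strand_matches(reverse_read, COMBI)` are true, while `strand_matches(reverse_read, PLUS)` is false.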
- class coalispr.bedgraph_analyze.process_bamdata.Bam_countprocessor¶
- bampath: pathlib.Path¶
- bampeak: tuple¶
- bams: dict¶
- bamstart: int¶
- collected_counters: list¶
- comparereads: list¶
- countselect: bool¶
- bins: int¶
- cols: list¶
- force: bool¶
- gap: int¶
- has_chrxtra: bool¶
- keys: list¶
- kind: str¶
- region: str¶
- segs: dict¶
- selectbam: pysam.AlignmentFile¶
- strand: str¶
- TEST: bool¶
- tsvpath: pathlib.Path¶
- writebam: bool¶
- cigchk = 'fullmatch'¶
- maincut = 1¶
- nomis = 0¶
- tresh = 5¶
- classmethod run_all_bamcount_processes(bins, kind, writebam, force, cigchk, nomis)¶
Steps to obtain counts from all bam files, by sequential processing or multiprocessing. An inner function is needed to get run-time counting.
- classmethod process_bamfile(key)¶
Multiprocess this part.
- classmethod check_bam_file(inbam)¶
- classmethod check_mulmap()¶
- classmethod init_counting(bins, kind, writebam, force, cigchk, nomis)¶
- classmethod run_region_count_processes(samples, chrnam, region, strand, comparereads, cigchk, nomis)¶
- coalispr.bedgraph_analyze.process_bamdata.process_bam_files(bins, kind, writebam, force, cigchk, nomis)¶
Obtain count data from available bam-alignment files. When possible, uses multi-processing, otherwise sequential counting of files.
- Parameters:
bins (int) – Number of bins counts are split over (default BINS)
kind (str) – Kind of read, SPECIFIC or UNSPECIFIC
writebam (bool) – Save UNSELECTED UNSPECIFIC reads in bam-alignment files for inclusion in showgraphs as a separate group of reads.
force (bool) – Allow recounting samples, backup existing count files.
cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.
nomis (int) – Number of tolerated substitutions, default: NRMISM.
- coalispr.bedgraph_analyze.process_bamdata.process_reads_for_region(samples, chrnam, region, strand, comparereads, cigchk, nomis)¶
Obtain read-length data for a particular region on chromosome chrnam for given samples.
- Parameters:
samples (list) – List of short names to retrieve bamfiles with alignment data for.
chrnam (str) – Name of chromosome to retrieve region from.
region (tuple) – Tuple with coordinates for chromosomal region to retrieve counts for.
strand (str) – One of COMBI, PLUS or MINUS; for selecting sense/antisense in defined sections (for which the MUNR and CORB properties are neither set nor applicable).
comparereads (list) – List of reads to count for comparison.
cigchk (str) – From cigar string CIGARCHK, either CIGPD or CIGFM, defines function to use for checking a read.
nomis (int) – Number of tolerated substitutions, default: NRMISM.
- Returns:
output – For making graphs, a pandas.DataFrame with read counts and a list of dataframes with counts for read lengths.
- Return type:
tuple
- class coalispr.bedgraph_analyze.process_bamdata.Bedgraphs_from_unselected¶
Class to create bedgraphs from new bam-files with unselected reads.
Attributes:
- bampath: Path
Path to input bam-file.
- classmethod sortindex(bampath)¶
- classmethod bedgraphs_from_xtra_bamdata(bampath)¶
Create bedgraph files from selected bamdata.
During specification of reads, target-like RNAs (like siRNAs) may be thrown out because of overlap with more abundant reads in the negative controls. Based on a telling determinant (like start-nucleotide and length range for siRNA), such target-like reads can be retrieved during counting of unspecific reads and copied to new bam files. Here, these reads are extracted and processed. Some of these target-like reads may be present by chance and could therefore represent false positives, especially when they do not stand out in positive controls or do not change under interfering mutations/conditions.
Bam files need to be sorted and indexed before they can be converted to bedgraphs.
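The sort-and-index step can be sketched with pysam's wrappers around the samtools commands. This is an illustration under assumptions, not the body of sortindex: the helper names and the `.sorted.bam` naming are hypothetical, and pysam must be installed for `sort_and_index` to run.

```python
from pathlib import Path


def sorted_bam_path(bampath):
    """Return the (hypothetical) output name for the sorted copy of bampath."""
    bampath = Path(bampath)
    return bampath.with_name(f"{bampath.stem}.sorted.bam")


def sort_and_index(bampath):
    """Coordinate-sort a BAM file and index the sorted copy."""
    # Imported here so the path helper above works without pysam installed.
    import pysam

    out = sorted_bam_path(bampath)
    pysam.sort("-o", str(out), str(bampath))  # same arguments as `samtools sort`
    pysam.index(str(out))  # writes the index next to the sorted file
    return out
```

Writing the sorted reads to a new file keeps the unsorted input intact, which helps when a run has to be repeated.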