coalispr.bedgraph_analyze.unselected

Module for dealing with unselected reads retrieved from unspecific data.

Attributes

Classes

Bedgraphs_from_unselected

Class to create bedgraphs from new bam-files with unselected reads.

Functions

has_unselected()

Returns True if merged, unselected data are available.

Module Contents

coalispr.bedgraph_analyze.unselected.logger
coalispr.bedgraph_analyze.unselected.has_unselected()

Returns True if merged, unselected data are available.

class coalispr.bedgraph_analyze.unselected.Bedgraphs_from_unselected

Class to create bedgraphs from new bam-files with unselected reads.

Attributes:

bampath: Path

Path to input bam-file.

#input_totals: pd.DataFrame

# Frame with total mapped counts for each sample. Not working in parallelization if not passed through as a global via initiator.

samples_count: int

Number of samples

bampath: pathlib.Path
samples = []
classmethod getkeysource(bam)

Retrieve extra bam files and prepare these for common analysis. Because these files describe target-like RNAs (say siRNAs) present in UNSPECIFIC reads, only this kind of read will be considered. As counting by default will be done on TAGBAM bam files, as set in the configuration, this is the tag searched for. Files will be stored as:

|----src--|---key--| |---tag---|
'unspecific_{sample}_selected_collapsed.bam'.

bam: Path

Path to extra bam files.
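
As an illustration of the naming scheme described above, the sample key can be recovered from a stored file name with a regular expression. This is a minimal sketch and not part of coalispr: the helper name `sample_from_unselected` is hypothetical, and the literal name parts are hard-coded here, whereas the documentation states that the tag (TAGBAM) comes from the configuration.

```python
import re
from pathlib import Path

# Hypothetical pattern mirroring the documented file name
# 'unspecific_{sample}_selected_collapsed.bam'. The literal parts are
# hard-coded for illustration only.
UNSELECTED_NAME = re.compile(
    r"^unspecific_(?P<sample>.+)_selected_collapsed\.bam$"
)


def sample_from_unselected(bam):
    """Return the sample key embedded in the file name, or None if no match."""
    match = UNSELECTED_NAME.match(Path(bam).name)
    return match.group("sample") if match else None


# Only the file name is inspected; the directory part is ignored.
key = sample_from_unselected("data/unspecific_wt_1_selected_collapsed.bam")
other = sample_from_unselected("data/wt_1_collapsed.bam")
```

Files that do not follow the scheme yield `None`, so they can be skipped rather than misassigned to a sample.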

make_data(bgnam0, bgnam1, plusbg, minbg)

Make bedgraphs with STAR; not suitable as worker in multiprocessing.

classmethod bedgraphs_from_xtra_bamdata()

Create bedgraph files from selected bamdata.

During specification of reads, target-like RNAs (like siRNAs) may be thrown out because of overlap with more abundant reads in the negative controls. Based on a telling determinant (like start-nucleotide and length range for siRNA) such target-like reads can be retrieved during counting of unspecified reads and copied to new bam files. Here, these reads are extracted and processed. A fraction of such target-like reads may be present by chance and could therefore represent false positives, especially when they do not stand out in positive controls or do not change with interfering mutations/conditions.

Bam files need to be sorted and indexed before they can be converted to bedgraphs.
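
The sort-and-index requirement can be sketched with the `samtools sort` and `samtools index` commands. This is an assumption about tooling for illustration only; the class may handle this step differently. The function names and the `.sorted.bam` suffix are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def sort_index_commands(bam):
    """Build samtools commands that sort a bam file and index the result."""
    bam = Path(bam)
    # Hypothetical naming: write the sorted copy next to the input file.
    sorted_bam = bam.with_name(f"{bam.stem}.sorted.bam")
    return [
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],
        ["samtools", "index", str(sorted_bam)],
    ]


def sort_and_index(bam):
    """Run the commands in order; requires samtools on the PATH."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools not found on PATH")
    for cmd in sort_index_commands(bam):
        subprocess.run(cmd, check=True)


# Building the commands has no side effects, so they can be inspected first.
cmds = sort_index_commands("unspecific_a_selected_collapsed.bam")
```

Indexing is run on the sorted output because `samtools index` expects a coordinate-sorted file.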

classmethod get_data()
get_inputcounts()

Get dataframe with total input counts from file.

classmethod normalize_unselected(frames)

Create RPM values for raw bedgraph-data based on total mapped reads.

Normally, RPM output is based on the total input of the actual bam file. For unselected reads this total is fairly low, which would produce signals that are too high and, compared to the bedgraph values derived from the original bam-alignments, not a good indication of the relevance of these reads. Therefore, total mapped reads are taken as the RPM-standard to normalize unselected bedgraph values.

Return a dict with chromosome names as keys and, as values, dataframes with normalized (RPM) values in a column with the sample name as header.

Parameters:

frames (dict) – Dict with chromosome names as keys and, as values, dataframes with chromosome coordinates as index and one column with a sample SHORT name as header, for either the PLUS or the MINUS strand.

Returns:

RPM-based frames

Return type:

dict
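
The effect of the chosen denominator can be shown with a small worked example. This sketch is not the library's implementation: plain nested dicts stand in for the per-chromosome dataframes, the names `rpm` and `normalize_frames` are hypothetical, and all counts are made up.

```python
def rpm(raw, total_reads):
    """Scale a raw count to reads per million of total_reads."""
    return raw * 1_000_000 / total_reads


def normalize_frames(frames, total_reads):
    """Return RPM values for a {chromosome: {coordinate: raw count}} mapping."""
    return {
        chrom: {pos: rpm(raw, total_reads) for pos, raw in cover.items()}
        for chrom, cover in frames.items()
    }


# Made-up numbers for one sample and one strand.
raw_frames = {"chr1": {100: 5, 101: 8}}
own_total = 1_000          # reads in the small bam file with unselected reads
mapped_total = 2_000_000   # total mapped reads of the original alignment

inflated = normalize_frames(raw_frames, own_total)       # 5 reads -> 5000.0 RPM
comparable = normalize_frames(raw_frames, mapped_total)  # 5 reads -> 2.5 RPM
```

Dividing by the small file's own total inflates five reads to 5000 RPM, while dividing by total mapped reads gives 2.5 RPM, which is on the same scale as the bedgraph values from the original alignments.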