coalispr.bedgraph_analyze.process_bedgraphs¶
Module to process and compare bedgraph data of small RNA-sequencing samples.
With RNAi data as an example, bedgraph signals picked up by sequencing RNA isolated from strains without active RNAi are taken as noise. In Cryptococcus (de)neoformans, these are the sequencing data for strains without a functioning Argonaute (Ago1), RNA-dependent RNA polymerase (Rdp1), or Dicer (Dcr1 and Dcr2). The reads in these samples are not specific: for these strains no siRNAs were detected by gamma-labeling or by Northern blotting of the RNA used as input for sequencing. The noise data most likely derive from molecules that came along in the procedure and need to be treated as background. These background signals are expected to be present in all data sets that have been obtained by the same method.
Comparing bedgraph files of collapsed reads in a genome browser shows
distinctive gaps in RNAi-minus tracks where there are distinctive peaks
in RNAi-plus tracks. On the basis of this, we programmatically retrieve
hit-regions by ‘subtracting’ RNAi-minus bedgraphs from RNAi-plus bedgraphs
in Pandas. All magic to get bedgraph files into Pandas is in functions
coalispr.bedgraph_analyze.process_bedgraphs.bin_bedgraphs() and
coalispr.bedgraph_analyze.genom.create_genom_indexes().
Bedgraph values for reads in consecutive intervals defined by BINSTEP are collected for each strand. All bins where background (RNAi-minus) reads are found are emptied (dropped, and set to 0) unless the mean of the values for these reads in RNAi-plus samples is much higher. The constant UNSPECLOG10 defines the cutoff for this (10^UNSPECLOG10). This threshold can be set, tested and altered.
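The filtering rule described above can be sketched in Pandas as follows. This is a minimal illustration, not the module's code: the function name `drop_unspecific`, the sample names, and the value given to UNSPECLOG10 are assumptions, and "much higher" is interpreted here as the RNAi-plus mean being at least 10^UNSPECLOG10 times the RNAi-minus mean in the same bin.

```python
import pandas as pd

UNSPECLOG10 = 1.0  # assumed value; cutoff is a 10**UNSPECLOG10-fold difference


def drop_unspecific(plus, minus, unspeclog10=UNSPECLOG10):
    """Zero bins with background reads unless RNAi-plus signal is much higher.

    `plus` and `minus` share one bin index; each column is a sample.
    (Hypothetical helper, written for illustration only.)
    """
    plus_mean = plus.mean(axis=1)
    minus_mean = minus.mean(axis=1)
    has_background = minus_mean > 0
    # Assumption: "much higher" means a ratio of at least 10**unspeclog10.
    much_higher = plus_mean >= minus_mean * 10 ** unspeclog10
    unspecific = has_background & ~much_higher
    filtered = plus.copy()
    filtered.loc[unspecific] = 0  # empty the unspecific bins
    return filtered


bins = pd.Index([0, 50, 100, 150], name="bin")
plus = pd.DataFrame(
    {"wt_1": [120.0, 4.0, 0.0, 30.0], "wt_2": [80.0, 6.0, 2.0, 50.0]}, index=bins
)
minus = pd.DataFrame(
    {"ago1": [2.0, 5.0, 0.0, 0.0], "rdp1": [4.0, 3.0, 0.0, 0.0]}, index=bins
)
specific = drop_unspecific(plus, minus)
```

In this toy data the bin at 0 contains background but is kept because the RNAi-plus mean (100) exceeds ten times the RNAi-minus mean (3); the bin at 50 is emptied because the two means are comparable.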
Data is stored persistently for reuse. Figures can be created easily. The program uses bedgraph values as generated by the STAR aligner.
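Attributes¶
- logger
- when_has_been_done
Classes¶
- Bedgraph_processor – Process bedgraphs and store processed data as binary files.
Functions¶
- bin_bedgraphs(framefilename, name, intervals) – Split bedgraphs in equal bins so that they can be compared to each other.
- process_bedgraphs(select, sequential, tag, saveas='merged') – Multiprocess bedgraphs.
- process_reference(sequential, tag) – Store reference bedgraphs and filter unspecific hits.
- filter_negative_from_reference(tag) – Split merged references along with specified data; store as binary files.
- process_unselected(sequential=False) – Store unselected bedgraphs for reads identified during counting of TAGBAM UNSPECIFIC bam files.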
Module Contents¶
- coalispr.bedgraph_analyze.process_bedgraphs.logger¶
- coalispr.bedgraph_analyze.process_bedgraphs.when_has_been_done¶
- coalispr.bedgraph_analyze.process_bedgraphs.bin_bedgraphs(framefilename, name, intervals)¶
Split bedgraphs in equal bins so that they can be compared to each other.
For this, split each bedgraph with respect to one and the same bin-sequence. Keep a connection between values and experiment: collect readings under the short ‘name’ heading.
Keep chromosomes separate because of the way indexing works (multi-indexed, multi-dimensional arrays need equal lengths).
- Parameters:
framefilename (str) – Complete name of file with raw bedgraph values (from bam alignment-file)
name (str) – Short name as header for column with summed bedgraph values
intervals (dict) – Dict of temporary filenames for chromosomal intervals.
- Returns:
Dictionary with as keys chromosome names and as values a dataframe with summed bedgraph values and a common genome index for a sample file
- Return type:
dict
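As a rough sketch of the binning idea, the example below sums bedgraph values per bin on a common per-chromosome index. It departs from the documented signature in several assumed ways: it takes an in-memory dataframe and a dict of chromosome lengths instead of `framefilename` and `intervals`, the names `bin_bedgraph` and `chrom_lengths` are hypothetical, and each interval is assigned to the bin containing its start position without splitting intervals that span a bin boundary.

```python
import pandas as pd

BINSTEP = 50  # assumed bin width in nucleotides


def bin_bedgraph(bedgraph, name, chrom_lengths, binstep=BINSTEP):
    """Return {chromosome: DataFrame} with bedgraph values summed per bin.

    Hypothetical stand-in for illustration; not the module's bin_bedgraphs().
    """
    frames = {}
    for chrom, length in chrom_lengths.items():
        # Common index: every bin start on this chromosome, same for all samples.
        index = pd.RangeIndex(0, length, binstep, name="bin")
        rows = bedgraph[bedgraph["chrom"] == chrom]
        # Simplification: an interval belongs to the bin holding its start.
        bin_start = (rows["start"] // binstep) * binstep
        summed = rows["value"].groupby(bin_start).sum()
        # Bins without reads get 0 so that all samples have equal length.
        frames[chrom] = summed.reindex(index, fill_value=0.0).to_frame(name)
    return frames


bedgraph = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr1", "chr2"],
        "start": [10, 30, 120, 60],
        "end": [20, 45, 130, 75],
        "value": [3.0, 2.0, 7.0, 1.0],
    }
)
binned = bin_bedgraph(bedgraph, "wt_1", {"chr1": 200, "chr2": 100})
```

Because every sample is reindexed onto the same bin starts, the resulting columns can later be placed side by side without further alignment.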
- class coalispr.bedgraph_analyze.process_bedgraphs.Bedgraph_processor¶
Bases: object
Process bedgraphs and store processed data as binary files.
Load bedgraph files into dataframes.
Bin bedgraph values (gives a dict of chr:dataframe); this is slow, so parallel (multi) processing is tried first, and the binned data are then stored in a namedtuple.
Merge directly, so that each binned bedgraph ends up in its own column. This reduces storage and replaces slow I/O (file writing after multiprocessing, followed by file loading) before an otherwise quick merge.
This therefore only needs to be run once.
- saveas: str¶
- tag: str¶
- sequential: bool¶
- notag: bool¶
- todo: list¶
- bedgraphdict1: dict¶
- bedgraphdict2: dict¶
- intervals: dict¶
- SDlist: list¶
- classmethod process_graphs(select, sequential, tag, saveas='merged')¶
Multiprocess bedgraphs.
- Parameters:
select (list or str) – Sample(s) to process bedgraphs for.
sequential (bool) – Flag to bypass multiprocessing if memory demand becomes too high.
tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
saveas (str) – Name for merged data, ‘merged’ or ‘reference_merged’.
- Returns:
None (if not tee) – Print message upon completion of pipeline.
pandas.DataFrame, pandas.DataFrame (if tee) – Merged dataframes.
- initiate_graphs_for_sample()¶
Initiation function for multiprocessing for defining globals in each spawned process.
- process_graphs_for_sample()¶
Worker function in multiprocessing; processes the bedgraphs of a sample.
- Parameters:
name (str) – Sample to process bedgraphs for.
- Returns:
SData – Tuple with fields “sample”, “sample_plus_frames”, and “sample_minus_frames”
- Return type:
namedtuple
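The initializer/worker pattern described for these two methods might look roughly like the sketch below. All function names (`init_worker`, `process_sample`, `process_all`) and the placeholder strings standing in for dataframes are assumptions for illustration; only the namedtuple field names and the idea of a `sequential` bypass come from this page.

```python
from collections import namedtuple
from multiprocessing import Pool

SData = namedtuple(
    "SData", ["sample", "sample_plus_frames", "sample_minus_frames"]
)

_settings = {}  # per-process globals, filled by the initializer


def init_worker(tag):
    """Run once in each spawned process to define shared, read-only state."""
    _settings["tag"] = tag


def process_sample(name):
    """Worker: return placeholder per-strand data for one sample."""
    tag = _settings["tag"]
    plus = {"chr1": f"{name}_{tag}_plus"}    # stands in for {chr: DataFrame}
    minus = {"chr1": f"{name}_{tag}_minus"}  # stands in for {chr: DataFrame}
    return SData(name, plus, minus)


def process_all(names, tag, sequential=False):
    """Process samples in parallel, or one by one when `sequential` is set."""
    if sequential:
        # Bypass multiprocessing, e.g. when memory demand becomes too high.
        init_worker(tag)
        return [process_sample(name) for name in names]
    with Pool(initializer=init_worker, initargs=(tag,)) as pool:
        return pool.map(process_sample, names)


if __name__ == "__main__":
    for sdata in process_all(["wt_1", "ago1"], tag="collapsed", sequential=True):
        print(sdata.sample, sdata.sample_plus_frames)
```

Defining the namedtuple and the worker at module level keeps them picklable, which the process pool needs in order to send results back to the parent process.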
- classmethod merge_sets()¶
Merge the data for bedgraph comparison; called from process_graphs.
Merge the plus data (dict1) and the minus data (dict2) of all samples.
This has to be done for each chromosome.
The sample name is the header of each data column to be merged.
Store the result in a folder with a telling name.
Notes
- Input is cls.SDlist:
List of namedtuples with sample data to merge: [cls.SData(dict1, dict2), cls.SData(dict1, dict2), ...]
- Returns:
Print message upon completion of function.
- Return type:
None
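A per-chromosome merge of this kind can be sketched with `pandas.concat`, placing one column per sample side by side. The helper names `merge_strand` and `frame` and the toy values are hypothetical; the sketch assumes that every sample was binned onto the same index and it leaves out the storage step.

```python
from collections import namedtuple

import pandas as pd

SData = namedtuple(
    "SData", ["sample", "sample_plus_frames", "sample_minus_frames"]
)


def merge_strand(sdlist, field):
    """Merge one strand of all samples into {chromosome: DataFrame}.

    Hypothetical helper for illustration; not the module's merge_sets().
    """
    merged = {}
    for chrom in getattr(sdlist[0], field):
        # One single-column frame per sample; the column header is the sample name.
        frames = [getattr(sdata, field)[chrom] for sdata in sdlist]
        merged[chrom] = pd.concat(frames, axis=1)
    return merged


index = pd.RangeIndex(0, 150, 50, name="bin")


def frame(name, values):
    """Build a single-column frame on the shared bin index."""
    return pd.DataFrame({name: values}, index=index)


sdlist = [
    SData(
        "wt_1",
        {"chr1": frame("wt_1", [5.0, 0.0, 7.0])},
        {"chr1": frame("wt_1", [0.0, 1.0, 0.0])},
    ),
    SData(
        "ago1",
        {"chr1": frame("ago1", [1.0, 0.0, 0.0])},
        {"chr1": frame("ago1", [0.0, 0.0, 2.0])},
    ),
]
plus_merged = merge_strand(sdlist, "sample_plus_frames")
minus_merged = merge_strand(sdlist, "sample_minus_frames")
```

Keeping chromosomes as separate dictionary entries avoids having to pad chromosomes of different lengths to a common index.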
- coalispr.bedgraph_analyze.process_bedgraphs.process_bedgraphs(select, sequential, tag, saveas='merged')¶
Multiprocess bedgraphs.
- Parameters:
select (list or str) – Sample(s) to process bedgraphs for.
tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
sequential (bool) – Flag to bypass multiprocessing if memory demand becomes too high.
saveas (str) – Name for merged data, ‘merged’ or ‘reference_merged’.
- Returns:
None (if not tee) – Print message upon completion of pipeline.
pandas.DataFrame, pandas.DataFrame (if tee) – Merged dataframes.
- coalispr.bedgraph_analyze.process_bedgraphs.process_reference(sequential, tag)¶
Store reference bedgraphs and filter unspecific hits.
- Parameters:
sequential (bool) – Flag to bypass multiprocessing if memory demand becomes too high.
tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
- Returns:
Print message upon completion of function.
- Return type:
None
- coalispr.bedgraph_analyze.process_bedgraphs.filter_negative_from_reference(tag)¶
Split merged references along with specified data; store as binary files.
- Parameters:
tag (str) – Flag to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
refplus_merg (dict, dict) – Dicts, for PLUS and MINUS strand, with data of merged references.
refminus_merg (dict, dict) – Dicts, for PLUS and MINUS strand, with data of merged references.
- Returns:
Print message upon completion of function.
- Return type:
None
- coalispr.bedgraph_analyze.process_bedgraphs.process_unselected(sequential=False)¶
Store unselected bedgraphs for reads identified during counting of TAGBAM UNSPECIFIC bam files.
Notes
- tag (str)
Flag TAGBAM to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
- Parameters:
sequential (bool) – Flag to bypass multiprocessing if memory demand becomes too high.
- Returns:
Print message upon completion of function.
- Return type:
None