coalispr.bedgraph_analyze.process_bedgraphs

Module to process and compare bedgraph data of small RNA-sequencing samples.

Taking RNAi data as an example, bedgraph signals obtained by sequencing RNA isolated from strains without active RNAi are treated as noise. In Cryptococcus (de)neoformans these are strains without a functioning Argonaute (Ago1), RNA-dependent RNA polymerase (Rdp1), or Dicer (Dcr1 and Dcr2). The reads in these samples are considered non-specific because no siRNAs were detected for these strains by gamma-labeling or by Northern blotting of the RNA used as input for sequencing. The noise most likely derives from molecules that were carried along in the procedure and needs to be treated as background. Such background signals are expected to be present in all data sets that have been obtained by the same method.

Comparing bedgraph files of collapsed reads in a genome browser shows distinctive gaps in RNAi-minus tracks where there are distinctive peaks in RNAi-plus tracks. On this basis, hit regions are retrieved programmatically by ‘subtracting’ RNAi-minus bedgraphs from RNAi-plus bedgraphs in Pandas. The work of getting bedgraph files into Pandas is done by the functions coalispr.bedgraph_analyze.process_bedgraphs.bin_bedgraphs() and coalispr.bedgraph_analyze.genom.create_genom_indexes().

Bedgraph values for reads in consecutive intervals defined by BINSTEP are collected for each strand. All bins in which background (RNAi-minus) reads are found are emptied (dropped and set to 0) unless the mean of the values for these bins in RNAi-plus samples is much higher. The constant UNSPECLOG10 defines the cutoff for this (10^UNSPECLOG10). This threshold can be set, tested and altered.
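The filtering rule described above can be illustrated with a minimal Pandas sketch. This is not the package's implementation: the function name `specific_bins`, the default value of `unspeclog10`, and the choice to compare the RNAi-plus mean against the RNAi-minus mean of the same bin are assumptions made for illustration only.

```python
import pandas as pd


def specific_bins(plus, minus, unspeclog10=1.0):
    """Return a copy of `plus` in which unspecific bins are set to 0.

    `plus` and `minus` are dataframes sharing one bin index, with one
    column per sample. The default `unspeclog10` is illustrative and is
    not the package's UNSPECLOG10 setting.
    """
    plus_mean = plus.mean(axis=1)
    minus_mean = minus.mean(axis=1)
    has_background = minus_mean > 0
    # Assumption: "much higher" means at least 10**unspeclog10 times
    # the mean background signal in the same bin.
    threshold = 10 ** unspeclog10
    keep = ~has_background | (plus_mean >= threshold * minus_mean)
    result = plus.copy()
    result.loc[~keep, :] = 0
    return result


bins = [0, 50, 100]
plus = pd.DataFrame({"wt1": [100, 5, 20], "wt2": [300, 15, 0]}, index=bins)
minus = pd.DataFrame({"ago1": [2, 4, 0], "rdp1": [4, 6, 0]}, index=bins)
filtered = specific_bins(plus, minus)
```

In this toy input the bin at 50 has background and a positive mean below the cutoff, so it is emptied; the bin at 0 exceeds the cutoff and the bin at 100 has no background, so both are kept.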

Data is stored persistently for reuse. Figures can be created easily. The program uses bedgraph values as generated by the STAR aligner.

Attributes

logger

Functions

`bin_bedgraphs`(framefilename, name)	Split bedgraphs in equal bins so that they can be compared to each other.
`process_graphs`(select, tag[, saveas, force])	Process bedgraphs and store processed data as pickle files.
`merge_sets`(select, tag[, saveasname, force])	For bedgraph comparison merge the data.
`process_reference`(tag[, minimal, force])	Pickle reference bedgraphs and filter unspecific hits.
`merge_reference`(tag[, tee, force])	Reference bedgraphs are combined but kept separate from the data.
`filter_negative_from_reference`(tag, refplus_merg, ...)	Split merged references along with specified data; store as pickle files.
`show_chr`(chrnam, setlist, tag[, refs, title, unsel, ...])	Plot all reads for both chromosome strands in one figure.
`show_specific_chr`(chrnam, setlist, tag[, refs, title, ...])	Plot specific reads for both chromosome strands in one figure.
`show_unspecific_chr`(chrnam, setlist, tag[, refs, ...])	Plot unspecific reads for both chromosome strands in one figure.

Module Contents

coalispr.bedgraph_analyze.process_bedgraphs.logger

coalispr.bedgraph_analyze.process_bedgraphs.bin_bedgraphs(framefilename, name)

Split bedgraphs in equal bins so that they can be compared to each other.

For this, each bedgraph is split with respect to one and the same bin sequence. The connection between values and experiment is kept by collecting readings under the short ‘name’ heading.

Chromosomes are kept separate because of the way indexing works (multi-indexed, multi-dimensional arrays need equal lengths).

Parameters:

framefilename (str) – Complete name of file with raw bedgraph values (from bam alignment-file)
name (str) – Short name as header for column with summed bedgraph values

Returns:

Dictionary with chromosome names as keys and, as values, dataframes with summed bedgraph values and a common genome index for a sample file

Return type:

dict
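As a rough illustration of the binning idea (not the actual code of bin_bedgraphs()), the sketch below sums bedgraph values per bin and per chromosome. The function name `bin_bedgraph_sketch`, the bin width of 50, and the simplification of assigning each interval to the bin that contains its start are assumptions. Unlike the real function, it only returns bins that hold data instead of a common genome index.

```python
import io

import pandas as pd

BINSTEP = 50  # assumed bin width, for illustration only


def bin_bedgraph_sketch(handle, name, binstep=BINSTEP):
    """Sum bedgraph values per bin; return a dict of chr:dataframe."""
    df = pd.read_csv(
        handle, sep="\t", header=None,
        names=["chr", "start", "end", "value"],
    )
    # Simplification: an interval belongs to the bin containing its start.
    df["bin"] = (df["start"] // binstep) * binstep
    binned = {}
    for chrnam, chunk in df.groupby("chr"):
        summed = chunk.groupby("bin")["value"].sum()
        # Keep the link to the experiment via the short column name.
        binned[chrnam] = summed.to_frame(name)
    return binned


example = io.StringIO(
    "chr1\t0\t20\t1.0\n"
    "chr1\t20\t45\t2.0\n"
    "chr1\t60\t80\t4.0\n"
    "chr2\t110\t130\t3.0\n"
)
binned = bin_bedgraph_sketch(example, "sample1")
```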

coalispr.bedgraph_analyze.process_bedgraphs.process_graphs(select, tag, saveas='data', force=False)

Process bedgraphs and store processed data as pickle files.

Load bedgraph files into dataframes.
Bin bedgraph values (gives a dict of chr:dataframe); this is slow, so the result is saved as a pickle for reuse.

This therefore only needs to be run once.

Parameters:

select (list) – List of samples to process bedgraphs for.
tag (str) – Flag TAG to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
saveas (str) – Name for storing data, ‘data’ or ‘reference_data’.

Returns:

Print message upon completion of function.

Return type:

None
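The store-once, reuse-later pattern described for this function can be sketched with the standard library alone. The helper name `load_or_compute` is hypothetical, and the reading of `force` as "recompute even when a pickle exists" is an assumption, since the parameter is not documented above.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute, force=False):
    """Return pickled data from `path`, or compute and store it.

    Assumption: `force=True` ignores an existing pickle and recomputes.
    """
    path = Path(path)
    if path.exists() and not force:
        with path.open("rb") as handle:
            return pickle.load(handle)
    result = compute()  # the slow step, e.g. binning all bedgraphs
    with path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

A second call with the same path returns the stored result without running the slow step again.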

coalispr.bedgraph_analyze.process_bedgraphs.merge_sets(select, tag, saveasname=None, force=False)

For bedgraph comparison merge the data.

Merge plus files; merge minus files.
This has to be done for each chromosome.
Store in dictionaries (only for non-overlapping sets, to save space).
Allow merged sets outside the major dataset.

Parameters:

select (list) – List of samples to process bedgraphs for.
tag (str) – Flag TAG to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
saveasname (str) – Name for storing data, like ‘merged’.

Returns:

Print message upon completion of function.

Return type:

None
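Merging per chromosome could look like the following sketch for one strand. It is an illustration, not the package's code: the function name `merge_per_chromosome` and the layout of the input (sample name mapped to a dict of chr:dataframe with one column per sample) are assumptions.

```python
import pandas as pd


def merge_per_chromosome(samples):
    """Combine single-sample frames into one frame per chromosome.

    `samples` maps a sample name to a dict of chr:dataframe, as a binning
    step might produce for one strand. Bins missing in a sample become 0.
    """
    chroms = sorted({c for per_chr in samples.values() for c in per_chr})
    merged = {}
    for chrnam in chroms:
        frames = [
            per_chr[chrnam]
            for per_chr in samples.values()
            if chrnam in per_chr
        ]
        # Align on the bin index; one column per sample.
        merged[chrnam] = pd.concat(frames, axis=1).fillna(0).sort_index()
    return merged


plus_strand = {
    "s1": {"chr1": pd.DataFrame({"s1": [1, 2]}, index=[0, 50])},
    "s2": {"chr1": pd.DataFrame({"s2": [5, 7]}, index=[50, 100])},
}
merged_plus = merge_per_chromosome(plus_strand)
```

The same call would be repeated for the minus-strand files, so that each strand ends up with its own dictionary.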

coalispr.bedgraph_analyze.process_bedgraphs.process_reference(tag, minimal=True, force=False)

Pickle reference bedgraphs and filter unspecific hits.

Set ‘tee’ to True when stored data has to be used directly which happens when storing the data in this combined run.

Parameters:

tag (str) – Flag TAG to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
minimal (bool) – Flag defining negative-samples list.

Returns:

Print message upon completion of function.

Return type:

None

coalispr.bedgraph_analyze.process_bedgraphs.merge_reference(tag, tee=False, force=False)

Reference bedgraphs are combined but kept separate from the data.

Parameters:

tag (str) – Flag TAG to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
tee (bool) – Flag to return dataframes for immediate use (storing them does not allow for this).

Returns:

When tee is True, two dicts, one for each strand, each with one dataframe per chromosome holding merged data for reference samples. When tee is False, nothing is returned and a message is printed upon completion of the function.

Return type:

dict, dict (or None when tee is False)

coalispr.bedgraph_analyze.process_bedgraphs.filter_negative_from_reference(tag, refplus_merg, refminus_merg, minimal=True)

Split merged references along with specified data; store as pickle files.

Parameters:

tag (str) – Flag TAG to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
refplus_merg (dict, dict) – Dicts, for PLUS and MINUS strand, with data of merged references.
refminus_merg (dict, dict) – Dicts, for PLUS and MINUS strand, with data of merged references.
minimal (bool) – Flag defining negative-samples list.

Returns:

Print message upon completion of function.

Return type:

None

coalispr.bedgraph_analyze.process_bedgraphs.show_chr(chrnam, setlist, tag, refs=False, title=PLOTALL, unsel=False, dowhat='show', scale=None, lim=None, ridx=None, side=None)

Plot all reads for both chromosome strands in one figure.

These are interactive plots with all signals, i.e. unfiltered reads.

Parameters:

chrnam (str) – Name of chromosome to display bedgraph traces for.
setlist (list) – List of samples to display traces for.
setname (str) – Name of displayed sample set; will be part of figure title.
tag (str) – Flag TAG to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
refs (bool) – Flag to include (when True) reference data.
unsel (bool) – Flag to include (when True) unselected data.
title (str) – First section of figure title, set to PLOTALL.
dowhat (str) – Instruction to ‘show’ (default), ‘save’ (as .png), ‘savesvg’ or ‘return’ the figure.
scale (str) – Set scale of y-axis to linear or log2.
lim (int) – Set limit of y-axis.
ridx (list) – List with boundaries for a region to be shown; if None, the whole chromosome is shown.
side (list) – The sidepatcheslist, a list describing groups of samples to be shown under separate headings in the side panel.
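How a region restriction such as `ridx` might act on binned data can be sketched as follows. This is an assumption-laden illustration, not the plotting code itself: the helper name `select_region` is hypothetical, and `ridx` is taken to be a two-element list of positions on a sorted bin index.

```python
import pandas as pd


def select_region(frame, ridx=None):
    """Return the rows of `frame` that fall within the `ridx` boundaries.

    Assumption: `ridx` is [start, end] in genome coordinates and the
    frame index holds sorted bin positions. With None, all rows are kept.
    """
    if ridx is None:
        return frame
    lower, upper = min(ridx), max(ridx)
    # Label-based slicing on a sorted index includes both boundaries.
    return frame.loc[lower:upper]


traces = pd.DataFrame(
    {"sample1": [0, 3, 8, 2, 0]}, index=[0, 50, 100, 150, 200]
)
region = select_region(traces, [60, 160])
```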

coalispr.bedgraph_analyze.process_bedgraphs.show_specific_chr(chrnam, setlist, tag, refs=False, title=PLOTSPEC, unsel=False, dowhat='show', scale=None, lim=None, ridx=None, side=None)

Plot specific reads for both chromosome strands in one figure.

Interactive plots with all specific signals, i.e. filtered reads, without signals that overlap with negative controls.

Parameters:

chrnam (str) – Name of chromosome to display bedgraph traces for.
setlist (list) – List of samples to display traces for.
setname (str) – Name of displayed sample set; will be part of figure title.
tag (str) – Flag TAG to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
refs (bool) – Flag to include (when True) reference data.
title (str) – First section of figure title, set to PLOTSPEC.
dowhat (str) – Instruction to ‘show’ (default), ‘save’ (as .png), ‘savesvg’ or ‘return’ the figure.
scale (str) – Set scale of y-axis to linear or log2.
lim (int) – Set limit of y-axis.
ridx (list) – List with boundaries for a region to be shown; if None, the whole chromosome is shown.
side (list) – The sidepatcheslist, a list describing groups of samples to be shown under separate headings in the side panel.

coalispr.bedgraph_analyze.process_bedgraphs.show_unspecific_chr(chrnam, setlist, tag, refs=False, unsel=False, title=PLOTUNSP, dowhat='show', scale=None, lim=None, ridx=None, side=None)

Plot unspecific reads for both chromosome strands in one figure.

Interactive plots with all unspecific signals, i.e. negative control data and reads that overlap with negative-control signals but do not meet the thresholds. This will also contain reference info if that exists.

Parameters:

chrnam (str) – Name of chromosome to display bedgraph traces for.
setlist (list) – List of samples to display traces for.
setname (str) – Name of displayed sample set; will be part of figure title.
tag (str) – Flag TAG to indicate kind of aligned-reads, TAGUNCOLL or TAGCOLL.
refs (bool) – Flag to include (when True) reference data.
unsel (bool) – Flag to include (when True) unselected data from negative control samples.
title (str) – First section of figure title, set to PLOTUNSP.
dowhat (str) – Instruction to ‘show’ (default), ‘save’ (as .png), ‘savesvg’ or ‘return’ the figure.
scale (str) – Set scale of y-axis to linear or log2.
lim (int) – Set limit of y-axis.
ridx (list) – List with boundaries for a region to be shown; if None, the whole chromosome is shown.
side (list) – The sidepatcheslist, a list describing groups of samples to be shown under separate headings in the side panel.