coalispr.bedgraph_analyze.compare

Module with functions used to compare bedgraph-data.

Attributes

Classes

Regions_gatherer

Collection of methods to obtain and save region boundaries for counting.

Functions

get_indexes(df[, keep])

Provide indexes for specific and unspecific data sets.

specific(chrnam, tag, setlist[, maincut, keep])

Get merged dataframes with specific reads for a chromosome.

unspecific(chrnam, tag, setlist[, maincut, keep])

Get merged dataframes with unspecific reads for a chromosome.

Module Contents

coalispr.bedgraph_analyze.compare.encoding = 'utf-8'
coalispr.bedgraph_analyze.compare.logger
coalispr.bedgraph_analyze.compare.get_indexes(df, keep='exps')

Provide indexes for specific and unspecific data sets.

Parameters:
  • df (pandas.DataFrame) – The pandas dataframe from which read-indexes are retrieved.

  • keep (str (default: 'exps')) – Indicates the group of reads (experiments or references) for which indexes need to be returned.

Returns:

Tuple of pandas indexes for the specific and unspecific reads, respectively, present in the input dataframe.

Return type:

pandas.Index, pandas.Index
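
The idea of returning two indexes can be sketched as below. This is a minimal illustration, not coalispr's implementation: it assumes sample names are the dataframe columns and that the names of the unspecific (negative control) samples are passed in by the caller. The helper split_indexes and all sample names are hypothetical, and the keep option is not modelled.

```python
import pandas as pd


def split_indexes(df, unspecific_names):
    """Return (specific, unspecific) column indexes of df.

    Hypothetical helper: columns named in `unspecific_names` are treated as
    negative controls; all remaining columns count as specific samples.
    """
    unspecific = df.columns.intersection(unspecific_names)
    specific = df.columns.difference(unspecific)
    return specific, unspecific


# Toy bedgraph-like frame: rows are bin start positions, columns are samples.
df = pd.DataFrame(
    {"wt_1": [5.0, 0.0], "mut_1": [7.0, 1.0], "neg_1": [0.5, 0.2]},
    index=[0, 50],
)
spec_idx, unspec_idx = split_indexes(df, ["neg_1"])
```

Both return values are pandas.Index objects, so they can be used directly to select columns, for example df[spec_idx].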

coalispr.bedgraph_analyze.compare.specific(chrnam, tag, setlist, maincut=UNSPECLOG10, keep='exps')

Get merged dataframes with specific reads for a chromosome.

Data are returned for both strands for a given list of samples.

Parameters:
  • chrnam (str) – Name of chromosome to return data for.

  • tag (str) – Type of reads, TAG (default) or TAGCOLL (‘collapsed’) or TAGUNCOLL (‘uncollapsed’).

  • setlist (list) – Group of samples for which data is returned.

  • maincut (float) – Exponent for log10-difference between specific and non-specific reads.

  • keep (str) – Flag to indicate what samples to keep in a dataset that is compared to UNSPECIFIC reads.

Returns:

List of two dataframes with specific reads from merged samples, for the PLUS and MINUS strands respectively.

Return type:

list of pd.DataFrames
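
How maincut acts as a log10 exponent can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used by coalispr: it assumes that a bin counts as specific when its signal exceeds the negative-control signal by more than a factor of 10^maincut. The helper split_by_maincut and all values are hypothetical.

```python
import pandas as pd


def split_by_maincut(signal, background, maincut):
    """Split bins by whether signal exceeds background by 10**maincut.

    Hypothetical helper; `signal` and `background` are Series of read levels
    per bin that share the same index.
    """
    is_specific = signal > background * 10 ** maincut
    return signal[is_specific], signal[~is_specific]


# Toy read levels for three bins starting at positions 0, 50 and 100.
signal = pd.Series([100.0, 3.0, 40.0], index=[0, 50, 100])
background = pd.Series([2.0, 2.0, 10.0], index=[0, 50, 100])
specific_bins, unspecific_bins = split_by_maincut(signal, background, maincut=1.0)
```

With maincut=1.0 only the first bin qualifies, because 100 is more than ten times its background of 2, while 3 and 40 do not exceed ten times their backgrounds.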

coalispr.bedgraph_analyze.compare.unspecific(chrnam, tag, setlist, maincut=UNSPECLOG10, keep='exps')

Get merged dataframes with unspecific reads for a chromosome.

Data are returned for both strands for a given list of samples.

Parameters:
  • chrnam (str) – Name of chromosome to return data for.

  • tag (str) – Type of reads, TAG (default) or TAGCOLL (‘collapsed’) or TAGUNCOLL (‘uncollapsed’).

  • setlist (list) – Group of samples for which data is returned.

  • maincut (float) – Exponent for log10-difference between specific and non-specific reads.

  • keep (str) – Flag to indicate what samples to keep in a dataset that is compared to UNSPECIFIC reads.

Returns:

List of two dataframes with unspecific reads from merged samples, for the PLUS and MINUS strands respectively.

Return type:

list of pd.DataFrames

class coalispr.bedgraph_analyze.compare.Regions_gatherer

Collection of methods to obtain and save region boundaries for counting.

The output consists of TSV text files with tab-separated values that define regions with specified reads. Discarded samples (CAT_D) are bypassed, but all other samples of the dataset are used for specifying reads as SPECIFIC or UNSPECIFIC.

Specifying is based on parameters that are defined in the configuration file 3_EXP.txt and used to assess read levels per BIN:

  • LOG2BG in LOG2BGTST (for setting base threshold);

  • UNSPECLOG10 in UNSPECTST (for setting cutoff to mark peak differences between NEGATIVE and POSITIVE as relevant);

  • USEGAPS or UNSPCGAPS in UGAPSTST (for setting length of a contiguous region, i.e. region counted as one, minimal 1 x BIN).

For very short reads (like for miRNA genes) that map near one of the bin edges, regions will not be detectable with default parameters only. To enable counting of these RNAs, parameters BINSTEP and MIRNAPKBUF can be configured.

Parameters:
  • tag (str) – Type of reads, TAG (default) or TAGCOLL (‘collapsed’) or TAGUNCOLL (‘uncollapsed’).

  • dowhat (str) – Instruction how to output data; ‘tsv’: write tab-separated values to a text file; ‘test’: print total number of regions found to test_intervals.tsv.

  • cutoff (float) – Threshold (2^cutoff) for read-signals above which reads are considered (default: LOG2BG).

  • maincut (float) – Threshold (10^maincut) for difference between read-signals from aligned-reads in wild type or mutant samples and those of unspecific (negative control) samples above which reads are considered ‘specific’. (default: UNSPECLOG10).

  • gaps (int) – Length of tolerated sections without reads separating peak-regions that form a contiguous segment of specified reads (default: USEGAPS).
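
The interplay of cutoff and gaps can be illustrated with a short sketch. It is a simplified model under stated assumptions, not the class's actual algorithm: bins are kept when their read level exceeds 2^cutoff, and kept bins separated by at most gaps positions without reads are joined into one region. The helper merge_bins, the bin size and all read levels are hypothetical.

```python
def merge_bins(bin_starts, binsize, gaps):
    """Merge sorted bin start positions into (start, end) regions.

    Hypothetical helper: bins separated by at most `gaps` positions without
    reads are joined into one contiguous region.
    """
    regions = []
    for start in bin_starts:
        end = start + binsize
        if regions and start - regions[-1][1] <= gaps:
            # Close enough to the previous region: extend it.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


# Toy read levels per bin (bin start position -> level); bin size is 50.
levels = {0: 40.0, 50: 35.0, 100: 1.0, 150: 2.0, 200: 60.0, 250: 50.0, 600: 33.0}
cutoff = 5  # keep bins with a level above 2**5 = 32
kept = sorted(pos for pos, level in levels.items() if level > 2 ** cutoff)

wide = merge_bins(kept, binsize=50, gaps=100)
narrow = merge_bins(kept, binsize=50, gaps=50)
```

A larger gaps value tolerates the empty stretch between positions 100 and 200 and reports one region there, whereas the smaller value splits it in two; this is why the number of regions depends on the chosen settings.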

indxnams = ['TAG', 'KIND', 'UNSPECLOG10', 'LOG2BG', 'THRESHOLD', 'USEGAPS']
thrsh
class PIndx

Bases: tuple

TAG
KIND
UNSPECLOG10
LOG2BG
THRESHOLD
USEGAPS
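
A tuple subclass with these named fields behaves like a named tuple. The sketch below assumes PIndx is built from indxnams with collections.namedtuple, which the documentation above suggests but does not state; the field values are hypothetical.

```python
from collections import namedtuple

indxnams = ['TAG', 'KIND', 'UNSPECLOG10', 'LOG2BG', 'THRESHOLD', 'USEGAPS']

# Assumed construction; the real class may be defined differently.
PIndx = namedtuple('PIndx', indxnams)

# Hypothetical values, for illustration only.
row = PIndx(TAG='collapsed', KIND='specific', UNSPECLOG10=0.8, LOG2BG=4,
            THRESHOLD=16, USEGAPS=100)
```

Fields can then be read by name (row.KIND) or by position (row[1]).
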
classmethod get_all_specific_regions(tag, dowhat='tsv')

Get specific regions for a dataset with configured settings.

classmethod get_all_unspecific_regions(tag, dowhat='tsv')

Get unspecific regions for a dataset with configured settings.

classmethod check_intervals(idxs_list)

Function to be run in parallel on different idxs_lists.

classmethod test_intervals(logs10=None, gaps=None, thresh=None, force=False)

Try out various settings for UNSPECLOG10, USEGAPS and LOG2BG.

The output consists of TSV text files with tab-separated values for TAG, KIND, UNSPECLOG10, LOG2BG, USEGAPS, THRESHOLD in relation to the number of independent regions (REGS) of specified reads that are picked up with combinations of these settings. Produces input for show_regions_vs_settings.

Uses the inner function _test_intervals() to enable a timer.

Parameters:
  • logs10 (list) – List of possible settings for UNSPECLOG10, set to UNSPECTST.

  • gaps (list) – List of possible settings for USEGAPS, set to UGAPSTST.

  • thresh (tuple) – Possible settings for LOG2BG, set to LOG2BGTST.
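
Trying out combinations of settings amounts to walking a parameter grid. The sketch below only shows how such a grid could be enumerated; it is not the method's implementation, and the candidate values are hypothetical stand-ins for what UNSPECTST, UGAPSTST and LOG2BGTST would provide.

```python
from itertools import product

# Hypothetical candidate values; in coalispr these would come from the
# configured UNSPECTST, UGAPSTST and LOG2BGTST settings.
logs10 = [0.6, 0.8, 1.0]
gaps = [50, 100]
thresh = (4, 5)

# One entry per combination; a REGS count would be determined for each.
combinations = [
    {"UNSPECLOG10": l10, "USEGAPS": gap, "LOG2BG": log2bg}
    for l10, gap, log2bg in product(logs10, gaps, thresh)
]
```

The grid grows multiplicatively (here 3 x 2 x 2 = 12 combinations), which is presumably why the interval checks are meant to be run in parallel.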