coalispr.bedgraph_analyze.genom¶

This module produces genome descriptors used for parsing and imaging.

Attributes¶

logger

Classes¶

`Track`	A class to represent a track of segments.
`SegmentTrack`	A class to represent a track of countable segments.
`GTFtrack`	A Track class to represent a track with annotation info from a GTF file.

Functions¶

`chroms`()	Lists chromosome names.
`get_lengths`()	Gives a dict of chromosome lengths.
`smallest2chroms`()
`create_genom_indexes`([do_interval])	Generates range or interval indexes and a chromosome list.
`get_genom_indexes`([do_interval])	Return indexes to use.
`chr_test`()	Get a chromosome name as found in original gtf/bedgraphs.
`retrieve_chr_regions_from_tsv`(chrnam[, tresh, ...])	Read specified regions for a chromosome from tsv.
`retrieve_all_chr_regions_from_tsv`(chrnam[, tresh, ...])	Read all specified regions for a chromosome from tsv files.
`ref_at_clickpoint`(chrnam, clickp, kind)	Get reference label for segment with clickpoint.
`ref_in_segment`(chrnam, segm, kind, ref)	Get reference labels for segment.

Module Contents¶

coalispr.bedgraph_analyze.genom.logger¶

coalispr.bedgraph_analyze.genom.chroms()¶

Lists chromosome names.

Returns:

A list of strings referring to numbers/names of all chromosomes that form the reference genome used for mapping cDNA reads.

The returned list is created by create_genom_indexes.

Return type:

list

coalispr.bedgraph_analyze.genom.get_lengths()¶

Gives a dict of chromosome lengths.

Notes

For example, this dict of chromosome names vs. their lengths is returned for EXP jec21.

{
'1': 2300533, '2': 1632307, '3': 2105742, '4': 1783081, '5': 1507550,
'6': 1438950, '7': 1347793, '8': 1194300, '9': 1178688, '10': 1085720,
'11': 1019846, '12': 906719, '13': 787999, '14': 762694,
**CHRXTRA**: int(length chrxtra),
}

The lengths file can be extended with an artificial chromosome (CHRXTRA) that collates features (possibly) added by gene modification, such as the sequences of selectable markers, plasmids, promoters, terminators, CRISPR/Cas components or cleavage guides. It could also comprise known features not (yet) incorporated in the reference genome annotation.

Returns:: dict – A dict with lengths of chromosomes after parsing lengths files.
Return type:: {str: int}
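As a minimal sketch of how such a dict could be assembled, the snippet below parses a lengths file and appends an extra chromosome. The file layout (two tab-separated columns, name and length), the helper name `read_lengths` and the extra chromosome name and length are assumptions made for illustration; they are not taken from coalispr.

```python
import io


def read_lengths(handle, extra=None):
    """Parse 'name<TAB>length' lines into a {str: int} dict (hypothetical helper)."""
    lengths = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    if extra:
        # Append an artificial chromosome, comparable to CHRXTRA.
        lengths.update(extra)
    return lengths


# In-memory stand-in for a lengths file with three jec21 chromosomes.
demo = io.StringIO("1\t2300533\n2\t1632307\n14\t762694\n")
lengths = read_lengths(demo, extra={"extra": 5000})  # name and length are made up
print(lengths)
```

Chromosome names are kept as strings, which matches the documented return type of {str: int}.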

coalispr.bedgraph_analyze.genom.smallest2chroms()¶

coalispr.bedgraph_analyze.genom.create_genom_indexes(do_interval=False)¶

Generates range or interval indexes and a chromosome list.

Notes

For peak comparisons all reads documented in a bedgraph as fragments (via start, end) are gathered in bins of size BINSTEP.

Binning relies on an existing index. Bedgraph files only come with fragment entries. For comparison, they need to have a common interval index.

This function generates this common index; it creates an interval index for each chromosome from pd.interval_range. It splits up the whole genome into bins of size BINSTEP.

Bedgraph files are then reindexed on this common index. Because the index is used frequently, a copy is kept in memory.

Idea from https://pbpython.com/pandas-qcut-cut.html

Parameters:: do_interval (bool (default: False)) – If True, create interval index (needed for binning bedgraphs). If False, create range indexes for chromosomes.
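The binning idea described in the notes can be sketched in plain Python. This is an illustration only: it uses lists and dicts as a stand-in for pd.interval_range, the helper names `make_bins` and `bin_fragments` are hypothetical, and the rule that a fragment adds its value to every bin it overlaps is an assumption, not necessarily how coalispr aggregates reads.

```python
BINSTEP = 50  # bin size used in the examples of this documentation


def make_bins(length, step=BINSTEP):
    """Left-closed bins [start, start + step) for every whole step within length."""
    return [(start, start + step) for start in range(0, length - step + 1, step)]


def bin_fragments(fragments, length, step=BINSTEP):
    """Sum fragment values per bin (assumed rule: add value to each overlapped bin)."""
    totals = {b: 0.0 for b in make_bins(length, step)}
    for start, end, value in fragments:
        first = start // step
        last = (end - 1) // step  # bedgraph ends are exclusive
        for i in range(first, last + 1):
            key = (i * step, (i + 1) * step)
            if key in totals:  # ignore anything beyond the last whole bin
                totals[key] += value
    return totals


# Two made-up bedgraph fragments (start, end, value) on a 170 bp chromosome.
fragments = [(10, 40, 2.0), (45, 105, 1.0)]
binned = bin_fragments(fragments, length=170)
print(binned)  # {(0, 50): 3.0, (50, 100): 1.0, (100, 150): 1.0}
```

Once every bedgraph is expressed on the same bins, values from different files can be compared position by position.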

coalispr.bedgraph_analyze.genom.get_genom_indexes(do_interval=False)¶

Return indexes to use.

With a BINSTEP of 50, the genome of EXP jec21 (~19 Mbp; 19051922 base pairs) is covered by an index of 381031 lines:

interval ranges¶
	chr	range
0	1	[0, 50)	<—- start interval
1	1	[50, 100)
2	1	[100, 150)
…
46008	1	[2300400, 2300450)
46009	1	[2300450, 2300500)	<—- end chromosome 1
46010	2	[0, 50)	<—- start interval
46011	2	[50, 100)
…
…
365776	13	[787850, 787900)
365777	13	[787900, 787950)	<—- end chromosome 13
365778	14	[0, 50)	<—- start interval
365779	14	[50, 100)
…
381029	14	[762550, 762600)
381030	14	[762600, 762650)	<—- end chromosome 14

Parameters:: do_interval (bool (default: False)) – If True, return interval ranges (needed for binning bedgraphs). The default returns range indexes for chromosomes: each chromosome name is linked to the section of the third column covering that chromosome, keeping only the start value of each interval (0, 50, 100, ...).
Returns:: A dict of chromosome names with range indexes or interval ranges.
Return type:: dict
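The row numbers in the table can be reproduced from the chromosome lengths listed under get_lengths. The sketch below is arithmetic only and does not call coalispr; it assumes, as the table suggests (chromosome 1 ends at [2300450, 2300500) although its length is 2300533), that a trailing partial bin is not indexed.

```python
BINSTEP = 50

# Chromosome lengths for EXP jec21 as listed under get_lengths (without CHRXTRA).
LENGTHS = {
    "1": 2300533, "2": 1632307, "3": 2105742, "4": 1783081, "5": 1507550,
    "6": 1438950, "7": 1347793, "8": 1194300, "9": 1178688, "10": 1085720,
    "11": 1019846, "12": 906719, "13": 787999, "14": 762694,
}

# Whole bins per chromosome; a trailing partial bin is dropped (assumption).
bins_per_chrom = {name: length // BINSTEP for name, length in LENGTHS.items()}

# Row number at which each chromosome starts in the combined index.
first_row = {}
row = 0
for name, nbins in bins_per_chrom.items():
    first_row[name] = row
    row += nbins
total_rows = row

# Range index per chromosome: only the start value of each interval.
range_index = {
    name: range(0, nbins * BINSTEP, BINSTEP) for name, nbins in bins_per_chrom.items()
}

print(sum(LENGTHS.values()))  # 19051922
print(total_rows)             # 381031
print(first_row["2"], first_row["14"])  # 46010 365778
```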

coalispr.bedgraph_analyze.genom.chr_test()¶

Get a chromosome name as found in original gtf/bedgraphs.

Returns:: Name of first chromosome in the genome.
Return type:: str

coalispr.bedgraph_analyze.genom.retrieve_chr_regions_from_tsv(chrnam, tresh=LOG2BG, maincut=UNSPECLOG10, tag=TAG, kind=SPECIFIC, usecols=[LOWR, UPPR, SPAN])¶

Read specified regions for a chromosome from tsv.

Get stored information on upper and lower boundaries for regions of contiguous reads. These have previously been obtained by assessing bedgraph data and saved as TSV files by the gather_regions functions in module coalispr.bedgraph_analyze.compare.

Parameters:

chrnam (str) – Name of the chromosome for which the information has been stored.
tresh (int (default: LOG2BG)) – Applied threshold above which values have been taken into account.
maincut (float (default: UNSPECLOG10)) – Minimal log10 difference applied between specific and unspecific values.
tag (str (default: TAG)) – Tag to indicate the kind of mapped reads analysed: collapsed or uncollapsed.
kind (str (default: SPECIFIC)) – The kind of specified reads stored: specific, unspecific or both.
usecols (list) – Columns to keep in returned dataframes

Returns:

A tuple of pandas dataframes, one for each strand of chromosome chrnam.

Return type:

pandas.DataFrame, pandas.DataFrame
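To illustrate the shape of the result (one table per strand, restricted to the requested columns), here is a rough stdlib sketch. The column names `lower`, `upper` and `span`, the extra `note` column, the sample values and the helper `read_regions` are all assumptions; the real function returns pandas dataframes and derives file names from its tresh, maincut, tag and kind arguments in a way not shown here.

```python
import csv
import io


def read_regions(handle, usecols):
    """Read a TSV and keep only the requested columns, as integers (hypothetical)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{col: int(row[col]) for col in usecols} for row in reader]


# In-memory stand-ins for the stored TSV files of the two strands.
plus_tsv = io.StringIO("lower\tupper\tspan\tnote\n100\t250\t150\ta\n400\t450\t50\tb\n")
minus_tsv = io.StringIO("lower\tupper\tspan\tnote\n900\t1000\t100\tc\n")

cols = ["lower", "upper", "span"]  # plays the role of usecols
plus, minus = read_regions(plus_tsv, cols), read_regions(minus_tsv, cols)
print(plus)
print(minus)
```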

coalispr.bedgraph_analyze.genom.retrieve_all_chr_regions_from_tsv(chrnam, tresh=LOG2BG, maincut=UNSPECLOG10, tag=TAG)¶

Read all specified regions for a chromosome from tsv files.

Get stored information on upper and lower boundaries for all regions of contiguous reads (previously obtained by assessing bedgraphs and saved as TSV by gather*regions of coalispr.bedgraph_analyze.compare).

Parameters:

chrnam (str) – Name of the chromosome for which the information has been stored.
tresh (int (default: LOG2BG)) – Applied threshold above which values have been taken into account.
maincut (float (default: UNSPECLOG10)) – Minimal log10 difference applied between specific and unspecific values.
tag (str (default: TAG)) – Tag to indicate the kind of mapped reads analysed: collapsed or uncollapsed.

Returns:

A tuple of pandas dataframes, one for each strand of chromosome chrnam.

Return type:

pandas.DataFrame, pandas.DataFrame

class coalispr.bedgraph_analyze.genom.Track(chrnam)¶

A class to represent a track of segments.

A track provides input for a matplotlib.collections.BrokenBarHCollection used in coalispr.bedgraph_analyze.bedgraph_plotting.

chrnam¶

The name of the chromosome for which the track is made

Type:: str

df1, df2

Tuple of pandas dataframes, for strand 1 (PLUS) and strand 2 (MINUS).

Type:: pandas.DataFrame, pandas.DataFrame

df¶

Pandas dataframe, used for obtaining segment information.

Type:: pandas.DataFrame

chrnam¶

df = None¶

textlist(clickp)¶

Show list of information for regions under the cursor.

Parameters:: clickp (int) – X-coordinate of point under cursor; registered after mouse-click.
Returns:: List of information associated with segments under the cursor.
Return type:: list

get_segments(df)¶

Return list of segments with information that form the track.

Parameters:: df (pandas.DataFrame) – A pandas dataframe with segment information
Returns:: List of lower boundaries and length of segments, parseable by matplotlib.collections.BrokenBarHCollection.
Return type:: list
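A broken-bar collection in matplotlib takes its horizontal extents as (xmin, xwidth) pairs, so boundaries have to be converted to a start plus a length. A minimal sketch of that conversion, assuming the segment information is available as (lower, upper) pairs; the helper name `to_xranges` and the sample values are hypothetical:

```python
def to_xranges(rows):
    """Turn (lower, upper) boundaries into (lower, length) pairs."""
    return [(lower, upper - lower) for lower, upper in rows]


regions = [(100, 250), (400, 450)]  # made-up region boundaries
xranges = to_xranges(regions)
print(xranges)  # [(100, 150), (400, 50)]
```

A list like `xranges` has the form expected for the `xranges` argument of a broken-bar plot.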

get_ctext(clickp)¶

Describe (first) region under the cursor after clicking (if any).

Parameters:: clickp (int) – X-coordinate of point under cursor; registered after mouse-click.
Returns:: Text associated with first entry of listed regions under cursor.
Return type:: str
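The relation between textlist and get_ctext can be pictured as a lookup followed by taking the first hit. The sketch below assumes segments are (lower, upper, label) tuples with inclusive boundaries and that an empty string is returned when nothing lies under the cursor; both helper names and these details are illustrative, not the class's actual implementation.

```python
def regions_under(clickp, segments):
    """List labels of all segments that contain the clicked x-coordinate."""
    return [label for lower, upper, label in segments if lower <= clickp <= upper]


def first_region_text(clickp, segments):
    """Text for the first segment under the cursor, or '' if there is none."""
    hits = regions_under(clickp, segments)
    return hits[0] if hits else ""


# Made-up, partly overlapping segments.
segments = [(100, 250, "region A"), (200, 300, "region B"), (400, 450, "region C")]
print(regions_under(220, segments))      # ['region A', 'region B']
print(first_region_text(220, segments))  # region A
print(repr(first_region_text(350, segments)))  # ''
```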

class coalispr.bedgraph_analyze.genom.SegmentTrack(chrnam)¶

Bases: Track

A class to represent a track of countable segments.

chrnam¶

The name of the chromosome for which the track is made

Type:: str

dfs, dfa

Tuple of pandas dataframes, for strand 1 (plus) and strand 2 (minus).

Type:: pandas.DataFrame, pandas.DataFrame

df¶

Pandas dataframe, used for obtaining segment information.

Type:: pandas.DataFrame

chrnam¶

df¶

textlist(clickp)¶

Show list of information for regions under the cursor.

Parameters:: clickp (int) – X-coordinate of point under cursor; registered after mouse-click.
Returns:: List of information associated with segments under the cursor.
Return type:: list

class coalispr.bedgraph_analyze.genom.GTFtrack(chrnam, kind, strand)¶

Bases: Track

A Track class to represent a track with annotation info from a GTF file.

kind¶

The kind of GTF information: for the reference, or for regions with SPECIFIC or UNSPECIFIC reads.

Type:: str

strand¶

The strand with the segment the annotation refers to.

Type:: str

kind¶

strand¶

textlist(clickp)¶

Get list of information for regions under the cursor.

This method overrides that of the parent, using another function.

Parameters:: clickp (int) – X-coordinate of point under cursor; registered after mouse-click.
Returns:: List of information associated with segments under the cursor.
Return type:: list

coalispr.bedgraph_analyze.genom.ref_at_clickpoint(chrnam, clickp, kind)¶

Get reference label for segment with clickpoint.

Parameters:

chrnam (str) – The name of the chromosome for which GTF info is retrieved.
clickp (int) – The click point of the segment under the cursor.
kind (str) – The kind of specified reads for which a GTF could be prepared: (SPECIFIC, UNSPECIFIC, REFERENCE)

Returns:

Lists with gene_ids from the GTF described by kind; one list for each strand of chromosome chrnam.

Return type:

tuple of lists

coalispr.bedgraph_analyze.genom.ref_in_segment(chrnam, segm, kind, ref)¶

Get reference labels for segment.

Parameters:

chrnam (str) – The name of the chromosome for which GTF info is retrieved.
segm ((int,int)) – The segment to check.
kind (str) – The kind of specified reads for which a GTF could be prepared: (SPECIFIC, UNSPECIFIC)
ref (bool) – Include general REFERENCE GTF for annotations (slow). 0: No; 1: Yes.

Returns:

Lists with gene_ids from the GTF described by kind; one list for each strand of chromosome chrnam.

Return type:

tuple of lists
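Both lookups, by click point and by segment, amount to an overlap test against annotated features, with results split per strand. The following sketch assumes features are (gene_id, strand, start, end) tuples with inclusive coordinates, as in GTF; the helper `ids_in_segment`, the gene names and the way the optional reference features are merged are assumptions for illustration. A click point can be treated as the segment (p, p).

```python
def ids_in_segment(features, segm, reference=None, ref=False):
    """Return (plus_ids, minus_ids) for features overlapping segm (hypothetical)."""
    lo, hi = segm
    pool = list(features)
    if ref and reference:
        pool += list(reference)  # optionally include general reference annotations
    plus, minus = [], []
    for gene_id, strand, start, end in pool:
        if start <= hi and end >= lo:  # inclusive interval overlap
            (plus if strand == "+" else minus).append(gene_id)
    return plus, minus


# Made-up annotations.
specific = [("geneA", "+", 100, 250), ("geneB", "-", 200, 300)]
reference = [("geneC", "+", 240, 500)]

print(ids_in_segment(specific, (220, 260)))                       # (['geneA'], ['geneB'])
print(ids_in_segment(specific, (220, 260), reference, ref=True))  # (['geneA', 'geneC'], ['geneB'])
print(ids_in_segment(specific, (280, 280)))                       # ([], ['geneB'])
```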