coalispr.bedgraph_analyze.genom¶

This module produces genome descriptors used for parsing and imaging.

Attributes¶

logger

Functions¶

`chroms`()	Lists chromosome names.
`get_lengths`()	Gives a dict of chromosome lengths.
`smallest2chroms`()	For test counting, find the 2 smallest normal chromosomes (for H99 and mouse data).
`create_genom_indexes`([do_interval])	Generates range or interval indexes and a chromosome list in one go.
`get_genom_indexes`()	Return chromosome range indexes.
`get_temp_genom_indexes`()	Create temporary files for indexes of large genomes to manage memory usage.
`chr_test`()	Get a chromosome name as found in original gtf/bedgraphs.

Module Contents¶

coalispr.bedgraph_analyze.genom.logger¶

coalispr.bedgraph_analyze.genom.chroms()¶

Lists chromosome names.

Returns:

A list of strings referring to numbers/names of all chromosomes that form the reference genome used for mapping cDNA reads.

The returned list is created by create_genom_indexes.

Return type:

list

coalispr.bedgraph_analyze.genom.get_lengths()¶

Gives a dict of chromosome lengths.

Notes

For example, this dict of chromosome names vs. their lengths is returned for EXP jec21.

{
'1': 2300533, '2': 1632307, '3': 2105742, '4': 1783081, '5': 1507550,
'6': 1438950, '7': 1347793, '8': 1194300, '9': 1178688, '10': 1085720,
'11': 1019846, '12': 906719, '13': 787999, '14': 762694,
ADD_GDNA: int(length gdna), CHRXTRA: int(length chrxtra),
}

The lengths file can be extended with artificial chromosomes. ADD_GDNA comprises known features not (yet) incorporated in the reference genome annotation. CHRXTRA stands for exogenous DNA and collates features (possibly) added by gene modification, such as the sequences of selectable markers, plasmids, promoters, terminators, CRISPR/Cas components or cleavage guides: basically everything that, as far as is known, is not present in a natural wild-type cell. ADD_GDNA is counted as wild-type sequence, while CHRXTRA counts are kept apart but remain traceable and can be visualized.

Returns: A dict with lengths of chromosomes after parsing lengths files.
Return type: {str: int}
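As a quick illustration of what can be read off such a dict, the sketch below copies the EXP jec21 values from the example above into a plain dict and derives the total genome size. It does not call `get_lengths` itself; the optional ADD_GDNA and CHRXTRA entries are left out and the variable names are only illustrative.

```python
# Chromosome lengths copied from the EXP jec21 example above
# (optional ADD_GDNA / CHRXTRA entries omitted).
lengths = {
    '1': 2300533, '2': 1632307, '3': 2105742, '4': 1783081, '5': 1507550,
    '6': 1438950, '7': 1347793, '8': 1194300, '9': 1178688, '10': 1085720,
    '11': 1019846, '12': 906719, '13': 787999, '14': 762694,
}

# Total genome size in base pairs.
genome_size = sum(lengths.values())

# Name of the longest chromosome.
longest = max(lengths, key=lengths.get)

print(genome_size)  # 19051922
print(longest)      # 1
```

The total matches the ~19 Mbp figure quoted for jec21 in the notes on create_genom_indexes.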

coalispr.bedgraph_analyze.genom.smallest2chroms()¶

For test counting, find the 2 smallest normal chromosomes (for H99 and mouse data).
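One way to pick the two smallest chromosomes from a lengths dict is sketched below. This is not the module's implementation: the helper `two_smallest`, its `skip` argument and the entry named 'xtra' are hypothetical, and "normal" is assumed here to mean "not one of the artificial extra chromosomes".

```python
def two_smallest(lengths, skip=()):
    """Return the names of the two shortest chromosomes not listed in skip."""
    normal = {name: size for name, size in lengths.items() if name not in skip}
    # Sort chromosome names by length, shortest first, and keep two.
    return sorted(normal, key=normal.get)[:2]


# Subset of the jec21 lengths plus a hypothetical artificial chromosome.
lengths = {'11': 1019846, '12': 906719, '13': 787999, '14': 762694,
           'xtra': 12000}

smallest = two_smallest(lengths, skip=('xtra',))
print(smallest)  # ['14', '13']
```

Without the `skip` argument the short artificial chromosome would be selected first, which is why it is filtered out before sorting.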

coalispr.bedgraph_analyze.genom.create_genom_indexes(do_interval=False)¶

Generates range or interval indexes and a chromosome list in one go.

Notes

For peak comparisons all reads documented in a bedgraph as fragments (via start, end) are gathered in bins of size BINSTEP.

Binning relies on an existing index. Bedgraph files only come with fragment entries. For comparison, they need to have a common interval index.

This function generates this common index; it creates an interval index for each chromosome from pd.interval_range. It splits up the whole genome into bins of size BINSTEP.

Bedgraph files are then reindexed on this common index. Because the index is used frequently, a copy is kept in memory.
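The effect of moving from fragment entries to a common bin index can be sketched in plain Python. The function `bin_fragments` and the toy fragments are hypothetical, and the rule used here (a fragment's value is added to every bin it overlaps) is an assumption made for illustration; the module itself works with pandas indexes and may aggregate differently.

```python
from collections import defaultdict

BINSTEP = 50  # bin size used in the example for this function


def bin_fragments(fragments, binstep=BINSTEP):
    """Sum fragment values per bin, keyed by the start of each bin.

    fragments: iterable of (start, end, value) with half-open [start, end).
    Assumption: a fragment contributes its value to every bin it overlaps.
    """
    binned = defaultdict(float)
    for start, end, value in fragments:
        first = (start // binstep) * binstep      # bin holding the first base
        last = ((end - 1) // binstep) * binstep   # bin holding the last base
        for bin_start in range(first, last + binstep, binstep):
            binned[bin_start] += value
    return dict(binned)


# Toy bedgraph fragments on one chromosome: (start, end, value).
fragments = [(10, 40, 2.0), (45, 120, 1.0), (130, 140, 3.0)]
binned = bin_fragments(fragments)
print(binned)  # {0: 3.0, 50: 1.0, 100: 4.0}
```

After binning, two samples with different fragment boundaries share the same keys (0, 50, 100, ...) and can be compared bin by bin.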

Idea from https://pbpython.com/pandas-qcut-cut.html

Create indexes to use. For EXP jec21, with a BINSTEP of 50, a genome of ~19 Mbp (million base pairs; 19051922 bp) is divided into bins, yielding an index of 381031 lines:

interval ranges¶
	chr	range
0	1	[0, 50)	<-- start interval
1	1	[50, 100)
2	1	[100, 150)
…
46008	1	[2300400, 2300450)
46009	1	[2300450, 2300500)	<-- end chromosome 1
46010	2	[0, 50)	<-- start interval
46011	2	[50, 100)
…
…
365776	13	[787850, 787900)
365777	13	[787900, 787950)	<-- end chromosome 13
365778	14	[0, 50)	<-- start interval
365779	14	[50, 100)
…
381029	14	[762550, 762600)
381030	14	[762600, 762650)	<-- end chromosome 14

Parameters: do_interval (bool, default: False) – If True, return interval ranges (needed for binning bedgraphs). The default creates range indexes for chromosomes: each chromosome name is linked to the section of the third column covering that chromosome, with only the start value of each interval (0, 50, 100, ...).
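The difference between the two index forms, and the line count quoted above, can be checked with a dependency-free sketch. The helpers `range_index` and `interval_index` are hypothetical stand-ins for the pd.interval_range based indexes that the module builds. Dropping the incomplete bin at the end of each chromosome is an assumption inferred from the table, where chromosome 1 (2300533 bp) ends at [2300450, 2300500).

```python
BINSTEP = 50

# Chromosome lengths from the EXP jec21 example.
lengths = {
    '1': 2300533, '2': 1632307, '3': 2105742, '4': 1783081, '5': 1507550,
    '6': 1438950, '7': 1347793, '8': 1194300, '9': 1178688, '10': 1085720,
    '11': 1019846, '12': 906719, '13': 787999, '14': 762694,
}


def range_index(length, binstep=BINSTEP):
    """Start values (0, 50, 100, ...) of all complete bins on a chromosome."""
    return range(0, (length // binstep) * binstep, binstep)


def interval_index(length, binstep=BINSTEP):
    """Half-open (start, end) pairs for all complete bins on a chromosome."""
    return [(start, start + binstep) for start in range_index(length, binstep)]


# Range form: chromosome name linked to the start values of its bins.
range_indexes = {chrom: range_index(size) for chrom, size in lengths.items()}

total_bins = sum(len(index) for index in range_indexes.values())
print(total_bins)                          # 381031
print(len(range_indexes['1']))             # 46010
print(interval_index(lengths['14'])[-1])   # (762600, 762650)
```

The range form stores only bin starts, which is lighter to keep in memory; the interval form carries both boundaries, as needed when fragments are assigned to bins.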

coalispr.bedgraph_analyze.genom.get_genom_indexes()¶

Return chromosome range indexes.

Returns: A dict of chromosome names with range indexes.
Return type: dict

coalispr.bedgraph_analyze.genom.get_temp_genom_indexes()¶

Create temporary files for indexes of large genomes to manage memory usage. See https://pythonspeed.com/articles/faster-multiprocessing-pickle/
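The general idea of parking a large index in a temporary file can be sketched with the standard library. This is only an illustration under stated assumptions, not how the module does it: the helpers `dump_index` and `load_index`, the toy index and the use of one pickle file are all hypothetical.

```python
import os
import pickle
import tempfile


def dump_index(index):
    """Pickle an index to a temporary file and return the file path."""
    with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as handle:
        pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)
        return handle.name


def load_index(path):
    """Read a pickled index back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy index: chromosome name linked to the start values of its bins.
index = {'1': list(range(0, 500, 50))}

path = dump_index(index)     # only this short path needs to be passed around
restored = load_index(path)  # e.g. inside a worker process
os.remove(path)              # clean up the temporary file when done

print(restored == index)  # True
```

Handing a file path to worker processes, rather than the index object itself, avoids re-serializing a large structure for every task.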

coalispr.bedgraph_analyze.genom.chr_test()¶

Get a chromosome name as found in original gtf/bedgraphs.

Returns: Name of first chromosome in the genome.
Return type: str