coalispr.bedgraph_analyze.genom

This module produces genome descriptors used for parsing and imaging.

Attributes

Functions

chroms()

Lists chromosome names.

get_lengths()

Gives a dict of chromosome lengths.

smallest2chroms()

For test counting, find the 2 smallest normal chromosomes.

create_genom_indexes([do_interval])

Generates range or interval indexes and a chromosome list in one go.

get_genom_indexes()

Return chromosome range indexes.

get_temp_genom_indexes()

Create temporary files for indexes of large genomes to manage memory usage.

chr_test()

Get a chromosome name as found in original gtf/bedgraphs.

Module Contents

coalispr.bedgraph_analyze.genom.logger

coalispr.bedgraph_analyze.genom.chroms()

Lists chromosome names.

Returns:

A list of strings referring to numbers/names of all chromosomes that form the reference genome used for mapping cDNA reads.

The returned list is created by create_genom_indexes.

Return type:

list

coalispr.bedgraph_analyze.genom.get_lengths()

Gives a dict of chromosome lengths.

Notes

For example, for EXP jec21 the returned dict maps chromosome names to their lengths:

{
'1': 2300533, '2': 1632307, '3': 2105742, '4': 1783081, '5': 1507550,
'6': 1438950, '7': 1347793, '8': 1194300, '9': 1178688, '10': 1085720,
'11': 1019846, '12': 906719, '13': 787999, '14': 762694,
**ADD_GDNA**: int(length gdna), **CHRXTRA**: int(length chrxtra),
}

The lengths file can be extended with an artificial chromosome comprising known features not (yet) incorporated in the reference genome annotation (ADD_GDNA) or for exogenous DNA (CHRXTRA), collating features (possibly) added by gene modification, such as the sequences of selectable markers, plasmids, promoters, terminators, CRISPR/Cas components or cleavage guides: basically everything not present in a natural wild-type cell as far as is known. ADD_GDNA will be counted as a wild-type sequence, while CHRXTRA counts will be kept apart but remain traceable and can be visualized.

Returns:

dict – A dict with lengths of chromosomes after parsing lengths files.

Return type:

{str: int}
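As an illustration of the return type, the sketch below parses a lengths file into a {str: int} dict. The file layout (one tab-separated "name, length" pair per line) and the helper name parse_lengths are assumptions made for this example; they are not taken from coalispr itself.

```python
import io


def parse_lengths(handle):
    """Read 'name<TAB>length' lines into a {str: int} dict (assumed layout)."""
    lengths = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)  # chromosome names stay strings
    return lengths


# Hypothetical file content; the three lengths are copied from the dict above.
example = io.StringIO("1\t2300533\n2\t1632307\n14\t762694\n")
lengths = parse_lengths(example)
```

Keeping the names as strings means numeric and non-numeric chromosome names (such as an added ADD_GDNA or CHRXTRA entry) can share one dict.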

coalispr.bedgraph_analyze.genom.smallest2chroms()

For test counting, find 2 normal chromosomes that are the smallest (for H99 and mouse data).
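A minimal sketch of the selection, using the jec21 lengths shown under get_lengths() as sample data. The helper two_smallest and its exclude argument are hypothetical; in particular, treating "normal" as "not an added chromosome such as ADD_GDNA or CHRXTRA" is an assumption.

```python
# Chromosome lengths copied from the get_lengths() example for EXP jec21.
LENGTHS = {
    "1": 2300533, "2": 1632307, "3": 2105742, "4": 1783081, "5": 1507550,
    "6": 1438950, "7": 1347793, "8": 1194300, "9": 1178688, "10": 1085720,
    "11": 1019846, "12": 906719, "13": 787999, "14": 762694,
}


def two_smallest(lengths, exclude=()):
    """Return the names of the two shortest chromosomes not in `exclude`."""
    normal = {name: n for name, n in lengths.items() if name not in exclude}
    return sorted(normal, key=normal.get)[:2]


smallest = two_smallest(LENGTHS)

# An artificial extra chromosome (hypothetical name) is ignored when excluded.
with_extra = dict(LENGTHS, extra=5000)
smallest_normal = two_smallest(with_extra, exclude=("extra",))
```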

coalispr.bedgraph_analyze.genom.create_genom_indexes(do_interval=False)

Generates range or interval indexes and a chromosome list in one go.

Notes

For peak comparisons all reads documented in a bedgraph as fragments (via start, end) are gathered in bins of size BINSTEP.

Binning relies on an existing index. Bedgraph files only come with fragment entries. For comparison, they need to have a common interval index.

This function generates this common index; it creates an interval index for each chromosome from pd.interval_range. It splits up the whole genome into bins of size BINSTEP.

Then, bedgraph files are reindexed on this common index. Because the index is used frequently, a copy is kept in memory.

Idea from https://pbpython.com/pandas-qcut-cut.html
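The idea of mapping fragments onto a shared set of bins can be sketched in plain Python as below. This is not coalispr's implementation: the helper bin_fragments is hypothetical, and adding a fragment's value to every bin it overlaps is just one simple convention chosen for illustration.

```python
BINSTEP = 50  # bin size, as in the notes above


def bin_fragments(fragments, chrom_length, binstep=BINSTEP):
    """Map (start, end, value) fragments onto a common set of bin starts."""
    # Common index: start positions of all complete bins on the chromosome.
    starts = range(0, (chrom_length // binstep) * binstep, binstep)
    binned = dict.fromkeys(starts, 0.0)
    for start, end, value in fragments:
        # Visit every bin that the half-open fragment [start, end) overlaps.
        first = (start // binstep) * binstep
        for bin_start in range(first, end, binstep):
            if bin_start in binned:
                binned[bin_start] += value
    return binned


# Two made-up samples on a 200 bp toy chromosome.
sample_a = bin_fragments([(3, 28, 2.0), (60, 120, 5.0)], chrom_length=200)
sample_b = bin_fragments([(140, 160, 1.0)], chrom_length=200)
```

Both results share the same keys (0, 50, 100, 150), so the samples can be compared bin by bin even though their original fragments do not line up.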

Create indexes to use. With a BINSTEP of 50, a genome of ~19 Mbp (millions of base pairs, 19051922 bp) is checked in the case of EXP jec21, yielding an index of 381031 lines:

interval ranges

          chr   range
0         1     [0, 50)                <-- start interval
1         1     [50, 100)
2         1     [100, 150)
...
46008     1     [2300400, 2300450)
46009     1     [2300450, 2300500)     <-- end chromosome 1
46010     2     [0, 50)                <-- start interval
46011     2     [50, 100)
...
365776    13    [787850, 787900)
365777    13    [787900, 787950)       <-- end chromosome 13
365778    14    [0, 50)                <-- start interval
365779    14    [50, 100)
...
381029    14    [762550, 762600)
381030    14    [762600, 762650)       <-- end chromosome 14
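The index above can be reproduced with a short pandas sketch built on pd.interval_range, which the notes name as the underlying call. The helper interval_table and the choice to keep only complete bins are assumptions; the latter is inferred from the last range shown for each chromosome.

```python
import pandas as pd

BINSTEP = 50

# Chromosome lengths copied from the get_lengths() example for EXP jec21.
LENGTHS = {
    "1": 2300533, "2": 1632307, "3": 2105742, "4": 1783081, "5": 1507550,
    "6": 1438950, "7": 1347793, "8": 1194300, "9": 1178688, "10": 1085720,
    "11": 1019846, "12": 906719, "13": 787999, "14": 762694,
}


def interval_table(lengths, binstep=BINSTEP):
    """Build one (chr, range) table with left-closed bins per chromosome."""
    frames = []
    for chrom, length in lengths.items():
        # Keep complete bins only; the trailing partial bin is dropped.
        last = (length // binstep) * binstep
        ranges = pd.interval_range(start=0, end=last, freq=binstep,
                                   closed="left")
        frames.append(pd.DataFrame({"chr": chrom, "range": ranges}))
    return pd.concat(frames, ignore_index=True)


table = interval_table(LENGTHS)
n_rows = len(table)  # 381031 rows for these lengths
```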

Parameters:

do_interval (bool (default: False)) – If True, return interval ranges (needed for binning bedgraphs). The default creates range indexes for chromosomes: each chromosome name is linked to the section of the range column (the third column in the table above) covering that chromosome, with only the start value of each interval (0, 50, 100, ...).
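For the default case, a dict of per-chromosome start values can be sketched without pandas. The helper range_indexes is hypothetical, and representing each index as a Python range is an assumption made to keep the example light; the actual index objects may differ.

```python
BINSTEP = 50

# Three of the jec21 chromosome lengths from the get_lengths() example.
LENGTHS = {"1": 2300533, "13": 787999, "14": 762694}


def range_indexes(lengths, binstep=BINSTEP):
    """Link each chromosome name to the start values of its complete bins."""
    return {
        chrom: range(0, (length // binstep) * binstep, binstep)
        for chrom, length in lengths.items()
    }


indexes = range_indexes(LENGTHS)
first_starts = list(indexes["1"][:3])  # [0, 50, 100]
last_start_14 = indexes["14"][-1]      # 762600, start of [762600, 762650)
```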

coalispr.bedgraph_analyze.genom.get_genom_indexes()

Return chromosome range indexes.

Returns:

A dict of chromosome names with range indexes.

Return type:

dict

coalispr.bedgraph_analyze.genom.get_temp_genom_indexes()

Create temporary files for indexes of large genomes to manage memory usage. See https://pythonspeed.com/articles/faster-multiprocessing-pickle/
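The general technique, writing each index to a temporary file so that workers can load only what they need instead of receiving large pickled objects, can be sketched with the standard library. The helpers dump_indexes and load_index and the file naming are hypothetical and do not describe how coalispr stores its indexes.

```python
import pickle
import tempfile
from pathlib import Path


def dump_indexes(indexes, directory):
    """Pickle each chromosome index to its own file; return the paths."""
    paths = {}
    for chrom, index in indexes.items():
        path = Path(directory) / f"index_{chrom}.pkl"
        with open(path, "wb") as handle:
            pickle.dump(index, handle, protocol=pickle.HIGHEST_PROTOCOL)
        paths[chrom] = path
    return paths


def load_index(path):
    """Load a single chromosome index back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmpdir:
    # Toy indexes: bin start values for two chromosomes.
    indexes = {"1": range(0, 2300500, 50), "2": range(0, 1632300, 50)}
    paths = dump_indexes(indexes, tmpdir)
    # A worker would receive only a small path and load the index it needs.
    restored = load_index(paths["2"])
```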

coalispr.bedgraph_analyze.genom.chr_test()

Get a chromosome name as found in original gtf/bedgraphs.

Returns:

Name of first chromosome in the genome.

Return type:

str