coalispr.bedgraph_analyze.genom
This module produces genome descriptors used for parsing and imaging.
Attributes
Functions
chroms() | Lists chromosome names.
get_lengths() | Gives a dict of chromosome lengths.
smallest2chroms() | For test counting, find 2 normal chromosomes that are the smallest
create_genom_indexes([do_interval]) | Generates range or interval indexes and a chromosome list in one go.
get_genom_indexes() | Return chromosome range indexes.
get_temp_genom_indexes() | Create temporary files for indexes of large genomes to manage memory
chr_test() | Get a chromosome name as found in original gtf/bedgraphs.
Module Contents
- coalispr.bedgraph_analyze.genom.logger
- coalispr.bedgraph_analyze.genom.chroms()
Lists chromosome names.
- Returns:
A list of strings referring to numbers/names of all chromosomes that form the reference genome used for mapping cDNA reads.
The returned list is created by create_genom_indexes.
- Return type:
list
- coalispr.bedgraph_analyze.genom.get_lengths()
Gives a dict of chromosome lengths.
Notes
For example, this dict of chromosome names vs. their lengths is returned for EXP jec21:
    {
    '1': 2300533, '2': 1632307, '3': 2105742, '4': 1783081,
    '5': 1507550, '6': 1438950, '7': 1347793, '8': 1194300,
    '9': 1178688, '10': 1085720, '11': 1019846, '12': 906719,
    '13': 787999, '14': 762694,
    **ADD_GDNA**: int(length gdna),
    **CHRXTRA**: int(length chrxtra),
    }
The lengths file can be extended with artificial chromosomes: one comprising known features not (yet) incorporated in the reference genome annotation (ADD_GDNA), and one for exogenous DNA (CHRXTRA) that collates features (possibly) added by gene modification, such as the sequences of selectable markers, plasmids, promoters, terminators, CRISPR/Cas components or cleavage guides; basically everything not present in a natural wild-type cell as far as is known. ADD_GDNA will be counted as wild-type sequence, while CHRXTRA counts will be kept apart but remain traceable and can be visualized.
- Returns:
dict – A dict with lengths of chromosomes after parsing lengths files.
- Return type:
{str: int}
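As an illustration of the return value, the sketch below parses a lengths file into a {str: int} dict. The file layout (one tab-separated "name, length" pair per line) and the helper name read_lengths are assumptions made for this example; they are not taken from coalispr itself.

```python
import io


def read_lengths(handle):
    """Parse 'name<TAB>length' lines into a {str: int} dict (hypothetical helper)."""
    lengths = {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# A small in-memory stand-in for a lengths file (assumed layout).
example = io.StringIO("1\t2300533\n2\t1632307\n14\t762694\n")
chrom_lengths = read_lengths(example)
```

Chromosome names stay strings, so numeric and named chromosomes can share one dict.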
- coalispr.bedgraph_analyze.genom.smallest2chroms()
For test counting, find 2 normal chromosomes that are the smallest (for H99 and mouse data).
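One way to pick the two smallest chromosomes from a lengths dict is sketched below. The helper name two_smallest, the "extra" key and the idea of excluding artificial chromosomes by name are assumptions for illustration; the actual selection criteria of smallest2chroms are not documented here.

```python
def two_smallest(lengths, exclude=()):
    """Return the names of the two shortest chromosomes, skipping excluded names."""
    regular = {name: size for name, size in lengths.items() if name not in exclude}
    # Sort chromosome names by length, shortest first.
    return sorted(regular, key=regular.get)[:2]


# Lengths as listed for jec21, plus a made-up artificial chromosome.
lengths = {
    "1": 2300533, "2": 1632307, "3": 2105742, "4": 1783081,
    "5": 1507550, "6": 1438950, "7": 1347793, "8": 1194300,
    "9": 1178688, "10": 1085720, "11": 1019846, "12": 906719,
    "13": 787999, "14": 762694,
    "extra": 5000,  # hypothetical stand-in for an added chromosome
}
smallest = two_smallest(lengths, exclude=("extra",))
```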
- coalispr.bedgraph_analyze.genom.create_genom_indexes(do_interval=False)
Generates range or interval indexes and a chromosome list in one go.
Notes
For peak comparisons all reads documented in a bedgraph as fragments (via start, end) are gathered in bins of size BINSTEP.
Binning relies on an existing index. Bedgraph files only come with fragment entries. For comparison, they need to have a common interval index.
This function generates this common index; it creates an interval index for each chromosome from pd.interval_range. It splits up the whole genome into bins of size BINSTEP.
Then, bedgraph files are reindexed on this common index, which is frequently used, so that a copy is stored in memory.
Idea from https://pbpython.com/pandas-qcut-cut.html
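The binning idea can be sketched with pd.cut on a toy chromosome. This is a minimal illustration, not the package's own code: the column names, the toy data and the choice to assign each fragment to a bin by its start position only are assumptions.

```python
import pandas as pd

BINSTEP = 50

# Common interval index for a toy chromosome: [0, 50), [50, 100), ...
bins = pd.interval_range(start=0, end=200, freq=BINSTEP, closed="left")

# Toy bedgraph fragments (start, end, score); values are made up.
fragments = pd.DataFrame(
    {
        "start": [10, 60, 70, 160],
        "end": [30, 90, 95, 180],
        "score": [2.0, 1.0, 3.0, 5.0],
    }
)

# Assign each fragment to the bin containing its start position
# (a simplification: fragments spanning two bins are not split).
fragments["bin"] = pd.cut(fragments["start"], bins)

# Sum scores per bin; observed=False keeps empty bins in the result.
binned = fragments.groupby("bin", observed=False)["score"].sum()
```

Because every sample is summarised on the same bins, the resulting series share one index and can be compared directly.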
Create indexes to use. With a BINSTEP of 50, a genome of ~19 Mbp (millions of base pairs, 19051922 bp) is checked in the case of EXP jec21, yielding an index of 381031 lines:
interval ranges
index  | chr | range              | note
0      | 1   | [0, 50)            | <-- start interval
1      | 1   | [50, 100)          |
2      | 1   | [100, 150)         |
…      |     |                    |
46008  | 1   | [2300400, 2300450) |
46009  | 1   | [2300450, 2300500) | <-- end chromosome 1
46010  | 2   | [0, 50)            | <-- start interval
46011  | 2   | [50, 100)          |
…      | …   |                    |
365776 | 13  | [787850, 787900)   |
365777 | 13  | [787900, 787950)   | <-- end chromosome 13
365778 | 14  | [0, 50)            | <-- start interval
365779 | 14  | [50, 100)          |
…      |     |                    |
381029 | 14  | [762550, 762600)   |
381030 | 14  | [762600, 762650)   | <-- end chromosome 14
- Parameters:
do_interval (bool (default: False)) – If True, return interval ranges (needed for binning bedgraphs). The default creates range indexes for chromosomes: each chromosome name is linked to the section of the range column (third column of the table above) covering that chromosome, keeping only the start value of each interval (0, 50, 100, ...).
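The two kinds of index can be sketched as follows. The helper names interval_index and range_index are hypothetical, and dropping the incomplete final bin is an assumption inferred from the table above (chromosome 1, 2300533 bp, ends at [2300450, 2300500)); the real function may differ in detail.

```python
import pandas as pd

BINSTEP = 50

# Chromosome lengths as listed for jec21.
lengths = {
    "1": 2300533, "2": 1632307, "3": 2105742, "4": 1783081,
    "5": 1507550, "6": 1438950, "7": 1347793, "8": 1194300,
    "9": 1178688, "10": 1085720, "11": 1019846, "12": 906719,
    "13": 787999, "14": 762694,
}


def interval_index(length, step=BINSTEP):
    """Left-closed bins [0, step), [step, 2*step), ... covering full bins only."""
    end = length - length % step
    return pd.interval_range(start=0, end=end, freq=step, closed="left")


def range_index(length, step=BINSTEP):
    """Start positions of the same bins: 0, step, 2*step, ..."""
    return range(0, length - length % step, step)


intervals = {chrom: interval_index(size) for chrom, size in lengths.items()}
ranges = {chrom: range_index(size) for chrom, size in lengths.items()}

# Total number of bins over all chromosomes.
total_bins = sum(len(idx) for idx in intervals.values())
```

Under these assumptions the total comes to 381031 bins, matching the index size quoted above.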
- coalispr.bedgraph_analyze.genom.get_genom_indexes()
Return chromosome range indexes.
- Returns:
A dict of chromosome names with range indexes.
- Return type:
dict
- coalispr.bedgraph_analyze.genom.get_temp_genom_indexes()
Create temporary files for indexes of large genomes to manage memory usage. See https://pythonspeed.com/articles/faster-multiprocessing-pickle/
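The general pattern of parking a large index in a temporary file, so that only a path needs to be passed around, can be sketched as below. This is an assumption about the approach, not the function's actual implementation; dump_index and load_index are hypothetical helpers.

```python
import os
import pickle
import tempfile


def dump_index(index):
    """Pickle an index to a temporary file and return the file path."""
    with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
        pickle.dump(index, tmp, protocol=pickle.HIGHEST_PROTOCOL)
    return tmp.name


def load_index(path):
    """Load a pickled index back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy range index for one chromosome (bin starts 0, 50, 100, ...).
index = {"1": range(0, 2300500, 50)}
path = dump_index(index)
restored = load_index(path)

# Clean up the temporary file once it is no longer needed.
os.remove(path)
```

Each worker process can then load the index from the path itself instead of receiving a pickled copy of the whole object.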
- coalispr.bedgraph_analyze.genom.chr_test()
Get a chromosome name as found in original gtf/bedgraphs.
- Returns:
Name of first chromosome in the genome.
- Return type:
str