coalispr.bedgraph_analyze.genom¶
This module produces genome descriptors used for parsing and imaging.
Attributes¶
Classes¶
- Track: A class to represent a track of segments.
- SegmentTrack: A class to represent a track of countable segments.
- GTFtrack: A Track class to represent a track with annotation info from a GTF file.
Functions¶
- chroms: Lists chromosome names.
- get_lengths: Gives a dict of chromosome lengths.
- smallest2chroms
- create_genom_indexes: Generates range or interval indexes and a chromosome list.
- get_genom_indexes: Return indexes to use.
- chr_test: Get a chromosome name as found in original gtf/bedgraphs.
- retrieve_chr_regions_from_tsv: Read specified regions for a chromosome from tsv.
- retrieve_all_chr_regions_from_tsv: Read all specified regions for a chromosome from tsv files.
- ref_at_clickpoint: Get reference label for segment with clickpoint.
- ref_in_segment: Get reference labels for segment.
Module Contents¶
- coalispr.bedgraph_analyze.genom.logger¶
- coalispr.bedgraph_analyze.genom.chroms()¶
Lists chromosome names.
- Returns:
A list of strings referring to numbers/names of all chromosomes that form the reference genome used for mapping cDNA reads.
The returned list is created by create_genom_indexes.
- Return type:
list
- coalispr.bedgraph_analyze.genom.get_lengths()¶
Gives a dict of chromosome lengths.
Notes
For example, this dict of chromosome names vs. their lengths is returned for EXP jec21.
{
  '1': 2300533, '2': 1632307, '3': 2105742, '4': 1783081, '5': 1507550,
  '6': 1438950, '7': 1347793, '8': 1194300, '9': 1178688, '10': 1085720,
  '11': 1019846, '12': 906719, '13': 787999, '14': 762694,
  **CHRXTRA**: int(length chrxtra),
}
The lengths file can be extended with an artificial chromosome (CHRXTRA) collating features (possibly) added by gene modification, like the sequences of selectable markers, plasmids, promoters, terminators, CRISPR/Cas components or cleavage-guides. It could also comprise known features not (yet) incorporated in the reference genome annotation.
- Returns:
dict – A dict with lengths of chromosomes after parsing lengths files.
- Return type:
{str: int}
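As a quick consistency check on the example above, the documented lengths can be summed to get the total genome size. This is only a sketch that copies the dict from the Notes (without the optional CHRXTRA entry); it does not call coalispr itself.

```python
# Chromosome lengths as documented above for EXP jec21 (CHRXTRA omitted).
lengths = {
    "1": 2300533, "2": 1632307, "3": 2105742, "4": 1783081, "5": 1507550,
    "6": 1438950, "7": 1347793, "8": 1194300, "9": 1178688, "10": 1085720,
    "11": 1019846, "12": 906719, "13": 787999, "14": 762694,
}

# Total number of base pairs covered by the 14 chromosomes.
total_bp = sum(lengths.values())
print(total_bp)  # 19051922
```

The total agrees with the ~19 Mbp (19051922 bp) quoted for jec21 under get_genom_indexes.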
- coalispr.bedgraph_analyze.genom.smallest2chroms()¶
- coalispr.bedgraph_analyze.genom.create_genom_indexes(do_interval=False)¶
Generates range or interval indexes and a chromosome list.
Notes
For peak comparisons, all reads documented in a bedgraph as fragments (via start, end) are gathered in bins of size BINSTEP.
Binning relies on an existing index. Bedgraph files only come with fragment entries; to be compared, they need a common interval index.
This function generates this common index: it creates an interval index for each chromosome with pd.interval_range, splitting up the whole genome into bins of size BINSTEP. Bedgraph files are then reindexed on this common index. Because the index is used frequently, a copy is stored in memory.
Idea from https://pbpython.com/pandas-qcut-cut.html
- Parameters:
do_interval (bool (default: False)) – If True, create interval index (needed for binning bedgraphs). If False, create range indexes for chromosomes.
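The binning idea in the Notes can be sketched with pd.interval_range as follows. The helper name make_interval_indexes is hypothetical, and rounding each chromosome length down to a whole number of bins is an assumption (it is consistent with the row numbers listed for get_genom_indexes); the actual implementation may differ.

```python
import pandas as pd

BINSTEP = 50  # bin size used in the documented jec21 example


def make_interval_indexes(lengths, binstep=BINSTEP):
    """Hypothetical helper: one left-closed IntervalIndex per chromosome."""
    indexes = {}
    for chrnam, length in lengths.items():
        # Assumption: only whole bins are kept, so round the end down.
        end = (length // binstep) * binstep
        indexes[chrnam] = pd.interval_range(
            start=0, end=end, freq=binstep, closed="left"
        )
    return indexes


indexes = make_interval_indexes({"1": 2300533, "14": 762694})

print(len(indexes["1"]))   # number of bins on chromosome 1
print(indexes["1"][-1])    # last bin on chromosome 1

# Position 125 falls in the third bin, [100, 150).
bin_number = indexes["1"].get_loc(125)
```

Once such an index exists, any fragment coordinate can be mapped to its bin, which is what makes separate bedgraph files comparable.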
- coalispr.bedgraph_analyze.genom.get_genom_indexes(do_interval=False)¶
Return indexes to use. With a BINSTEP of 50, a genome of ~19 Mbp (millions of base pairs, 19051922 bp) is checked in the case of EXP jec21, yielding an index of 381031 lines:
interval ranges
          chr   range
0         1     [0, 50)               <-- start interval
1         1     [50, 100)
2         1     [100, 150)
…
46008     1     [2300400, 2300450)
46009     1     [2300450, 2300500)    <-- end chromosome 1
46010     2     [0, 50)               <-- start interval
46011     2     [50, 100)
…
…
365776    13    [787850, 787900)
365777    13    [787900, 787950)      <-- end chromosome 13
365778    14    [0, 50)               <-- start interval
365779    14    [50, 100)
…
381029    14    [762550, 762600)
381030    14    [762600, 762650)      <-- end chromosome 14
- Parameters:
do_interval (bool (default: False)) – If True, return interval ranges (needed for binning bedgraphs). The default returns range indexes for chromosomes (chromosome name linked to the section of the third column covering that chromosome, with only the start value of each interval: 0, 50, 100,..).
- Returns:
A dict of chromosome names with range indexes or interval ranges.
- Return type:
dict
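The default (range index) variant and the quoted line count can be illustrated with a small sketch. Python's built-in range stands in here for whatever index type the module really returns, and the floor division is an assumption chosen because it reproduces the row numbers in the table above.

```python
BINSTEP = 50

# Lengths as documented for EXP jec21 under get_lengths (CHRXTRA omitted).
lengths = {
    "1": 2300533, "2": 1632307, "3": 2105742, "4": 1783081, "5": 1507550,
    "6": 1438950, "7": 1347793, "8": 1194300, "9": 1178688, "10": 1085720,
    "11": 1019846, "12": 906719, "13": 787999, "14": 762694,
}

# One range per chromosome holding only the start value of each bin.
range_indexes = {
    chrnam: range(0, (length // BINSTEP) * BINSTEP, BINSTEP)
    for chrnam, length in lengths.items()
}

total_bins = sum(len(r) for r in range_indexes.values())

print(list(range_indexes["1"][:3]))  # [0, 50, 100]
print(range_indexes["14"][-1])       # 762600
print(total_bins)                    # 381031
```

Under these assumptions the per-chromosome bin counts add up to the 381031 lines mentioned above.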
- coalispr.bedgraph_analyze.genom.chr_test()¶
Get a chromosome name as found in original gtf/bedgraphs.
- Returns:
Name of first chromosome in the genome.
- Return type:
str
- coalispr.bedgraph_analyze.genom.retrieve_chr_regions_from_tsv(chrnam, tresh=LOG2BG, maincut=UNSPECLOG10, tag=TAG, kind=SPECIFIC, usecols=[LOWR, UPPR, SPAN])¶
Read specified regions for a chromosome from tsv.
Get stored information on upper and ‘lower’ boundaries for regions of contiguous reads. These have previously been obtained by assessing bedgraph data and saved as TSV files by gather_regions functions in module coalispr.bedgraph_analyze.compare.
- Parameters:
chrnam (str) – Name of the chromosome for which the information has been stored.
tresh (int (default: LOG2BG)) – Applied threshold above which values have been taken into account.
maincut (float (default: UNSPECLOG10)) – Used minimal log10 difference between specific and unspecific values.
tag (str (default: TAG)) – Tag to indicate kind of mapped reads analysed: collapsed or uncollapsed
kind (str (default: SPECIFIC)) – The kind of specified reads stored: specific, unspecific or both.
usecols (list) – Columns to keep in returned dataframes
- Returns:
A tuple of pandas dataframes, one for each strand of chromosome chrnam.
- Return type:
pandas.DataFrame, pandas.DataFrame
- coalispr.bedgraph_analyze.genom.retrieve_all_chr_regions_from_tsv(chrnam, tresh=LOG2BG, maincut=UNSPECLOG10, tag=TAG)¶
Read all specified regions for a chromosome from tsv files.
Get stored information on upper and ‘lower’ boundaries for all regions of contiguous reads (previously obtained by assessing bedgraphs and saved as TSV by gather*regions of coalispr.bedgraph_analyze.compare).
- Parameters:
chrnam (str) – Name of the chromosome for which the information has been stored.
tresh (int (default: LOG2BG)) – Applied threshold above which values have been taken into account.
maincut (float (default: UNSPECLOG10)) – Used minimal log10 difference between specific and unspecific values.
tag (str (default: TAG)) – Tag to indicate kind of mapped reads analysed: collapsed or uncollapsed
- Returns:
A tuple of pandas dataframes, one for each strand of chromosome chrnam.
- Return type:
pandas.DataFrame, pandas.DataFrame
- class coalispr.bedgraph_analyze.genom.Track(chrnam)¶
A class to represent a track of segments.
A track provides input for a matplotlib.collections.BrokenBarHCollection used in coalispr.bedgraph_analyze.bedgraph_plotting.
- chrnam¶
The name of the chromosome for which the track is made
- Type:
str
- df1, df2
Tuple of pandas dataframes, for strand 1 (PLUS) and strand 2 (MINUS).
- Type:
pandas.DataFrame, pandas.DataFrame
- df¶
Pandas dataframe, used for obtaining segment information.
- Type:
pandas.DataFrame
- chrnam¶
- df = None¶
- textlist(clickp)¶
Show list of information for regions under the cursor.
- Parameters:
clickp (int) – X-coordinate of point under cursor; registered after mouse-click.
- Returns:
List of information associated with segments under the cursor.
- Return type:
list
- get_segments(df)¶
Return list of segments with information that form the track.
- Parameters:
df (pandas.DataFrame) – A pandas dataframe with segment information
- Returns:
List of lower boundaries and length of segments, parseable by matplotlib.collections.BrokenBarHCollection.
- Return type:
list
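The "lower boundary plus length" format can be sketched as below. The helper segments_from_df and the column names lower and upper are hypothetical; the real dataframes use the module's own column constants.

```python
import pandas as pd


def segments_from_df(df, lower="lower", upper="upper"):
    """Hypothetical helper: one (start, width) tuple per dataframe row."""
    return [(int(lo), int(up - lo)) for lo, up in zip(df[lower], df[upper])]


# Made-up segment boundaries, for illustration only.
df = pd.DataFrame({"lower": [100, 400, 950], "upper": [250, 450, 1200]})

xranges = segments_from_df(df)
print(xranges)  # [(100, 150), (400, 50), (950, 250)]
```

A list in this (xmin, xwidth) form is also what Matplotlib's Axes.broken_barh accepts as its xranges argument.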
- get_ctext(clickp)¶
Describe (first) region under the cursor after clicking (if any).
- Parameters:
clickp (int) – X-coordinate of point under cursor; registered after mouse-click.
- Returns:
Text associated with first entry of listed regions under cursor.
- Return type:
str
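How textlist and get_ctext relate can be sketched as follows: the first collects every region that contains the clicked x-coordinate, the second reports only the first hit. The function names, column names and label format below are assumptions for illustration, not the class's actual code.

```python
import pandas as pd


def regions_under_cursor(df, clickp, lower="lower", upper="upper"):
    """Hypothetical stand-in for textlist: labels of regions containing clickp."""
    hits = df[(df[lower] <= clickp) & (clickp <= df[upper])]
    return [f"{lo}-{up}" for lo, up in zip(hits[lower], hits[upper])]


def first_region_text(df, clickp):
    """Hypothetical stand-in for get_ctext: first label, or '' if none."""
    found = regions_under_cursor(df, clickp)
    return found[0] if found else ""


# Made-up, partly overlapping regions.
df = pd.DataFrame({"lower": [100, 200, 950], "upper": [250, 450, 1200]})

both = regions_under_cursor(df, 220)   # two regions overlap at x = 220
first = first_region_text(df, 220)     # only the first is reported
nothing = first_region_text(df, 500)   # no region under the cursor
```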
- class coalispr.bedgraph_analyze.genom.SegmentTrack(chrnam)¶
Bases: Track
A class to represent a track of countable segments.
- chrnam¶
The name of the chromosome for which the track is made
- Type:
str
- dfs, dfa
Tuple of pandas dataframes, for strand 1 (plus) and strand 2 (minus).
- Type:
pandas.DataFrame, pandas.DataFrame
- df¶
Pandas dataframe, used for obtaining segment information.
- Type:
pandas.DataFrame
- chrnam¶
- df¶
- textlist(clickp)¶
Show list of information for regions under the cursor.
- Parameters:
clickp (int) – X-coordinate of point under cursor; registered after mouse-click.
- Returns:
List of information associated with segments under the cursor.
- Return type:
list
- class coalispr.bedgraph_analyze.genom.GTFtrack(chrnam, kind, strand)¶
Bases: Track
A Track class to represent a track with annotation info from a GTF file.
- kind¶
The kind of GTF information, for reference, or regions with SPECIFIC or UNSPECIFIC reads.
- Type:
str
- strand¶
The strand with the segment the annotation refers to.
- Type:
str
- kind¶
- strand¶
- textlist(clickp)¶
Get list of information for regions under the cursor.
This method overrides that of the parent class, using another function.
- Parameters:
clickp (int) – X-coordinate of point under cursor; registered after mouse-click.
- Returns:
List of information associated with segments under the cursor.
- Return type:
list
- coalispr.bedgraph_analyze.genom.ref_at_clickpoint(chrnam, clickp, kind)¶
Get reference label for segment with clickpoint.
- Parameters:
chrnam (str) – The name of the chromosome for which GTF info is retrieved.
clickp (int) – The click point of the segment under the cursor.
kind (str) – The kind of specified reads for which a GTF could be prepared: (SPECIFIC, UNSPECIFIC, REFERENCE)
- Returns:
Lists with gene_ids from the GTF described by kind; one list for each strand of chromosome chrnam.
- Return type:
tuple of lists
- coalispr.bedgraph_analyze.genom.ref_in_segment(chrnam, segm, kind, ref)¶
Get reference labels for segment.
- Parameters:
chrnam (str) – The name of the chromosome for which GTF info is retrieved.
segm ((int,int)) – The segment to check.
kind (str) – The kind of specified reads for which a GTF could be prepared: (SPECIFIC, UNSPECIFIC)
ref (bool) – Include general REFERENCE GTF for annotations (slow). 0: No; 1: Yes.
- Returns:
Lists with gene_ids from the GTF described by kind; one list for each strand of chromosome chrnam.
- Return type:
tuple of lists
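The shape of the return value (one list of gene_ids per strand) can be illustrated with a plain overlap test. The Feature record, the function gene_ids_in_segment and the gene names are all hypothetical; the real functions read annotations from GTF files selected by kind.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical, minimal view of one GTF annotation."""
    gene_id: str
    start: int
    end: int
    strand: str  # "+" or "-"


def gene_ids_in_segment(features, segm):
    """Return (plus_ids, minus_ids) for features overlapping segm."""
    seg_start, seg_end = segm
    plus, minus = [], []
    for feat in features:
        # Two closed intervals overlap if each starts before the other ends.
        if feat.start <= seg_end and feat.end >= seg_start:
            (plus if feat.strand == "+" else minus).append(feat.gene_id)
    return plus, minus


# Made-up annotations for one chromosome.
features = [
    Feature("geneA", 100, 500, "+"),
    Feature("geneB", 450, 900, "-"),
    Feature("geneC", 2000, 2600, "+"),
]

in_segment = gene_ids_in_segment(features, (400, 600))
at_click = gene_ids_in_segment(features, (480, 480))  # a click as a 1-bp segment
```

Treating a click point as a segment of zero width, as in the last line, is one way to view ref_at_clickpoint as a special case of ref_in_segment; whether the module does this internally is not stated in the source.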