coalispr.bedgraph_analyze.genom

This module produces genome descriptors used for parsing and imaging.

Attributes

Classes

Track

A class to represent a track of segments.

SegmentTrack

A class to represent a track of countable segments.

GTFtrack

A Track class to represent a track with annotation info from a GTF file.

Functions

chroms()

Lists chromosome names.

get_lengths()

Gives a dict of chromosome lengths.

smallest2chroms()

create_genom_indexes([do_interval])

Generates range or interval indexes and a chromosome list.

get_genom_indexes([do_interval])

Return indexes to use.

chr_test()

Get a chromosome name as found in original gtf/bedgraphs.

retrieve_chr_regions_from_tsv(chrnam[, tresh, ...])

Read specified regions for a chromosome from tsv.

retrieve_all_chr_regions_from_tsv(chrnam[, tresh, ...])

Read all specified regions for a chromosome from tsv files.

ref_at_clickpoint(chrnam, clickp, kind)

Get reference label for segment with clickpoint.

ref_in_segment(chrnam, segm, kind, ref)

Get reference labels for segment.

Module Contents

coalispr.bedgraph_analyze.genom.logger
coalispr.bedgraph_analyze.genom.chroms()

Lists chromosome names.

Returns:

A list of strings referring to numbers/names of all chromosomes that form the reference genome used for mapping cDNA reads.

The returned list is created by create_genom_indexes.

Return type:

list

coalispr.bedgraph_analyze.genom.get_lengths()

Gives a dict of chromosome lengths.

Notes

For example, this dict of chromosome names vs. their lengths is returned for EXP jec21.

{
'1': 2300533, '2': 1632307, '3': 2105742, '4': 1783081, '5': 1507550,
'6': 1438950, '7': 1347793, '8': 1194300, '9': 1178688, '10': 1085720,
'11': 1019846, '12': 906719, '13': 787999, '14': 762694,
**CHRXTRA**: int(length chrxtra),
}

The lengths file can be extended with an artificial chromosome (CHRXTRA) collating features (possibly) added by gene modification, like the sequences of selectable markers, plasmids, promoters, terminators, CRISPR/Cas components or cleavage-guides. It could also comprise known features not (yet) incorporated in the reference genome annotation.

Returns:

dict – A dict with lengths of chromosomes after parsing lengths files.

Return type:

{str: int}

coalispr.bedgraph_analyze.genom.smallest2chroms()
coalispr.bedgraph_analyze.genom.create_genom_indexes(do_interval=False)

Generates range or interval indexes and a chromosome list.

Notes

For peak comparisons all reads documented in a bedgraph as fragments (via start, end) are gathered in bins of size BINSTEP.

Binning relies on an existing index. Bedgraph files only come with fragment entries. For comparison, they need to have a common interval index.

This function generates this common index; it creates an interval index for each chromosome from pd.interval_range. It splits up the whole genome into bins of size BINSTEP.

Bedgraph files are then reindexed on this common index. Because the index is used frequently, a copy is kept in memory.

Idea from https://pbpython.com/pandas-qcut-cut.html

Parameters:

do_interval (bool (default: False)) – If True, create interval index (needed for binning bedgraphs). If False, create range indexes for chromosomes.
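
The binning described in the notes can be sketched as follows. This is a minimal illustration, not the module's implementation: build_interval_indexes is a hypothetical helper, BINSTEP is assumed to be 50, and the lengths are passed in as a plain dict. Only whole bins are kept, which matches the chromosome ends shown in the table for get_genom_indexes.

```python
import pandas as pd

BINSTEP = 50  # assumed bin size


def build_interval_indexes(lengths, binstep=BINSTEP):
    """Return {chromosome name: IntervalIndex of left-closed bins}.

    Hypothetical helper for illustration only.
    """
    indexes = {}
    for chrnam, length in lengths.items():
        # Keep whole bins only; a tail shorter than binstep is dropped.
        end = (length // binstep) * binstep
        indexes[chrnam] = pd.interval_range(
            start=0, end=end, freq=binstep, closed="left"
        )
    return indexes


# Two chromosome lengths taken from the jec21 example of get_lengths().
idx = build_interval_indexes({"1": 2300533, "14": 762694})
print(len(idx["1"]), idx["1"][0], idx["1"][-1])
```

Because every bedgraph is reindexed on the same bins, values from different files line up row by row and can be compared directly.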

coalispr.bedgraph_analyze.genom.get_genom_indexes(do_interval=False)

Return indexes to use. With a BINSTEP of 50, a genome of ~19 Mbp (millions of base pairs, 19051922 bp) is checked in the case of EXP jec21, yielding an index of 381031 lines:

interval ranges

          chr   range
0         1     [0, 50)               <--- start interval
1         1     [50, 100)
2         1     [100, 150)
46008     1     [2300400, 2300450)
46009     1     [2300450, 2300500)    <--- end chromosome 1
46010     2     [0, 50)               <--- start interval
46011     2     [50, 100)
365776    13    [787850, 787900)
365777    13    [787900, 787950)      <--- end chromosome 13
365778    14    [0, 50)               <--- start interval
365779    14    [50, 100)
381029    14    [762550, 762600)
381030    14    [762600, 762650)      <--- end chromosome 14

Parameters:

do_interval (bool (default: False)) – If True, return interval ranges (needed for binning bedgraphs). The default returns range indexes for chromosomes (chromosome name linked to the section of the third column covering that chromosome, with only the start value of each interval: 0, 50, 100,..).

Returns:

A dict of chromosome names with range indexes or interval ranges.

Return type:

dict
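
The figures quoted for EXP jec21 can be checked with a little arithmetic. The sketch below uses only the lengths listed under get_lengths() and an assumed BINSTEP of 50; bin_offsets and range_index are hypothetical helpers, not functions of this module. It counts whole bins per chromosome, which reproduces the total of 381031 rows and the row numbers at which chromosomes 2 and 14 start.

```python
BINSTEP = 50  # assumed bin size

# Chromosome lengths for EXP jec21, copied from the get_lengths() notes.
LENGTHS = {
    "1": 2300533, "2": 1632307, "3": 2105742, "4": 1783081, "5": 1507550,
    "6": 1438950, "7": 1347793, "8": 1194300, "9": 1178688, "10": 1085720,
    "11": 1019846, "12": 906719, "13": 787999, "14": 762694,
}


def bin_offsets(lengths, binstep=BINSTEP):
    """Return ({chromosome: first row number}, total number of rows)."""
    offsets = {}
    total = 0
    for chrnam, length in lengths.items():
        offsets[chrnam] = total
        total += length // binstep  # whole bins only
    return offsets, total


def range_index(length, binstep=BINSTEP):
    """Start values (0, 50, 100, ..) of the whole bins of one chromosome."""
    return range(0, (length // binstep) * binstep, binstep)


offsets, total = bin_offsets(LENGTHS)
print(sum(LENGTHS.values()), total)   # 19051922 381031
print(offsets["2"], offsets["14"])    # 46010 365778
print(range_index(LENGTHS["1"])[-1])  # 2300450
```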

coalispr.bedgraph_analyze.genom.chr_test()

Get a chromosome name as found in original gtf/bedgraphs.

Returns:

Name of first chromosome in the genome.

Return type:

str

coalispr.bedgraph_analyze.genom.retrieve_chr_regions_from_tsv(chrnam, tresh=LOG2BG, maincut=UNSPECLOG10, tag=TAG, kind=SPECIFIC, usecols=[LOWR, UPPR, SPAN])

Read specified regions for a chromosome from tsv.

Get stored information on upper and ‘lower’ boundaries for regions of contiguous reads. These have previously been obtained by assessing bedgraph data and saved as TSV files by gather_regions functions in module coalispr.bedgraph_analyze.compare.

Parameters:
  • chrnam (str) – Name of the chromosome for which the information has been stored.

  • tresh (int (default: LOG2BG)) – Applied threshold above which values have been taken into account.

  • maincut (float (default: UNSPECLOG10)) – Used minimal log10 difference between specific and unspecific values.

  • tag (str (default: TAG)) – Tag to indicate kind of mapped reads analysed: collapsed or uncollapsed

  • kind (str (default: SPECIFIC)) – The kind of specified reads stored: specific, unspecific or both.

  • usecols (list) – Columns to keep in returned dataframes

Returns:

A tuple of pandas dataframes, one for each strand of chromosome chrnam.

Return type:

pandas.DataFrame, pandas.DataFrame

coalispr.bedgraph_analyze.genom.retrieve_all_chr_regions_from_tsv(chrnam, tresh=LOG2BG, maincut=UNSPECLOG10, tag=TAG)

Read all specified regions for a chromosome from tsv files.

Get stored information on upper and ‘lower’ boundaries for all regions of contiguous reads. These have previously been obtained by assessing bedgraph data and saved as TSV files by gather_regions functions in module coalispr.bedgraph_analyze.compare.

Parameters:
  • chrnam (str) – Name of the chromosome for which the information has been stored.

  • tresh (int (default: LOG2BG)) – Applied threshold above which values have been taken into account.

  • maincut (float (default: UNSPECLOG10)) – Used minimal log10 difference between specific and unspecific values.

  • tag (str (default: TAG)) – Tag to indicate kind of mapped reads analysed: collapsed or uncollapsed

Returns:

A tuple of pandas dataframes, one for each strand of chromosome chrnam.

Return type:

pandas.DataFrame, pandas.DataFrame

class coalispr.bedgraph_analyze.genom.Track(chrnam)

A class to represent a track of segments.

A track provides input for a matplotlib.collections.BrokenBarHCollection used in coalispr.bedgraph_analyze.bedgraph_plotting.

chrnam

The name of the chromosome for which the track is made

Type:

str

df1, df2

Tuple of pandas dataframes, for strand 1 (PLUS) and strand 2 (MINUS).

Type:

pandas.DataFrame, pandas.DataFrame

df

Pandas dataframe, used for obtaining segment information.

Type:

pandas.DataFrame

chrnam
df = None
textlist(clickp)

Show list of information for regions under the cursor.

Parameters:

clickp (int) – X-coordinate of point under cursor; registered after mouse-click.

Returns:

List of information associated with segments under the cursor.

Return type:

list

get_segments(df)

Return list of segments with information that form the track.

Parameters:

df (pandas.DataFrame) – A pandas dataframe with segment information

Returns:

List of lower boundaries and length of segments, parseable by matplotlib.collections.BrokenBarHCollection.

Return type:

list
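
As an illustration of the returned shape: matplotlib's Axes.broken_barh takes its x-ranges as (xmin, xwidth) pairs, so boundaries stored as (lower, upper) have to be converted to a start plus a length. The sketch below assumes the boundaries are available as plain tuples; to_broken_bars and the example values are made up for illustration and are not taken from this module.

```python
def to_broken_bars(regions):
    """Turn (lower, upper) boundaries into (start, width) pairs."""
    return [(lower, upper - lower) for lower, upper in regions]


regions = [(100, 250), (400, 450)]  # made-up boundaries
bars = to_broken_bars(regions)
print(bars)  # [(100, 150), (400, 50)]
```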

get_ctext(clickp)

Describe (first) region under the cursor after clicking (if any).

Parameters:

clickp (int) – X-coordinate of point under cursor; registered after mouse-click.

Returns:

Text associated with first entry of listed regions under cursor.

Return type:

str
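
The relation between textlist and get_ctext can be pictured with a small sketch. It assumes segments are held as (lower, upper, text) tuples, that both boundaries are inclusive, and that an empty string is returned when nothing is under the cursor; all three are assumptions, and texts_at and first_text are hypothetical names rather than methods of Track.

```python
def texts_at(clickp, segments):
    """Return descriptions of all segments whose span contains clickp."""
    return [text for lower, upper, text in segments if lower <= clickp <= upper]


def first_text(clickp, segments):
    """Return the description of the first hit, or '' if there is none."""
    hits = texts_at(clickp, segments)
    return hits[0] if hits else ""


# Made-up segments; two of them overlap around position 220.
segments = [(100, 250, "region A"), (200, 300, "region B"), (400, 450, "region C")]
print(texts_at(220, segments))    # ['region A', 'region B']
print(first_text(220, segments))  # region A
```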

class coalispr.bedgraph_analyze.genom.SegmentTrack(chrnam)

Bases: Track

A class to represent a track of countable segments.

chrnam

The name of the chromosome for which the track is made

Type:

str

dfs, dfa

Tuple of pandas dataframes, for strand 1 (plus) and strand 2 (minus).

Type:

pandas.DataFrame, pandas.DataFrame

df

Pandas dataframe, used for obtaining segment information.

Type:

pandas.DataFrame

chrnam
df
textlist(clickp)

Show list of information for regions under the cursor.

Parameters:

clickp (int) – X-coordinate of point under cursor; registered after mouse-click.

Returns:

List of information associated with segments under the cursor.

Return type:

list

class coalispr.bedgraph_analyze.genom.GTFtrack(chrnam, kind, strand)

Bases: Track

A Track class to represent a track with annotation info from a GTF file.

kind

The kind of GTF information, for reference, or regions with SPECIFIC or UNSPECIFIC reads.

Type:

str

strand

The strand with the segment the annotation refers to.

Type:

str

kind
strand
textlist(clickp)

Get list of information for regions under the cursor.

Method overrides that of the parent class, using another function.

Parameters:

clickp (int) – X-coordinate of point under cursor; registered after mouse-click.

Returns:

List of information associated with segments under the cursor.

Return type:

list

coalispr.bedgraph_analyze.genom.ref_at_clickpoint(chrnam, clickp, kind)

Get reference label for segment with clickpoint.

Parameters:
  • chrnam (str) – The name of the chromosome for which GTF info is retrieved.

  • clickp (int) – The click point of the segment under the cursor.

  • kind (str) – The kind of specified reads for which a GTF could be prepared: (SPECIFIC, UNSPECIFIC, REFERENCE)

Returns:

Lists with gene_id’s from the GTF described by kind; one list for each strand of chromosome chrnam.

Return type:

tuple of lists

coalispr.bedgraph_analyze.genom.ref_in_segment(chrnam, segm, kind, ref)

Get reference labels for segment.

Parameters:
  • chrnam (str) – The name of the chromosome for which GTF info is retrieved.

  • segm ((int,int)) – The segment to check.

  • kind (str) – The kind of specified reads for which a GTF could be prepared: (SPECIFIC, UNSPECIFIC)

  • ref (bool) – Include general REFERENCE GTF for annotations (slow). 0: No; 1: Yes.

Returns:

Lists with gene_id’s from the GTF described by kind; one list for each strand of chromosome chrnam.

Return type:

tuple of lists
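
A lookup of this kind comes down to an interval-overlap test per strand. The sketch below is only an illustration under stated assumptions: annotations are held in a dict keyed by strand ("+" and "-") with (lower, upper, gene_id) tuples, and boundaries are inclusive. The names overlaps and gene_ids_in_segment, and the data, are hypothetical and do not reflect how the module reads GTF files.

```python
def overlaps(segm, feature):
    """True if two inclusive (lower, upper) intervals share a position."""
    s_lower, s_upper = segm
    f_lower, f_upper = feature
    return s_lower <= f_upper and f_lower <= s_upper


def gene_ids_in_segment(segm, annotations):
    """Return (plus_ids, minus_ids) for features overlapping segm.

    annotations: {strand: [(lower, upper, gene_id), ...]} (assumed layout).
    """
    return tuple(
        [gid for lower, upper, gid in annotations[strand]
         if overlaps(segm, (lower, upper))]
        for strand in ("+", "-")
    )


# Made-up annotation features for one chromosome.
annotations = {
    "+": [(100, 500, "geneA"), (900, 1200, "geneB")],
    "-": [(450, 800, "geneC")],
}
print(gene_ids_in_segment((480, 950), annotations))
# (['geneA', 'geneB'], ['geneC'])
```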