Coalispr introduction

Coalispr (COunt ALIgned SPecified Reads) is a Python tool to clean up (small) RNA sequencing results. It can visualize over 100 bedgraphs in one panel [1] and helps to retrieve read counts from associated bam files without reliance on reference features (GTF annotations).

Features

Fast and voluminous

Reduced resolution decreases memory-use and speeds up comparison.

Handle a large number of samples simultaneously.

Count specified aligned reads [2].

Count collapsed instead of single reads.

Input files are bedgraph files

Bedgraph-data are imported for processing with Pandas.

Reads are collected by their mid-point into bins [4].

A common index [3] enables comparison between all samples.

Comparisons are done per chromosome [5] strand.

Specific reads (S, M) are separated from unspecific reads (U) by checking bin overlap first and signal difference second.

For fast reuse, created datastructures are stored in a binary format.
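The steps in this list can be sketched in plain Python. This is a minimal illustration using only the standard library, not Coalispr's own code (which relies on Pandas). The bin width, the fold threshold `min_fold`, the sample names and the helper names `bin_bedgraph`, `align_samples` and `classify` are all assumptions made for the example. Strand handling and the M category are left out, and treating negative-control samples as the source of unspecific signal is likewise an assumption.

```python
from collections import defaultdict

BIN_SIZE = 50  # assumed bin width in nt, chosen for illustration only


def bin_bedgraph(lines, bin_size=BIN_SIZE):
    """Sum bedgraph signal per (chromosome, bin start); intervals are placed by mid-point."""
    binned = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue  # skip header, comment and malformed lines
        chrom, start, end, value = fields[:4]
        midpoint = (int(start) + int(end)) // 2
        binned[(chrom, midpoint // bin_size * bin_size)] += float(value)
    return dict(binned)


def align_samples(samples):
    """Build one sorted index shared by all samples; bins a sample lacks get 0.0."""
    index = sorted(set().union(*samples.values()))
    table = {name: [bins.get(key, 0.0) for key in index]
             for name, bins in samples.items()}
    return index, table


def classify(index, table, positives, negatives, min_fold=2.0):
    """Label bins 'S' (specific) or 'U' (unspecific): overlap first, then signal difference."""
    labels = {}
    for i, key in enumerate(index):
        pos = max(table[name][i] for name in positives)
        neg = max(table[name][i] for name in negatives)
        if pos == 0 and neg == 0:
            continue  # no signal in any of the compared samples
        if neg == 0 or pos >= min_fold * neg:
            labels[key] = "S"  # no overlap with a negative control, or clearly stronger
        else:
            labels[key] = "U"
    return labels


# Hypothetical bedgraph lines: chromosome, start, end, value
wt = ["chr1\t0\t20\t40", "chr1\t60\t80\t8"]
neg = ["chr1\t10\t30\t4", "chr1\t60\t90\t6", "chr1\t200\t220\t3"]

index, table = align_samples({"wt": bin_bedgraph(wt), "neg": bin_bedgraph(neg)})
print(classify(index, table, positives=["wt"], negatives=["neg"]))
# {('chr1', 0): 'S', ('chr1', 50): 'U', ('chr1', 200): 'U'}
```

Because every sample is projected onto the same sorted index, the comparison reduces to looking up the same position in each sample's list.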

Interactive visualization

All bedgraphs for a chromosome are shown (with Matplotlib).

Toggle data display.

Load GTF files for overlap with annotated features.

Signal scale: normal or log2.

Include reference RNA sequencing data (R).

Save snapshots as svg, jpg, pdf or png.

Counting

Map contiguous regions of unspecific or specific reads.

These segment definitions are stored in tsv files to:

Retrieve specified reads from bam files with Pysam.

Split segments into a number of bins to profile coverage.

Collect counts for various read properties and save to tsv files.

Obtain counts for particular chromosomal regions.

Thus, counting relies on genome coordinates, not GTF references.
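How bins might be turned into segments can be sketched as follows. This is a hedged illustration with the standard library only. The helper names `map_segments`, `split_segment` and `write_tsv`, the `max_gap` parameter and the tsv column names are hypothetical and do not describe Coalispr's actual file layout. Retrieving the reads themselves is omitted; the resulting coordinates could, for example, be passed to Pysam's `AlignmentFile.fetch(contig, start, stop)`.

```python
import csv
import io


def map_segments(bin_starts, bin_size, max_gap=0):
    """Merge bin start positions into contiguous (start, end) segments.

    A bin starting more than max_gap nt after the current segment end opens a new segment.
    """
    segments = []
    for start in sorted(bin_starts):
        if segments and start - segments[-1][1] <= max_gap:
            segments[-1][1] = start + bin_size  # extend the current segment
        else:
            segments.append([start, start + bin_size])
    return [tuple(segment) for segment in segments]


def split_segment(start, end, parts):
    """Divide a segment into `parts` near-equal sub-intervals to profile coverage."""
    edges = [start + (end - start) * i // parts for i in range(parts + 1)]
    return list(zip(edges[:-1], edges[1:]))


def write_tsv(chrom, strand, segments, handle):
    """Store segment definitions as tab-separated rows (column names are illustrative)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["chrom", "strand", "start", "end"])
    for start, end in segments:
        writer.writerow([chrom, strand, start, end])


segments = map_segments([0, 50, 100, 300, 350], bin_size=50)
print(segments)                      # [(0, 150), (300, 400)]
print(split_segment(0, 150, 3))      # [(0, 50), (50, 100), (100, 150)]

buffer = io.StringIO()               # stands in for a file opened for writing
write_tsv("chr1", "+", segments, buffer)
```

Only coordinates enter these steps, which is why no GTF reference is needed for counting.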

Analysis

Count-outputs can be diagrammed (with Matplotlib and Seaborn).

Compare numbers for reads, cDNAs, introns, multimappers.

Check length-distributions of reads, also for a particular genomic region.

Annotate count files with gene-information from GTF references.
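The difference between counting reads and counting collapsed reads (cDNAs) can be illustrated with a short sketch. It assumes, for simplicity, that reads with identical sequence stem from the same cDNA; how collapsing is actually done depends on the preparation steps and is not specified here. The function name `length_distribution` is hypothetical.

```python
from collections import Counter


def length_distribution(reads, collapsed=False):
    """Return {read length: count}; with collapsed=True identical sequences count once."""
    sequences = set(reads) if collapsed else reads
    return dict(sorted(Counter(len(seq) for seq in sequences).items()))


reads = ["AAAC", "AAAC", "GGT", "GGTCA"]   # made-up sequences
print(length_distribution(reads))                  # {3: 1, 4: 2, 5: 1}
print(length_distribution(reads, collapsed=True))  # {3: 1, 4: 1, 5: 1}
```

Restricting the input to reads from one segment would give the length distribution for a particular genomic region.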

For a rationale and application of Coalispr see the essay: ‘Bio‑informatics: Integrate negative controls to get the good data’.

Requirements

Preparation (used in the normal work flow and in the Tutorials):
Bash, Flexbar, pyCRAC, Samtools, SRA-toolkit, STAR (or another aligner).
Run-time:
Python, NumPy, Pandas, Matplotlib, Pysam, Seaborn [6].
Enough RAM for loading genome data.

Numexpr, the numerical expression evaluator for NumPy, can help to get the most out of your machine's computing capabilities [7].

Installation

Coalispr can be downloaded from Codeberg.org and PyPI.org.

Configuration files with properties will have to be edited by the user to analyze their own data (see Tutorials). Therefore, this package is best installed locally in user space, not system wide. Alternatively, the program can be installed in a virtual environment [8].

After extraction of the source archive, go to the coalispr project folder with the setup.py and pyproject.toml files and run in a terminal (as user):

python3 -m pip install --editable .

This also makes it easy to adapt source code and directly test the changes.

A script, callable from the command line with coalispr, will be installed locally [9] (alternatively, you can run python3 -m coalispr instead of coalispr).

With installs of pandas-2.x, please link coalispr/resources/numeric.py into python3/site-packages/pandas/core/indexes/ (an example command is given under Notes).
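The link can also be created from Python, which avoids hard-coding the Python version in the path. The following is a sketch under assumptions, not part of Coalispr: the helper names `pandas_indexes_dir` and `link_numeric` are hypothetical, and the code only mimics `ln -s -r` by creating a relative symlink.

```python
import importlib.util
import os
from pathlib import Path


def pandas_indexes_dir():
    """Locate pandas/core/indexes for the active interpreter; None if pandas is absent."""
    spec = importlib.util.find_spec("pandas")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "core" / "indexes"


def link_numeric(source, indexes_dir):
    """Create a relative symlink to `source` inside `indexes_dir`."""
    source = Path(source).resolve()
    target = Path(indexes_dir).resolve() / source.name
    if target.is_symlink() or target.exists():
        return target  # never overwrite a file that is already there
    target.symlink_to(os.path.relpath(source, target.parent))
    return target


# Possible use, run from the coalispr project folder (commented out on purpose):
# indexes = pandas_indexes_dir()
# if indexes is not None:
#     link_numeric("coalispr/resources/numeric.py", indexes)
```

Run inside an activated virtual environment, `pandas_indexes_dir()` resolves to that environment's pandas installation.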

Installation in a virtual environment is described in INSTALL.txt.

Run Coalispr

In a terminal, run the following command, which shows the various options for Coalispr:

coalispr -h

See the How-to guides and the Tutorials for more information.

Contribute

All resources for Coalispr are accessible at Codeberg.org.

Source Code: https://codeberg.org/coalispr/coalispr
Issue Tracker: https://codeberg.org/coalispr/coalispr/issues

Documentation

This documentation is online at https://coalispr.codeberg.page/
Sources for the documentation can be found at https://codeberg.org/coalispr/coalispr/docs
Datasets supporting the tutorials are published at Zenodo.org under DOI 10.5281/zenodo.12822543

Licences

The program source code is published under the European Union Public Licence (EUPL)
The documentation is under the Creative Commons Attribution License (CC-BY-4.0)

Author

The author was trained as a molecular biologist and, from that angle, became involved in high-throughput analysis (see About).

Notes

# cd to folder with virtual environments
# create environment 1 (env1) with module venv
    bash-5.2$ python3 -m venv env1
# activate env1
    bash-5.2$ source env1/bin/activate
    (env1) bash-5.2$
# extract dist/package
    (env1) bash-5.2$ tar -xvzf /<path_to>/coalispr-$VERSION.tar.gz
    (env1) bash-5.2$ cd coalispr-$VERSION
# install with:
    (env1) bash-5.2$ python3 -m pip install --editable .
# add link when using pandas-2.x (paths are relative to the project folder entered above)
    (env1) bash-5.2$ ln -s -r coalispr/resources/numeric.py \
                    -t ../env1/lib/python3.11/site-packages/pandas/core/indexes/
# run:
    (env1) bash-5.2$ coalispr -h
# stop the virtual environment:
    (env1) bash-5.2$ deactivate
     bash-5.2$