coalispr.resources.share.sub_gtf

Script to extract lines for a particular property from a general annotation file, outputting a ‘sub’-GTF. It is a Python alternative to the bash script below.

#! /bin/bash
inputgz=$1

if [[ -z $inputgz ]]; then
  echo "Please provide a compressed annotation file (.gtf.gz) as input" >&2
  exit 1
fi

# collect entries for common ncRNAs
gunzip -cf "$inputgz" | grep snRNA > tmp
gunzip -cf "$inputgz" | grep snoRNA >> tmp
gunzip -cf "$inputgz" | grep tRNA >> tmp
gunzip -cf "$inputgz" | grep rRNA >> tmp
sort -k 1.4h,1 -k 4n,4 -k 5nr,5 tmp > mouse_ncRNAs.gtf
rm tmp
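
As a rough illustration of what a Python equivalent of this script could look like, the sketch below filters a gzipped GTF for the same four ncRNA labels and sorts the hits by chromosome, start (ascending) and end (descending). It is not the module's actual implementation: `extract_ncrnas`, `sort_key` and `NCRNA_KINDS` are hypothetical names, and the sketch assumes tab-separated lines with `chr`-prefixed sequence names. Unlike the bash version it writes each matching line only once, skips `#` comment lines, expects compressed input, and places unnumbered chromosomes (X, Y, M) after the numbered ones.

```python
import gzip

# Substrings searched for by the grep calls in the bash script.
NCRNA_KINDS = ("snRNA", "snoRNA", "tRNA", "rRNA")


def sort_key(line):
    """Order by chromosome, then start (ascending), then end (descending)."""
    fields = line.split("\t")
    # Assumption: sequence names look like 'chr1', 'chr10', 'chrX'.
    chrom = fields[0][3:] if fields[0].startswith("chr") else fields[0]
    # Numbered chromosomes sort numerically and before named ones.
    chrom_rank = (0, int(chrom), "") if chrom.isdecimal() else (1, 0, chrom)
    return chrom_rank, int(fields[3]), -int(fields[4])


def extract_ncrnas(gtf_gz, out_gtf, kinds=NCRNA_KINDS):
    """Write lines of gtf_gz that mention any of kinds to out_gtf, sorted.

    Returns the number of lines written.
    """
    with gzip.open(gtf_gz, "rt") as handle:
        hits = [
            line for line in handle
            if not line.startswith("#") and any(kind in line for kind in kinds)
        ]
    hits.sort(key=sort_key)
    with open(out_gtf, "w") as out:
        out.writelines(hits)
    return len(hits)
```

Under these assumptions, `extract_ncrnas("annotation.gtf.gz", "mouse_ncRNAs.gtf")` would produce an output comparable to that of the bash script; the input file name here is only a placeholder.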

Functions

create_gtf(kind, get_all, reference, features)

Create a GTF file for one kind of feature by extracting the matching annotations from a reference GTF.

main(args)

Module Contents

coalispr.resources.share.sub_gtf.create_gtf(kind, get_all, reference, features)

Create a GTF file for one kind of feature by extracting the matching annotations from a reference GTF.

Parameters:
  • kind (str) – Kind of feature for which a GTF is made. Used as output name.

  • reference (str) – Filename of the annotation reference.

  • features (str) – List of features to extract annotations for, passed as a string from which the list can be recovered.

Returns:

An annotation with the following fields:

seqname  - The name of the sequence. Must be a chromosome or
         scaffold.
source   - The program that generated this feature.
feature  - The name of this type of feature. Some examples of
         standard feature types are "CDS", "start_codon",
         "stop_codon", and "exon".
start    - The starting position of the feature in the
         sequence. The first base is numbered 1.
end      - The ending position of the feature (inclusive).
score    - A score between 0 and 1000. If the track line
         useScore attribute is set to 1 for this annotation
         data set, the score value will determine the level
         of gray in which this feature is displayed (higher
         numbers = darker gray). If there is no score value,
         enter ".".
strand   - Valid entries include '+', '-', or '.' (for don't
         know/don't care).
frame    - If the feature is a coding exon, frame should be a
         number between 0-2 that represents the reading
         frame of the first base. If the feature is not a
         coding exon, the value should be '.'.
comments - gene_id "Em:U62317.C22.6.mRNA"; transcript_id
         "Em:U62317.C22.6.mRNA"; exon_number 1

Return type:

GTF file
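
To make the field layout listed under Returns concrete, here is a minimal sketch that splits one GTF line into those nine fields and unpacks the key/value pairs of the last one. `GtfRecord`, `parse_gtf_line` and `parse_attributes` are hypothetical helpers written for illustration, not part of `sub_gtf`; the sketch assumes tab-separated fields and values that do not contain a semicolon.

```python
from typing import NamedTuple


class GtfRecord(NamedTuple):
    """The nine GTF fields, named as in the list above."""
    seqname: str
    source: str
    feature: str
    start: int   # 1-based
    end: int     # inclusive
    score: str   # kept as text because "." means "no score"
    strand: str
    frame: str
    comments: str


def parse_gtf_line(line):
    """Split one tab-separated GTF line into its nine fields."""
    fields = line.rstrip("\n").split("\t", 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, comments = fields
    return GtfRecord(seqname, source, feature, int(start), int(end),
                     score, strand, frame, comments)


def parse_attributes(comments):
    """Turn 'key "value"; key value' pairs of the last field into a dict."""
    attributes = {}
    for item in comments.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes
```

Applied to a line whose last field is the example shown above, `parse_attributes(record.comments)` would give a dictionary with the keys `gene_id`, `transcript_id` and `exon_number`, all with string values.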

coalispr.resources.share.sub_gtf.main(args)