coalispr documentation

coalispr.resources.share.no_empty_fa

The set of shell commands below works on a small file, but uses too much RAM for a very long file with sequencing data in fastq format.

#for i in SRR644*; do
#  echo "Processing ${i} .."
#    infile="${i}/${i}"
#    declare -a LINNUM
#    # Gather line numbers for empty reads in reverse order to allow for sequential deletion
#    # Header lines of reads begin with @:
#    # @SN7001365:465:H5KKCBCX2:1:1107:3039:1981 1:N:0:ANCACG
#    LINNUM=($(gunzip -c ${infile}.fastq.gz | grep -nB 1 '^ *$' | grep '@'| cut -d '-' -f 1 | sort -nr ))
#    gunzip -c ${infile}.fastq.gz > tmp
#    for lino in "${LINNUM[@]}"; do
#      #echo "Line number $lino will be deleted"
#      range="${lino}, $((lino +3))"
#      sed -i "$range d" tmp;
#    done
#    cp tmp ${infile}-uncollapsed.fastq
#    gzip ${infile}-uncollapsed.fastq
#    rm tmp
#done

For fasta files this is easier, and the files are smaller: collapsing merges all empty reads into a single entry, which can then be removed by name.

#for i in SRR644*; do
#  echo "Processing ${i} .."
#    infile="${i}/${i}"
#    gunzip -c ${infile}.fastq.gz | pyFastqDuplicateRemover.py -o tmp.fasta
#    # there will be only one unique fasta read without any length; find its name
#    LINNAM=$(grep -B1 '^ *$' tmp.fasta)
#    # use double quote to expand variable in sed command
#    sed "/$LINNAM/,+1 d" tmp.fasta > ${infile}-collapsed.fasta
#    rm tmp.fasta
#done

This Python script covers both kinds of sequencing files; it copies the non-empty sequences (in fasta or fastq format) directly to a new file, without the need to hold the data in RAM for filtering and sorting.
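The streaming approach described above can be sketched as follows. This is a minimal illustration, not the actual implementation of `clean_file`: the names `open_text`, `records` and `drop_empty_reads` are hypothetical, and the sketch assumes well-formed input in which every record has a fixed number of lines (4 for fastq, 2 for fasta with unwrapped sequences).

```python
import gzip
from pathlib import Path


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def records(handle, lines_per_record):
    """Yield one record (a list of lines) at a time from an open file."""
    while True:
        record = [handle.readline() for _ in range(lines_per_record)]
        if not record[0]:  # readline() returns "" at end of file
            return
        yield record


def drop_empty_reads(infile, outfile, lines_per_record=4):
    """Copy records with a non-empty sequence line from infile to outfile.

    Assumes 4 lines per record for fastq and 2 for single-line fasta.
    Only one record is held in memory at any time.
    Returns a tuple with the number of (kept, dropped) records.
    """
    kept = dropped = 0
    with open_text(infile) as src, open_text(outfile, "wt") as dst:
        for record in records(src, lines_per_record):
            if record[1].strip():  # second line of a record is the sequence
                dst.writelines(record)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Under these assumptions, a call such as `drop_empty_reads("reads.fastq.gz", "reads-clean.fastq.gz")` would handle fastq input, and passing `lines_per_record=2` would handle fasta input; the file names here are placeholders. Because records are written as soon as they are read, memory use stays constant regardless of file length, unlike the `sed -i` loop above, which rewrites the whole temporary file once for every empty read.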

Functions

`clean_file`(infile)	Take out empty sequences left after adapter removal.
`main`(args)

Module Contents

coalispr.resources.share.no_empty_fa.clean_file(infile)

Take out empty sequences left after adapter removal.

Parameters: infile (str) – Filename of fasta file to be corrected

coalispr.resources.share.no_empty_fa.main(args)

Copyright © 2022-2024, Rob van Nues
