coalispr.resources.share.no_empty_fa

The set of shell commands below works on a small file but uses too much RAM for a very long file with sequencing data in fastq format.

#for i in SRR644*; do
#  echo "Processing ${i} .."
#    infile="${i}/${i}"
#    # Start with an empty array for each file; a bare 'declare -a' would keep earlier entries
#    declare -a LINNUM=()
#    # Gather line numbers for empty reads in reverse order to allow for sequential deletion
#    # Header lines begin with @, for example:
#    # @SN7001365:465:H5KKCBCX2:1:1107:3039:1981 1:N:0:ANCACG
#    LINNUM+=($(gunzip -c ${infile}.fastq.gz | grep -nB 1 '^ *$' | grep '@'| cut -d '-' -f 1 | sort -nr ))
#    gunzip -c ${infile}.fastq.gz > tmp
#    for lino in "${LINNUM[@]}"; do
#      #echo "Line number $lino will be deleted"
#      range="${lino}, $((lino +3))"
#      sed -i "$range d" tmp;
#    done
#    cp tmp ${infile}-uncollapsed.fastq
#    gzip ${infile}-uncollapsed.fastq
#    rm tmp
#done

For fasta files this is easier, and the files are smaller: collapsing merges all empty reads into a single entry, so only one record has to be removed.

#for i in SRR644*; do
#  echo "Processing ${i} .."
#    infile="${i}/${i}"
#    gunzip -c ${infile}.fastq.gz | pyFastqDuplicateRemover.py -o tmp.fasta
#    # there will be only one unique fasta read without any length; find its name
#    LINNAM=$(grep -B1 '^ *$' tmp.fasta)
#    # use double quote to expand variable in sed command
#    sed "/$LINNAM/,+1 d" tmp.fasta > ${infile}-collapsed.fasta
#    rm tmp.fasta
#done

This Python script covers both kinds of sequencing files; it copies the non-empty sequences (in fasta or fastq format) directly to a new file, without needing to hold the data in RAM for filtering and sorting.
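The streaming idea can be illustrated with a minimal sketch. This is not the module's actual code: the helper name `copy_non_empty`, the `RECORD_LINES` table and the returned counts are hypothetical, and the sketch assumes four-line fastq records and fasta records with the sequence on a single line. Only one record is held in memory at a time.

```python
import io
from itertools import islice

# Lines per record, keyed by the first character of the header line.
# Assumption: fastq records span 4 lines, fasta records 2 lines.
RECORD_LINES = {"@": 4, ">": 2}


def copy_non_empty(src, dst):
    """Copy records with a non-empty sequence from src to dst.

    Hypothetical helper for illustration; src and dst are text-mode
    file objects. Returns a (kept, dropped) tuple of record counts.
    """
    lines = iter(src)
    kept = dropped = 0
    for header in lines:
        n_lines = RECORD_LINES.get(header[:1])
        if n_lines is None:
            # Stray line between records (e.g. a trailing blank line).
            continue
        # Read the remaining lines of this record; the sequence comes first.
        rest = list(islice(lines, n_lines - 1))
        if rest and rest[0].strip():
            dst.write(header)
            dst.writelines(rest)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Small in-memory example: the second read is empty and gets skipped.
fastq = io.StringIO(
    "@read1 1:N:0:ANCACG\nACGT\n+\nIIII\n"
    "@read2 1:N:0:ANCACG\n\n+\n\n"
    "@read3 1:N:0:ANCACG\nGG\n+\nII\n"
)
cleaned = io.StringIO()
counts = copy_non_empty(fastq, cleaned)
print(counts)
```

For compressed input, the same helper could be fed a handle from the standard library's `gzip.open(path, "rt")`, which decompresses while reading, so no temporary uncompressed copy is needed.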

Functions

clean_file(infile)

Take out empty sequences left after adapter removal.

main(args)

Module Contents

coalispr.resources.share.no_empty_fa.clean_file(infile)

Take out empty sequences left after adapter removal.

Parameters:

infile (str) – Filename of the fasta or fastq file to be corrected

coalispr.resources.share.no_empty_fa.main(args)