coalispr.resources.share.no_empty_fa
The set of shell commands below works on a small file, but uses too much RAM when applied to a very long fastq file with sequencing data.
#for i in SRR644*; do
# echo "Processing ${i} .."
# infile="${i}/${i}"
# declare -a LINNUM
# # Gather line numbers for empty reads in reverse order to allow for sequential deletion
# # Header lines of reads begin with @, for example:
# # @SN7001365:465:H5KKCBCX2:1:1107:3039:1981 1:N:0:ANCACG
# LINNUM+=($(gunzip -c ${infile}.fastq.gz | grep -nB 1 '^ *$' | grep '@'| cut -d '-' -f 1 | sort -nr ))
# gunzip -c ${infile}.fastq.gz > tmp
# for lino in "${LINNUM[@]}"; do
# #echo "Line number $lino will be deleted"
# range="${lino}, $((lino +3))"
# sed -i "$range d" tmp;
# done
# cp tmp ${infile}-uncollapsed.fastq
# gzip ${infile}-uncollapsed.fastq
# rm tmp
#done
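The reason for sorting the line numbers in reverse can be illustrated with a short Python sketch. It is only an illustration, not part of this module: the helper name `delete_records` is hypothetical, and the file content is assumed to be held as a list of lines, with each fastq record taking up four lines.

```python
def delete_records(lines, starts, size=4):
    """Remove records of `size` lines that begin at the 1-based line numbers in `starts`.

    Deleting from the highest line number downwards keeps the lower line
    numbers valid, which is what `sort -nr` achieves in the shell loop.
    """
    for start in sorted(starts, reverse=True):
        # Convert the 1-based line number to a 0-based slice.
        del lines[start - 1:start - 1 + size]
    return lines


if __name__ == "__main__":
    # Three dummy records of four lines each; records 1 and 3 are to be removed.
    records = [f"record{r}-line{i}" for r in (1, 2, 3) for i in (1, 2, 3, 4)]
    for line in delete_records(records, [1, 9]):
        print(line)
```

Deleting in ascending order would instead shift every later record up by four lines, so the remaining line numbers would no longer point at the intended records.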
For fasta files the task is easier and the files are smaller: collapsing merges all empty reads into a single entry.
#for i in SRR644*; do
# echo "Processing ${i} .."
# infile="${i}/${i}"
# gunzip -c ${infile}.fastq.gz | pyFastqDuplicateRemover.py -o tmp.fasta
# # there will be only one unique fasta read without any length; find its name
# LINNAM=$(grep -B1 '^ *$' tmp.fasta)
# # use double quote to expand variable in sed command
# sed "/$LINNAM/,+1 d" tmp.fasta > ${infile}-collapsed.fasta
# rm tmp.fasta
#done
This Python script covers both kinds of sequencing files; it copies the non-empty sequences (in fasta or fastq format) directly to a new file, without the need to store the information in RAM for filtering and sorting.
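The streaming idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of `clean_file`: all function names (`open_text`, `drop_empty_fastq`, `drop_empty_fasta`, `drop_empty_reads`) are hypothetical, fastq records are assumed to span exactly four lines, and the format is guessed from the first character of the file (`@` for fastq, otherwise fasta).

```python
import gzip
from itertools import chain, islice


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file (assumption: '.gz' suffix marks gzip)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def drop_empty_fastq(src, dst):
    """Copy 4-line fastq records from src to dst, skipping reads without sequence."""
    src = iter(src)
    kept = 0
    while True:
        record = list(islice(src, 4))  # only one record is held in memory
        if not record:
            break
        # A truncated trailing record (fewer than 4 lines) is dropped as well.
        if len(record) == 4 and record[1].strip():
            dst.writelines(record)
            kept += 1
    return kept


def _write_if_filled(dst, header, seq):
    """Write one fasta record if it has a header and at least one sequence line."""
    if header is None or not seq:
        return 0
    dst.write(header)
    dst.writelines(seq)
    return 1


def drop_empty_fasta(src, dst):
    """Copy fasta records from src to dst, skipping headers without sequence."""
    kept = 0
    header, seq = None, []
    for line in src:
        if line.startswith(">"):
            kept += _write_if_filled(dst, header, seq)
            header, seq = line, []
        elif header is not None and line.strip():
            seq.append(line)  # blank lines are not carried over
    kept += _write_if_filled(dst, header, seq)
    return kept


def drop_empty_reads(in_path, out_path):
    """Filter a fasta or fastq file record by record; return the number of reads kept."""
    with open_text(in_path) as src, open_text(out_path, "wt") as dst:
        first = src.readline()
        lines = chain([first], src)  # put the peeked line back in front
        if first.startswith("@"):
            return drop_empty_fastq(lines, dst)
        return drop_empty_fasta(lines, dst)
```

Because each record is written or discarded as soon as it has been read, memory use stays roughly constant regardless of file size, in contrast to the shell loop that first gathers all line numbers of empty reads.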
Functions
| clean_file(infile) | Take out empty sequences left after adapter removal. |
Module Contents
- coalispr.resources.share.no_empty_fa.clean_file(infile)
Take out empty sequences left after adapter removal.
- Parameters:
infile (str) – Filename of fasta file to be corrected
- coalispr.resources.share.no_empty_fa.main(args)