[Bioc-sig-seq] filter fastQ based on quality score?

Jennifer Wernegreen jwernegreen at mbl.edu
Wed Oct 28 21:00:56 CET 2009


Hello,

I'm hoping that someone can offer guidance about filtering reads from  
fastQ files based on quality score.

Using Shortread (or another program in Bioconductor), is it possible  
to filter a fastQ file to remove reads that have an average quality  
score below a certain value, and then output the remaining reads as a  
fastQ file?   If so, could one apply this filter to paired-end data?   
For example, it would be helpful to remove a read pair when either or  
both individual reads in that pair fall below the quality threshold.

I see the defined filters in Shortread (strandFilter, nFilter, etc),  
but I'm not sure how to create a custom filter based on average  
quality score.  My modest start on the problem and a few lines from  
our fastQ files are below.

Any hints would be very much appreciated.

Jen



library(ShortRead)
reads <- readFastq("~/", pattern="query.fastq")
    # import reads from fastq file into ShortReadQ Object
    # replace "~/" with your path and query.fastq with filename

#### ?? how to filter out (remove) reads if average quality score <25?  
(for example) ####
#### ?? how to apply this filter to paired-end data ####

writeFastq(filteredReads, "output.fastq")
# export filtered reads to fastq file



A couple of sample lines from the two fastq files...  We have about  
7.5 million reads per file.  ("ANNACG" notes the adaptor for this  
library, which oddly had a consistent N in the sequence.)


paired end file 1

@HWW-T7400-009_1_120_1879_1134#ANNACG/1
TGTTGTAACTCTTTAATTACCCACNGAGNNNNATTCATAATAATTGCTNNNNNNNNNNANTTATATTAAAACTGATATCATACAAAAATAAATAATCATT
+
AB?B>A>@@BABBC??@BABB>3=&561&&&&4 at BBBB>BB%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@HWW-T7400-009_1_120_1879_1950#ANNACG/1
ACTATTATGTCTATTCGAGAGCATNATCNNNNGTACTGGATTATCGATNNNNNNNNNNGNGACATCAATTTTAGTAAACAGTTTCCATTCCTTACTTTAT
+
ABB9BBABBABB<BABB?B8>A?8&9A;&&&&9=A>BB@@A%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




paired end file 2

@HWW-T7400-009_1_120_1879_1134#ANNACG/2
GCGCAGTATATAATTTCCATTGTTTTTCTCAAGTATAACATACTGTATGATATTGTCGGATTAAAATAAAAATAAAATATTAATGATTATTTATTTTTGT
+
?CCCBBABA?BBCBCC=B?CAC at B87ABBC?7>B=C at 5?@@4=BBBCBC9BBCB?AABB;=AA@;>B<5@? 
@?6-?BBBBB>=BC???B=-AB?9=9ABB
@HWW-T7400-009_1_120_1879_1950#ANNACG/2
TCCTCGACATGCACTGAATAATTCATTGTGGATAGGATCATATATTACACCTATTTCAATATTTTTGTTTTGTTGTACAGCAATAGATAAAGTAAGGAAT
+
BBA@@B4<3BB?7=?=<=B7ABBB>BA at A@A1;BBA?AA<<=@>==?9B?>B@>A@=0=6*A at B:89)>? 
5B?>B@>=@BB>4B2A=-53>B*&80 at 81@



More information about the Bioc-sig-sequencing mailing list