[Bioc-sig-seq] filtering using solexa quality scores

Cei Abreu-Goodger cei at ebi.ac.uk
Thu Apr 16 00:25:51 CEST 2009


Hi Vincent,

Are you taking into account that quality scores tend to drop off 
towards the end of the run, i.e. in the later bases of each read? I 
would probably restrict any sort of quality filtering to the first x 
bases of each read. In my experience, only a very small fraction of 
reads from a "good" run would be removed for general quality problems. 
Also, if your downstream pipeline is "quality-aware" (e.g. MAQ or bowtie 
for alignments), you can get away with not worrying about read quality 
at the outset. On the other hand, for some kinds of analysis I was 
dropping the quality scores and writing plain FASTA files. In those 
cases it would pay off to convert very low-quality bases to Ns, since I 
would then get better coverage.
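
A minimal Python sketch of the two ideas above: counting low-quality 
calls only within the first x bases, and masking low-quality bases to N 
before writing FASTA. It assumes Solexa-style FASTQ quality strings 
encoded with an ASCII offset of 64. The function names, the -4 
threshold, and the default values are illustrative assumptions, not part 
of any existing package.

```python
# Assumption: Solexa-style FASTQ, quality character = chr(score + 64).
SOLEXA_OFFSET = 64


def solexa_scores(qual, offset=SOLEXA_OFFSET):
    """Decode a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in qual]


def passes_prefix_filter(qual, first_x=25, threshold=-4, max_bad=2):
    """Keep a read if at most max_bad of its first first_x bases
    have a score below threshold. Later bases are ignored."""
    bad = sum(1 for s in solexa_scores(qual)[:first_x] if s < threshold)
    return bad <= max_bad


def mask_low_quality(seq, qual, threshold=-4):
    """Replace every base whose score is below threshold with 'N'."""
    return "".join(
        base if score >= threshold else "N"
        for base, score in zip(seq, solexa_scores(qual))
    )


def to_fasta(name, seq, qual, threshold=-4):
    """Format one masked read as a FASTA record (qualities dropped)."""
    return ">{}\n{}\n".format(name, mask_low_quality(seq, qual, threshold))
```

With a quality-aware aligner downstream, only the prefix filter (or no 
filter at all) may be needed. The masking step matters mainly when the 
qualities are about to be discarded.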

Cheers,

Cei

Vincent Carey wrote:
> i have scoured our archives and found little regarding role of solexa
> quality scores as reported in fastq outputs in short read filtering.
>
> my understanding is that a numerical score of -4 or greater indicates more
> probability mass on the called base than on any other.  in checking 1e6
> reads on each of two lanes i found the frequency of the event "fewer than
> three bases have score less than -4" to be 4e-3 in one lane and 2e-3 in
> another.  in other words, filtering by requiring no more than two < -4
> scores would take you from a million reads to about 2000-4000, assuming i
> have not taken a biased sample (i may have, just took the first 1e6 in
> fastq).
>
> is there any reason to regard a call with score < -4 to be much different
> from an 'N'?
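
For reference, the -4 threshold in the quoted message can be checked 
numerically. The sketch below assumes the usual Solexa definition, 
Q = -10 * log10(p / (1 - p)) with p the probability that the call is 
wrong, and additionally assumes that the error mass is spread evenly 
over the three other bases. The function names are illustrative.

```python
def solexa_error_prob(q):
    """Error probability p implied by a Solexa score q,
    from q = -10 * log10(p / (1 - p))."""
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


def called_base_prob(q):
    """Probability that the called base is correct."""
    return 1.0 - solexa_error_prob(q)


def called_base_is_most_likely(q):
    """True if the called base outweighs each alternative, assuming
    the error probability is split evenly across the other three."""
    return called_base_prob(q) > solexa_error_prob(q) / 3.0


for q in (-5, -4, 0, 10):
    print(q, round(called_base_prob(q), 3), called_base_is_most_likely(q))
```

Under these assumptions a score of -4 leaves roughly 0.28 of the 
probability on the called base, just above the 0.25 break-even point, 
while -5 falls just below it. A call scoring below -4 therefore carries 
very little information about which base was present.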


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.