[Bioc-sig-seq] filter fastQ based on quality score?

Wed Oct 28 21:56:17 CET 2009

Jennifer Wernegreen <jwernegreen at mbl.edu> writes:

> Hello,
>
> I'm hoping that someone can offer guidance about filtering reads from
> fastQ files based on quality score.
>
> Using Shortread (or another program in Bioconductor), is it possible
> to filter a fastQ file to remove reads that have an average quality
> score below a certain value, and then output the remaining reads as a
> fastQ file?   If so, could one apply this filter to paired-end data?
> For example, it would be helpful to remove a read pair when either or
> both individual reads in that pair fall below the quality threshold.
>
> I see the defined filters in Shortread (strandFilter, nFilter, etc),
> but I'm not sure how to create a custom filter based on average
> quality score.  My modest start on the problem and a few lines from
> our fastQ files are below.

Hi Jen --

I did this to get some sample data

> example(readFastq)
> rfq
class: ShortReadQ
length: 256 reads; width: 36 cycles

The first thing is that the fastq files contain no information about
whether the scores are encoded on the 'Solexa' or standard Phred
scale, I'm guessing you'll need to convert

  rfq <- initialize(rfq, quality=as(quality(rfq), "SFastqQuality"))

You can create a matrix of scores, and then calculate the mean across
rows (maybe a median is more apporiate?)

  score <- rowMeans(as(quality(rfq), "matrix"))

(there is also alphabetScore, which doesn't require that all reads are
the same width; score <- alphabetScore(quality(rfq)) / width(rfq)).

The paired end part is more complicated, but you'd probably parse the
id's into an 'id' and a 'pair' part, along the lines of (untested)

  ids <- as.character(id(rfq))
  df <- data.frame(
           id=sub("/[[:digit:]]$", "", ids),
           pair=as.integer(sub(".*/([[:digit:]])$", "\\1", ids)),
           score=score,
           stringsAsFactors=FALSE)

The 'either or both (above) threshold' might be

  ok <- with(df, tapply(score, id, function(score, thresh) {
      all(score > thres)
  }, thresh=20))

'ok' has one element per id, but you need to make sure these are
ordered correctly, perhaps

  ok[df$id]

The final bit of the puzzle is to create a filter, as

  filt <- srFilter(function(x) {
    score <- rowMeans(as(quality(x), "matrix"))
    ids <- as.character(id(x))
    df <- data.frame(
             id=sub("/[[:digit:]]$", "", ids),
             score=score,
             stringsAsFactors=FALSE)
    ok <- with(df, tapply(score, id, function(score, thresh) {
        all(score > thresh)
    }, thresh=20))
    ok[df$id]
  }, name="HighQualityPairs")

and then

  rfq[filt(rfq)]

a slight variant would make a 'factory' to produce filters with
different thresholds (or other parameters)

highQualityPairsFilter <-function(threshold=20) {
  srFilter(function(x) {
    score <- rowMeans(as(quality(x), "matrix"))
    ids <- as.character(id(x))
    df <- data.frame(
             id=sub("/[[:digit:]]$", "", ids),
             score=score,
             stringsAsFactors=FALSE)
    ok <- with(df, tapply(score, id, function(score, thresh) {
        all(score > thresh)
    }, thresh=threshold)) # this line is different
    ok[df$id]
  }, name="HighQualityPairs")
}

and then

> rfq[highQualityPairsFilter(20)(rfq)]
class: ShortReadQ
length: 249 reads; width: 36 cycles
> rfq[highQualityPairsFilter(25)(rfq)]
class: ShortReadQ
length: 104 reads; width: 36 cycles

Martin

> Any hints would be very much appreciated.
>
> Jen
>
>
>
> library(ShortRead)
> reads <- readFastq("~/", pattern="query.fastq")
>     # import reads from fastq file into ShortReadQ Object
>     # replace "~/" with your path and query.fastq with filename
>
> #### ?? how to filter out (remove) reads if average quality score <25?
> (for example) ####
> #### ?? how to apply this filter to paired-end data ####
>
> writeFastq(filteredReads, "output.fastq")
> # export filtered reads to fastq file
>
>
>
> A couple of sample lines from the two fastq files...  We have about
> 7.5 million reads per file.  ("ANNACG" notes the adaptor for this
> library, which oddly had a consistent N in the sequence.)
>
>
> paired end file 1
>
> @HWW-T7400-009_1_120_1879_1134#ANNACG/1
> TGTTGTAACTCTTTAATTACCCACNGAGNNNNATTCATAATAATTGCTNNNNNNNNNNANTTATATTAAAACTGATATCATACAAAAATAAATAATCATT
> +
> AB?B>A>@@BABBC??@BABB>3=&561&&&&4 at BBBB>BB%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> @HWW-T7400-009_1_120_1879_1950#ANNACG/1
> ACTATTATGTCTATTCGAGAGCATNATCNNNNGTACTGGATTATCGATNNNNNNNNNNGNGACATCAATTTTAGTAAACAGTTTCCATTCCTTACTTTAT
> +
> ABB9BBABBABB<BABB?B8>A?8&9A;&&&&9=A>BB@@A%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
>
>
>
> paired end file 2
>
> @HWW-T7400-009_1_120_1879_1134#ANNACG/2
> GCGCAGTATATAATTTCCATTGTTTTTCTCAAGTATAACATACTGTATGATATTGTCGGATTAAAATAAAAATAAAATATTAATGATTATTTATTTTTGT
> +
> ?CCCBBABA?BBCBCC=B?CAC at B87ABBC?7>B=C at 5?@@4=BBBCBC9BBCB?AABB;=AA@;>B<5@?
> @?6-?BBBBB>=BC???B=-AB?9=9ABB
> @HWW-T7400-009_1_120_1879_1950#ANNACG/2
> TCCTCGACATGCACTGAATAATTCATTGTGGATAGGATCATATATTACACCTATTTCAATATTTTTGTTTTGTTGTACAGCAATAGATAAAGTAAGGAAT
> +
> BBA@@B4<3BB?7=?=<=B7ABBB>BA at A@A1;BBA?AA<<=@>==?9B?>B@>A@=0=6*A at B:89)>?
> 5B?>B@>=@BB>4B2A=-53>B*&80 at 81@
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793