[Bioc-sig-seq] Read big fastq files in chunks

Tue Sep 27 16:17:21 CEST 2011

On Tue, 2011-09-27 at 05:33 -0700, Martin Morgan wrote:
> On 09/27/2011 02:39 AM, Anita Lerch wrote:
> > Hi,
> > I should filter a big fastq file (HiSeq ca.>30'000'000, which does not
> > fit in memory) for low qualities e.g. to many N in the sequence.
> >
> > I plan to read the file in chunks, filter the ShortReads and write them
> > into a new file.
> > This is no problem for fasta files, see the example:
> >
> > filename<- "data/DmS2DRSC_RNA_seqs_subsample.fa"
> > eof<- FALSE
> > append<- FALSE
> > cycle<- 1L
> > while(!eof){
> >    chunk<- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec)
> >    nFilt<- nFilter(5)
> >    writeFasta(chunk[nFilt(chunk)],
> >               sprintf("%s/filtered_%s",
> >                       dirname(filename), basename(filename)),
> >               append=append)
> >    append<- TRUE
> >    cycle<- cycle + 1
> >    if(length(chunk) == 0L)
> >        eof<- TRUE
> > }
> >
> > If I try to do the same for fastq file e.g.
> >
> > filename<- "data/test_data.fq"
> > eof<- FALSE
> > mode<- "w"
> > cycle<- 1L
> > while(!eof){
> >    chunk<- readFastq(filename)
> >    nFilt<- nFilter(5)
> >    writeFastq(chunk[nFilt(chunk)],
> >               sprintf("%s/filtered_%s",
> >                       dirname(filename), basename(filename)),
> >               mode=mode)
> >    mode<- "a"
> >    cycle<- cycle + 1
> >    if(length(chunk) == 0L)
> >        eof<- TRUE
> >    }
> >
> > I facing some problems when I read the ShortReads:
> > - read.DNAStringSet ignores the quality sequence.
> > - readFastq can't read in chunks
> > - FastqSampler choose a random sample of ShortRead
> >
> > Of course I can do a workaround e.g writing my own fastq parser, but I
> > prefer a clean solution or to solve the problem.
> > So my questions, is there a clean solution I'm not aware of it?
> > If I'm right and there is this problem, what is the plan of the
> > community to solve this problem or is this already work in progress?
> > Can I help to solve this problem e.g. rewriting the readFastq()
> > function.
> 
> Hi Anita -- this needs to be added to ShortRead, and I can do so later 
> this week. But if you'd like to help, I would like to create a reference 
> class
> 
>    .FastqStreamer <- setRefClass("FastqStreamer", <...>)
> 
> a constructor
> 
>    FastqStreamer <- function(con, n=1e6..., verbose=FALSE)
> 
> and a method
> 
>    yield,FastqStreamer-method
> 
> so that streaming through a file looks like
> 
>    f <- FastqStreamer(con, n=1e6, verbose=FALSE)
>    while (length(srq <- yield(f))) {
>         ## work
>    }
> 
> I think .FastqStreamer would share some infrastructure with 
> .FastqSampler, and would take the same approach as in 
> R/methods-Sampler.R -- treat the file at the R level as a connection 
> that is input with readBin() so that R's connection interface can be 
> used, do some R-level buffering, and translate the raw() to ShortReadQ 
> objects in C (like src/sampler.c:sampler_rec_parser). Probably there is 
> some code re-factoring and a little reference class hierarchy .FastqFile 
> as parent of .FastqStreamer and .FastqSampler.

Great. I think that is the right way.
I know the reference classes, but I don't have any working experience
with them. I start with implementing and keep you informed.

Anita

-- 
Anita Lerch
Friedrich Miescher Institute
Maulbeerstrasse 66
WRO-1066.P22
4058 Basel
Phone: +41 (0)61 697 5172
Email: anita.lerch at fmi.ch