[Bioc-sig-seq] Read big fastq files in chunks

Martin Morgan mtmorgan at fhcrc.org
Tue Sep 27 14:33:19 CEST 2011


On 09/27/2011 02:39 AM, Anita Lerch wrote:
> Hi,
> I should filter a big fastq file (HiSeq ca.>30'000'000, which does not
> fit in memory) for low qualities e.g. to many N in the sequence.
>
> I plan to read the file in chunks, filter the ShortReads and write them
> into a new file.
> This is no problem for fasta files, see the example:
>
> filename<- "data/DmS2DRSC_RNA_seqs_subsample.fa"
> eof<- FALSE
> append<- FALSE
> cycle<- 1L
> while(!eof){
>    chunk<- readFasta(filename, nrec=nrec, skip=(cycle-1)*nrec)
>    nFilt<- nFilter(5)
>    writeFasta(chunk[nFilt(chunk)],
>               sprintf("%s/filtered_%s",
>                       dirname(filename), basename(filename)),
>               append=append)
>    append<- TRUE
>    cycle<- cycle + 1
>    if(length(chunk) == 0L)
>        eof<- TRUE
> }
>
> If I try to do the same for fastq file e.g.
>
> filename<- "data/test_data.fq"
> eof<- FALSE
> mode<- "w"
> cycle<- 1L
> while(!eof){
>    chunk<- readFastq(filename)
>    nFilt<- nFilter(5)
>    writeFastq(chunk[nFilt(chunk)],
>               sprintf("%s/filtered_%s",
>                       dirname(filename), basename(filename)),
>               mode=mode)
>    mode<- "a"
>    cycle<- cycle + 1
>    if(length(chunk) == 0L)
>        eof<- TRUE
>    }
>
> I facing some problems when I read the ShortReads:
> - read.DNAStringSet ignores the quality sequence.
> - readFastq can't read in chunks
> - FastqSampler choose a random sample of ShortRead
>
> Of course I can do a workaround e.g writing my own fastq parser, but I
> prefer a clean solution or to solve the problem.
> So my questions, is there a clean solution I'm not aware of it?
> If I'm right and there is this problem, what is the plan of the
> community to solve this problem or is this already work in progress?
> Can I help to solve this problem e.g. rewriting the readFastq()
> function.

Hi Anita -- this needs to be added to ShortRead, and I can do so later 
this week. But if you'd like to help, I would like to create a reference 
class

   .FastqStreamer <- setRefClass("FastqStreamer", <...>)

a constructor

   FastqStreamer <- function(con, n=1e6..., verbose=FALSE)

and a method

   yield,FastqStreamer-method

so that streaming through a file looks like

   f <- FastqStreamer(con, n=1e6, verbose=FALSE)
   while (length(srq <- yield(f))) {
        ## work
   }

I think .FastqStreamer would share some infrastructure with 
.FastqSampler, and would take the same approach as in 
R/methods-Sampler.R -- treat the file at the R level as a connection 
that is input with readBin() so that R's connection interface can be 
used, do some R-level buffering, and translate the raw() to ShortReadQ 
objects in C (like src/sampler.c:sampler_rec_parser). Probably there is 
some code re-factoring and a little reference class hierarchy .FastqFile 
as parent of .FastqStreamer and .FastqSampler.

Martin

>
> Thanks for the answer.
> Greetings
> Anita
>
>> sessionInfo()
> R Under development (unstable) (2011-09-20 r57033)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8
>   [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.11.37    latticeExtra_0.6-18  RColorBrewer_1.0-5   Rsamtools_1.5.65
> [5] lattice_0.19-33      Biostrings_2.21.9    GenomicRanges_1.5.47 IRanges_1.11.26
>
> loaded via a namespace (and not attached):
>   [1] Biobase_2.13.9       BiocInstaller_1.1.27 BSgenome_1.21.3      grid_2.14.0
>   [5] hwriter_1.3          QuasR_0.1.1          RCurl_1.6-10         rtracklayer_1.13.13
>   [9] tools_2.14.0         XML_3.4-3            zlibbioc_0.1.7
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioc-sig-sequencing mailing list