[BioC] how to deal with a 30G fastq file
Martin Morgan
mtmorgan at fhcrc.org
Thu Oct 6 05:39:12 CEST 2011
On 10/05/2011 08:04 PM, wang peter wrote:
> It is too slow to read them into memory.
> Can anyone tell me whether I need to split them with another program,
> or call an R function to split them?
ShortRead::FastqSampler streams through the entire file but returns only
a random subset of records (often faster than reading in all the data).
ShortRead::FastqStreamer (in development) iterates over the file in
chunks --
fq <- FastqStreamer(<...>)            ## <...>: path to the fastq file
while (length(res <- yield(fq))) {
    ## work on 'res', e.g., filter
}
close(fq)
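For FastqSampler, a minimal sketch (the file name "foo.fastq" and the
sample size n=1e6 are assumptions):

library(ShortRead)
sampler <- FastqSampler("foo.fastq", n=1e6)
fq <- yield(sampler)    ## a ShortReadQ with (up to) n sampled records
close(sampler)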
A cheap hack is to force R to allocate a large amount of memory up
front, and then run the operation
replicate(10, raw(1e9)) ## that's a lot (~10 GB)
dna <- readFastq(...)
The 'withIds=FALSE' argument to readFastq can save a lot of time if ids
are not necessary.
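For example (the file name is an assumption):

dna <- readFastq("foo.fastq", withIds=FALSE)  ## skip parsing read ids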
If the records are all 4 lines long, it is very easy to split a file
(untested code; the Linux pros would use awk for efficient processing;
check out StackOverflow / Biostar)
fl <- file("foo.fastq", "r")
idx <- 0
repeat {
    ## n must be a multiple of 4 so records are not split across files
    recs <- readLines(fl, n=1000000)
    if (length(recs) == 0)
        break
    writeLines(recs, sprintf("fout-%d.fastq", idx))
    idx <- idx + 1
}
close(fl)
Once split, on Linux / Mac use library(multicore) or library(parallel)
(R-2.14 or later) and
mclapply(seq_len(idx) - 1L, function(i) {
    fq <- readFastq(sprintf("fout-%d.fastq", i))  ## files are 0-indexed
    ## work, then...
    TRUE
})
to process the chunks in parallel (it doesn't make sense to read them in
parallel and then try to return the full data back to a 'master';
return small summaries instead).
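As a sketch of that pattern, each worker might return only a small
summary (here a read count; the file names follow the splitting loop
above, and the core count is an assumption):

library(parallel)
library(ShortRead)
counts <- mclapply(seq_len(idx) - 1L, function(i) {
    fq <- readFastq(sprintf("fout-%d.fastq", i))
    length(fq)          ## small result returned to the master
}, mc.cores=4)          ## assumed number of available cores
total <- sum(unlist(counts))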
Martin
>
> thx
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793