[BioC] how to deal with a 30G fastq file
Martin Morgan
mtmorgan at fhcrc.org
Thu Oct 6 05:39:12 CEST 2011
On 10/05/2011 08:04 PM, wang peter wrote:
> It is too slow to read them into memory.
> Can anyone tell me whether I need to split them with another program,
> or call an R function to split them?
ShortRead::FastqSampler streams through the entire file but returns only
a random subset of records (often faster than reading in all the data).
ShortRead::FastqStreamer (in development) iterates over the file in
chunks --
fq <- FastqStreamer(<...>)            ## <...>: path to the fastq file
while (length(res <- yield(fq))) {
    ## work on 'res', e.g., filter
}
close(fq)
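For FastqSampler, a minimal sketch (the file name "foo.fastq" and the
sample size n=1e6 are assumptions):

library(ShortRead)
sampler <- FastqSampler("foo.fastq", n=1e6)
fq <- yield(sampler)    ## a ShortReadQ with (up to) n sampled records
close(sampler)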
A cheap hack is to force R to allocate a large amount of memory up
front, and then run the operation
replicate(10, raw(1e9)) ## that's a lot (~10 GB)
dna <- readFastq(...)
The 'withIds=FALSE' argument to readFastq can save a lot of time if ids
are not necessary.
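For example (the file name is an assumption):

dna <- readFastq("foo.fastq", withIds=FALSE)  ## skip parsing read ids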
If the records are all 4 lines long, it is very easy to split a file
(untested code; the Linux pros would use awk for efficient processing;
check out StackOverflow / Biostar)
fl <- file("foo.fastq", "r")
idx <- 0
repeat {
    ## n must be a multiple of 4 so records are not split across files
    recs <- readLines(fl, n=1000000)
    if (length(recs) == 0)
        break
    writeLines(recs, sprintf("fout-%d.fastq", idx))
    idx <- idx + 1
}
close(fl)
Once split, on Linux / Mac use library(multicore) or library(parallel)
(R-2.14 or later) and
mclapply(seq_len(idx) - 1L, function(i) {
    fq <- readFastq(sprintf("fout-%d.fastq", i))  ## files are 0-indexed
    ## work, then...
    TRUE
})
to process the chunks in parallel (it doesn't make sense to read them in
parallel and then try to return the full data back to a 'master';
return small summaries instead).
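As a sketch of that pattern, each worker might return only a small
summary (here a read count; the file names follow the splitting loop
above, and the core count is an assumption):

library(parallel)
library(ShortRead)
counts <- mclapply(seq_len(idx) - 1L, function(i) {
    fq <- readFastq(sprintf("fout-%d.fastq", i))
    length(fq)          ## small result returned to the master
}, mc.cores=4)          ## assumed number of available cores
total <- sum(unlist(counts))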
Martin
>
> thx
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793