[BioC] how to deal with a 30G fastq file
Martin Morgan
mtmorgan at fhcrc.org
Thu Oct 6 21:29:01 CEST 2011
Hi Steve --
On 10/06/2011 06:48 AM, Steve Lianoglou wrote:
> Hi Martin,
>
> Just wanted to say:
>
> On Wed, Oct 5, 2011 at 11:39 PM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
>
>> fq = FastqStreamer(<...>)
>> while (length(res <- yield(fq)))
>> # work, e.g., filter
>
> That's really cool!
Anita Lerch suggested and helped to implement this.
> Then some navel gazing:
>
> Have you thought about "inverting" this flow? Like, run the while loop
> in "C-land" but pass an R expression/block/something in and have it be
> evaluated within each iteration of the C/while loop?
>
> I'm guessing calling an R function from within C code is costly, but
> "while" loops in R are also slow (compared to while loops in C), so I
> wonder which would win in the long run.
Rsamtools::applyPileups does this. In some ways it's like lapply(<obj>,
FUN), where the user provides FUN and applyPileups does work at the C
level to prepare data for FUN.
FUN plays the same role as the '# work' step in the loop: in both cases
the user expects to manipulate R objects using R code. For this reason
both are efficient only when they operate on vectors, hence on chunks
(e.g., millions of records) of the fastq or bam file. So yield() and
applyPileups() have a similar task -- efficiently create a chunk of data
to be processed, then pass that to the user. Since they're both function
calls, they are both free to create those objects in R or C as
appropriate.
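To make the chunk-at-a-time idea concrete, here is a minimal base-R sketch
of a yield-style loop. It is not how ShortRead implements FastqStreamer;
make_streamer() and yield_chunk() are hypothetical names, and the sketch
assumes every fastq record occupies exactly four lines.

```r
# Hypothetical base-R stand-in for a streamer; not ShortRead's implementation.
# Assumes well-formed input where each record is exactly 4 lines.
make_streamer <- function(con, n) {
  function() {
    lines <- readLines(con, n = 4L * n)     # read up to n records
    if (length(lines) == 0L) return(character(0))
    lines[seq(2L, length(lines), by = 4L)]  # keep only the sequence lines
  }
}

# Toy in-memory "file" with three records
fastq <- c("@r1", "ACGT", "+", "IIII",
           "@r2", "GGCC", "+", "IIII",
           "@r3", "TTAA", "+", "IIII")
con <- textConnection(fastq)
yield_chunk <- make_streamer(con, n = 2L)

n_records <- 0L
while (length(chunk <- yield_chunk())) {
  # vectorised work on the whole chunk goes here
  n_records <- n_records + length(chunk)
}
close(con)
```

The loop body runs once per chunk rather than once per record, so the cost
of the R-level while loop is spread over however many records fit in a chunk.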
The big difference is really in how the results of the iteration or the
apply are aggregated. yield() relies on the user to do something
('aggregate by writing to a file', or 'pre-allocate a result vector and
fill in with each iteration') whereas applyPileups returns a list, with
each element the result of FUN. If there were clear aggregation
strategies then the apply-style approach might have additional advantages.
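The two aggregation styles can be sketched side by side on toy data. The
chunks and the GC-counting function below are made up for illustration;
only the shape of the result handling is the point.

```r
# Toy chunks standing in for successive yields from a streamer
chunks <- list(c("ACGT", "GGCC"), c("TTAA"))

# Count G and C characters in a character vector of reads
count_gc <- function(chunk) sum(nchar(gsub("[^GC]", "", chunk)))

# Iteration style: the caller pre-allocates a result and fills it in
gc_loop <- numeric(length(chunks))
for (i in seq_along(chunks)) {
  gc_loop[i] <- count_gc(chunks[[i]])
}

# Apply style: the framework collects one list element per chunk
gc_apply <- lapply(chunks, count_gc)
```

In the first form the caller decides how results are combined; in the second
the return value is always a list, which the caller may still need to reduce.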
This is still a bit of a work in progress, so ideas welcome; one might
easily imagine that lapply(FastqStreamer(<...>), FUN, ...) could be
implemented in a straightforward way, for instance.
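As a rough sketch of what such an apply-style wrapper might look like,
assuming only that the streamer exposes some function returning the next
chunk (zero-length when exhausted): stream_lapply() and make_yield() are
hypothetical helpers, not part of ShortRead.

```r
# Hypothetical helper: apply FUN to each chunk and collect results in a list.
stream_lapply <- function(yield_fun, FUN, ...) {
  results <- list()
  while (length(chunk <- yield_fun())) {
    results[[length(results) + 1L]] <- FUN(chunk, ...)
  }
  results
}

# Toy yield function handing out elements of x, n at a time
make_yield <- function(x, n) {
  pos <- 0L
  function() {
    if (pos >= length(x)) return(x[0])   # zero-length signals the end
    idx <- seq(pos + 1L, min(pos + n, length(x)))
    pos <<- max(idx)
    x[idx]
  }
}

widths <- stream_lapply(make_yield(c("ACGT", "GG", "TTAAC"), 2L), nchar)
```

Growing a list one element per chunk is acceptable here because the number
of chunks is small relative to the number of records.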
Martin
> Just curious -- sorry if I missed some previous discussion on this topic.
>
> Anyway, like I said -- this is really cool already.
>
> Thanks,
>
> -steve
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793