[Bioc-devel] Proof-of-concept parallel preloading FastqStreamer

Martin Morgan mtmorgan at fhcrc.org
Wed Oct 2 21:44:10 CEST 2013


On 10/02/2013 11:58 AM, Gregoire Pau wrote:
> Hello Ryan,
>
> You may be interested in the function sclapply(...) located in the
> HTSeqGenie package. sclapply is a multicore dispatcher that accepts 3 main
> arguments (inext, fun, max.parallel.jobs). The data produced by the
> function inext, executed in the main thread, is dispatched to fun(),
> executed in a child thread. A built-in scheduler controls the maximum
> number of threads.
>
> In HTSeqGenie, inext(...) is typically an iterator that reads chunks of FastQ
> reads, which are passed to a function processing the FastQ reads (for
> counting, QC, alignment...) in a child thread. sclapply(...) enables
> multicore processing of iterator flows and offers performance gains almost
> proportional to the number of cores. Moreover, the function is robust and
> has extra arguments to handle exceptions and periodic tracing (e.g.
> to check memory usage).
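
For concreteness, a minimal sketch of the pattern Greg describes: an inext()
iterator yielding FastQ chunks in the main process, with fun() applied to each
chunk in a child. The argument names (inext, fun, max.parallel.jobs) come from
the description above; the exact sclapply() signature, the NULL end-of-input
convention, and the toy per-chunk QC are assumptions.

    library(ShortRead)     ## FastqStreamer(), yield()
    library(HTSeqGenie)    ## sclapply()

    strm <- FastqStreamer("reads.fastq.gz", n = 1e6)   ## hypothetical input

    ## inext: runs in the main process and returns the next chunk of reads;
    ## returning NULL at end of input is an assumed convention
    inext <- function() {
        chunk <- yield(strm)
        if (length(chunk) == 0L) NULL else chunk
    }

    ## fun: runs in a child process on each chunk (counting, QC, ...)
    fun <- function(chunk)
        alphabetFrequency(sread(chunk), collapse = TRUE)

    res <- sclapply(inext, fun, max.parallel.jobs = 4)
    close(strm)
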
>
> Hope this can help,

I'd like to incorporate these ideas (distilling Ryan's and Greg's) into
BiocParallel, as bpiterate or maybe bpstream (though I think 'stream' in the
literature carries a notion of indeterminate length, which isn't quite accurate
here). Let me know (on or off list) if that's not OK.
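
A rough sketch of how such an interface might look: an iterator producing
chunks, plus a function applied to each chunk on the workers. The name
bpiterate is the suggestion above; the signature, the NULL end-of-input
convention, and fastqIterator()/countReads() below are hypothetical, not an
existing BiocParallel API.

    library(BiocParallel)
    library(ShortRead)

    ## Iterator factory: each call to the returned function yields the next
    ## chunk of reads, or NULL once the stream is exhausted.
    fastqIterator <- function(fl, n = 1e6) {
        strm <- FastqStreamer(fl, n = n)
        function() {
            chunk <- yield(strm)
            if (length(chunk) == 0L) NULL else chunk
        }
    }

    countReads <- function(chunk, ...) length(chunk)   ## toy per-chunk FUN

    ## Imagined usage (hypothetical signature):
    ## res <- bpiterate(fastqIterator("reads.fastq.gz"), countReads,
    ##                  BPPARAM = MulticoreParam(4))
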

It would be interesting to see support for other back-ends.

And to come up with a consistent error-handling model, incorporating the work 
Michel recently completed (not yet in BiocParallel) as part of GSoC:

   https://github.com/Bioconductor/BiocParallel/pull/19

Martin

>
> Cheers,
>
> Greg
>
>
> On Mon, Sep 30, 2013 at 5:00 PM, Ryan <rct at thompsonclan.org> wrote:
>
>> Hi all,
>>
>> I have previously written an Rscript to read, filter, and write large
>> fastq files using FastqStreamer to read. Through some complicated tricks, I
>> was able to get the input to happen in parallel with the processing and
>> output (using parallel::mcparallel and friends). In other words, while my
>> script was processing and writing out the nth block of reads, another
>> process was reading the (n+1)th block of reads at the same time. This
>> almost doubled the speed of my script (the server had sufficient I/O
>> bandwidth to parallelize reads and writes to disk). Since then, I've been
>> wanting to generalize this pattern, and I have just now made a working
>> proof of concept. It is a wrapper for FastqStreamer that runs in a separate
>> process and uses parallel:::sendMaster to send each block to the main
>> script, and then calls yield on the FastqStreamer to preload the next block
>> while the script is processing the current one. You can view and download
>> the script here:
>>
>> https://gist.github.com/DarwinAwardWinner/6771922
>>
>> I have strategically placed print statements in the code in order to
>> demonstrate that preloading is happening. For example, I get the following
>> when I run the script on my machine:
>>
>> CHILD: Preloaded 1 yields.
>> CHILD: Sent 1 yields.
>> CHILD: Preloaded 2 yields.
>> CHILD: Sent 2 yields.
>> MAIN: Received 1 yields.
>> MAIN: Processing reads
>> CHILD: Preloaded 3 yields.
>> MAIN: Processed 1 yields.
>> CHILD: Sent 3 yields.
>> MAIN: Received 2 yields.
>> MAIN: Processing reads
>> CHILD: Preloaded 4 yields.
>> MAIN: Processed 2 yields.
>> CHILD: Sent 4 yields.
>> MAIN: Received 3 yields.
>> MAIN: Processing reads
>> CHILD: Preloaded 5 yields.
>> MAIN: Processed 3 yields.
>> CHILD: Sent 5 yields.
>> MAIN: Received 4 yields.
>> MAIN: Processing reads
>> CHILD: Preloaded 6 yields.
>> MAIN: Processed 4 yields.
>> CHILD: Sent 6 yields.
>> MAIN: Received 5 yields.
>> MAIN: Processing reads
>> MAIN: Processed 5 yields.
>> MAIN: Received 6 yields.
>> MAIN: Processing reads
>> MAIN: Processed 6 yields.
>>
>> In the script, the child is reading the fastq file, and the main process
>> is doing the "calculation" (which is just a sleep). As you can see, the
>> child is always a step or two ahead of the main script, so that whenever
>> the main script asks for the next yield, it gets it immediately instead of
>> waiting for the child to read from the disk.
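
Condensed to its essentials, that pattern looks roughly like the sketch below.
This is not the gist itself: FastqStreamer, yield(), mcparallel and
parallel:::sendMaster are named in the description above, while the
master-side parallel:::readChild()/unserialize() calls, mccollect(), and the
zero-length-chunk end-of-stream convention are assumptions; sendMaster() and
readChild() are internal, "experts-only" parts of the parallel package.

    library(ShortRead)
    library(parallel)

    fl <- "reads.fastq.gz"   ## hypothetical input

    ## Reader process: forked once; it streams the file, ships each chunk to
    ## the master as soon as it has been read, then reads ahead.
    reader <- mcparallel({
        strm <- FastqStreamer(fl, n = 1e6)
        repeat {
            chunk <- yield(strm)
            parallel:::sendMaster(chunk)   ## internal API, as in the gist
            if (length(chunk) == 0L) break
        }
        close(strm)
    })

    ## Master process: consume chunks as they arrive; while a chunk is being
    ## processed here, the reader is already yield()ing the next one.
    repeat {
        msg <- parallel:::readChild(reader)   ## raw vector, or pid on exit
        if (!is.raw(msg)) break
        chunk <- unserialize(msg)
        if (length(chunk) == 0L) break
        ## ... filter / QC / write 'chunk' ...
    }
    mccollect(reader)   ## reap the reader process
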
>>
>> So, is this kind of feature appropriate for inclusion in Bioconductor?
>>
>> -Ryan Thompson
>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


