[Bioc-devel] Parallel processing of reads in a single fastq file

Valerie Obenchain vobencha at fhcrc.org
Tue Aug 19 18:53:37 CEST 2014


Hi,

bpiterate() has been added to BiocParallel 0.99.11. The current 
implementation is based on sclapply() from HTSeqGenie and is supported 
for the multi-core environment only. Support for other back-ends is in 
progress.

For the current implementation, iterating over multiple files can be 
done by distributing the files over a snow cluster with bplapply(), then 
using each cluster node as a master to call bpiterate(). There is an 
example on the man page; a rough sketch of the pattern follows.
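
Roughly, with placeholder file names and a placeholder per-chunk FUN, 
the pattern looks like this:

library(BiocParallel)

processFile <- function(file) {
    library(ShortRead)
    library(BiocParallel)
    fqstream <- FastqStreamer(file, n=1e6)
    on.exit(close(fqstream))
    # ITER returns NULL when the stream is exhausted
    ITER <- function() {
        fq <- yield(fqstream)
        if (length(fq) == 0) NULL else fq
    }
    # placeholder per-chunk computation
    FUN <- function(fq) length(fq)
    # multi-core iteration within this node
    bpiterate(ITER, FUN, BPPARAM=MulticoreParam(4))
}

files <- c("sample1.fastq", "sample2.fastq")
# one snow worker per file; each node iterates over its own file
res <- bplapply(files, processFile, BPPARAM=SnowParam(length(files)))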

Maintenance of HTSeqGenie has been passed from Greg to Jens Reeder (cc'd 
on message). Jens, the one difference between bpiterate() and sclapply() 
is the absence of the trace function. Instead of hard-coding it, we want 
to add a BPTRACE arg that allows tracing/debugging for any BiocParallel 
function. This should be added over the next week.


Valerie

On 08/06/2014 12:18 PM, Valerie Obenchain wrote:
> Hi Jeff,
>
> Thanks for the prompt. It looks like bpiterate or bpstream was intended
> but didn't quite make it into BiocParallel. I'll discuss with Martin to
> see if I'm missing other history / past discussions and then add it in.
> Ryan had some ideas for parallel streaming we discussed at Bioc2014 so
> this is timely. Both concepts can be revisited and implemented in some
> form.
>
>
> Greg,
>
> Just wanted to confirm it's ok with you that we put an iteration of
> sclapply in BiocParallel?
>
>
> Valerie
>
> On 08/06/2014 07:16 AM, Johnston, Jeffrey wrote:
>> Hi,
>>
>> I have been using FastqStreamer() and yield() to process a large fastq
>> file in chunks, modifying both the read and name and then appending
>> the output to a new fastq file as each chunk is processed. This works
>> well, but would benefit greatly from being parallelized.
>>
>> As far as I can tell, this problem isn’t easily solved with the
>> existing parallel tools because you can’t determine how many jobs
>> you’ll need in advance (you just call yield() until it stops returning
>> reads).
>>
>> After some digging, I found the sclapply() function in the HTSeqGenie
>> package by Gregoire Pau, which he describes as a “multicore dispatcher”:
>>
>> https://stat.ethz.ch/pipermail/bioc-devel/2013-October/004754.html
>>
>> I wasn’t able to get the package to install from source due to some
>> dependencies (there are no binaries for Mac), but I did extract the
>> function and adapt it slightly for my use case. Here’s an example:
>>
>> library(ShortRead)  # provides FastqStreamer() and yield()
>>
>> processChunk <- function(fq_chunk) {
>>    # manipulate fastq reads here
>> }
>>
>> # sclapply() calls this repeatedly; NULL signals the end of the stream
>> yieldHelper <- function() {
>>    fq <- yield(fqstream)
>>    if (length(fq) == 0) return(NULL)
>>    fq
>> }
>>
>> fqstream <- FastqStreamer("…", n=1e6)
>> # sclapply() extracted from HTSeqGenie as described above
>> sclapply(yieldHelper, processChunk, max.parallel.jobs=4)
>> close(fqstream)
>>
>> Based on the discussion linked above, it seems like there was some
>> interest in integrating this idea into BiocParallel. I would find that
>> very useful as it improves performance quite a bit and can likely be
>> applied to numerous stream-based processing tasks.
>>
>> I will point out that in my case above, the processChunk() function
>> doesn’t return anything. Instead it appends the modified fastq records
>> to a new file. I have to use the Unix lockfile command to ensure that
>> only one child process appends to the output file at a time. I am not
>> certain if there is a more elegant solution to this (perhaps a queue
>> that is emptied by a dedicated writer process?).
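>>
>> Here is a rough sketch of that dedicated-writer idea (file names are
>> placeholders, and processChunk() here would return the modified reads
>> rather than write them): chunks are processed in parallel, but only
>> the master process appends to the output file, so no lockfile is
>> needed.
>>
>> library(ShortRead)
>> library(parallel)
>>
>> fqstream <- FastqStreamer("input.fastq", n=1e6)
>> repeat {
>>    # the master reads up to 4 chunks serially
>>    chunks <- list()
>>    for (i in 1:4) {
>>      fq <- yield(fqstream)
>>      if (length(fq) == 0) break
>>      chunks[[i]] <- fq
>>    }
>>    if (length(chunks) == 0) break
>>    # process the batch in parallel
>>    results <- mclapply(chunks, processChunk, mc.cores=4)
>>    # only the master writes, one result at a time
>>    for (res in results)
>>      writeFastq(res, "output.fastq", mode="a", compress=FALSE)
>> }
>> close(fqstream)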
>>
>> Thanks,
>> Jeff
>>
>
>


