[R-sig-hpc] "chunking" parallel tasks

Stephen Weston stephen.b.weston at gmail.com
Tue Jan 26 16:48:03 CET 2010


In parallel computing, "chunking" is used to bundle multiple tasks into a
single message, since one large message can usually be sent much more
efficiently than many small ones.  That means that in systems like "snow"
and "foreach", if your tasks execute very quickly, most of the time may be
spent moving data around rather than doing useful work.  By bundling many
tasks together, you may be able to make the communication efficient enough
that you get a benefit from doing the tasks in parallel.  However, if you
have short tasks and large inputs and/or outputs, chunking won't really
help you, since it reduces the number of messages but not the amount of
data being sent.  In that case you need to figure out some way to decrease
the amount of data that is being moved around.
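As a minimal sketch of the idea in base R (the chunk count and the toy
sqrt() "work" are just placeholders; in practice each chunk would be
shipped to a worker via parLapply, foreach, etc.):

```r
# Split 10000 small tasks into 10 chunks, so that each worker message
# would carry 1000 tasks instead of one.
n.tasks  <- 10000
n.chunks <- 10
chunks <- split(seq_len(n.tasks),
                cut(seq_len(n.tasks), n.chunks, labels = FALSE))

# Each chunk is processed as a single "task"; results are flattened back
# into one vector in the original order.
results <- unlist(lapply(chunks, function(idx) sqrt(idx)),
                  use.names = FALSE)
```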

The nws package supports "chunking" via the "chunkSize" element of the
"eo" argument to eachElem.  The multicore package supports chunking as an
all-or-nothing setting via the "mc.preschedule" argument to mclapply.  The doMC
package uses the backend-specific "preschedule" option, which it passes on
to mclapply via the "mc.preschedule" argument.  The doMPI package uses
the backend-specific "chunkSize" option to specify any chunk size, much
like nws.
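For example, with multicore's mclapply (now part of the parallel package
in base R; mc.cores > 1 requires a Unix-alike, and the squaring here is
just a stand-in for real work):

```r
library(parallel)  # mclapply moved here from the multicore package

x <- 1:10000
# mc.preschedule = TRUE divides the input into mc.cores chunks up front;
# with FALSE, a separate job is forked for each element instead.
res <- mclapply(x, function(i) i * i,
                mc.preschedule = TRUE, mc.cores = 2)
```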

The iterators and itertools packages contain various functions that create
iterators that allow you to split up data in chunks, so they support "chunking"
in their own way.  That allows you to do manual chunking, as I call it, with
any of the foreach backends.
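A sketch of that manual chunking with isplitVector from itertools (the
sequential backend is registered here only so the example is self-contained;
any foreach backend would work the same way):

```r
library(foreach)
library(itertools)
registerDoSEQ()  # sequential backend, so this runs without a cluster

x <- runif(10000)
# isplitVector yields 10 sub-vectors, so each foreach task handles a
# whole chunk rather than a single element.
res <- foreach(chunk = isplitVector(x, chunks = 10), .combine = c) %dopar%
  sqrt(chunk)
```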

The snow package has some internal functions that split vectors and matrices
into chunks.  They are used in functions such as parMM, parCapply, and
parRapply.
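For instance, parRapply (also available via the parallel package) sends
each worker a chunk of rows rather than one row at a time; the tiny matrix
here is just for illustration:

```r
library(parallel)  # includes snow's cluster functions, e.g. parRapply

cl <- makeCluster(2)
m <- matrix(1:20, nrow = 4)
# The matrix is split internally into row chunks, one per worker.
row.sums <- parRapply(cl, m, sum)
stopCluster(cl)
```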

- Steve


On Tue, Jan 26, 2010 at 9:44 AM, Brian G. Peterson <brian at braverock.com> wrote:
> Mark Kimpel wrote:
>>
>> I have seen references on this list to "chunking" parallel tasks. If I am
>> interpreting this correctly, that is to decrease the overhead of multiple
>> system calls. For instance, if I have a loop of 10000 simple tasks and 10
>> processors, then 10 chunks of 1000 would be processed.
>>
>> Which of the parallel packages has the ability to take "chunk" (or its
>> equivalent) as an argument? I've googled chunk with R and come up with
>> everything but what I'm interested in.
>>
>
> Google "nesting foreach loops"
>
> The foreach package will do what you want.  Steve Weston has posted some
> examples to this list on this topic as well.
>
> Regards,
>
>  - Brian
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>