[R-sig-hpc] "chunking" parallel tasks

Saptarshi Guha saptarshi.guha at gmail.com
Tue Jan 26 19:32:46 CET 2010


One other approach, where the computation per chunk runs into several
(tens of) minutes, is to monitor the running time of long-running tasks
(each working on a chunk); if a task exceeds a cutoff, split its chunk
and assign the pieces to idle (or less loaded) machines. If the task for
a particular chunk finishes before a duplicate of it (or vice versa),
invalidate the slower one and kill it.
Of course, the run time for a chunk should be much greater than the cost
of duplicating a chunk, reading it in, and starting new tasks. To
implement this, one would have to write a system that actually monitors
task running times and handles the chunking and duplication.
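
To make the idea concrete, here is a rough sketch in R using the
'parallel' package (fork-based, so Unix-alikes only); the worker
function f, the cutoff, and the split-in-two rule are all hypothetical
placeholders:

    library(parallel)

    run_speculatively <- function(chunk, f, cutoff = 600, poll = 5) {
      job <- mcparallel(f(chunk))            # launch the original task
      start <- Sys.time()
      repeat {
        res <- mccollect(job, wait = FALSE)  # non-blocking progress check
        if (!is.null(res)) return(res[[1]])  # finished within the cutoff
        if (difftime(Sys.time(), start, units = "secs") > cutoff) {
          tools::pskill(job$pid)             # kill the straggler ...
          mccollect(job, wait = FALSE)       # ... and reap it
          halves <- split(chunk, rep(1:2, length.out = length(chunk)))
          return(unlist(mclapply(halves, f), recursive = FALSE))
        }
        Sys.sleep(poll)
      }
    }

A real system would track many tasks at once, place the duplicates on
other machines, and keep whichever copy of a chunk finishes first.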

I think Hadoop MapReduce does something similar (it calls this
speculative execution), though for some tasks it is most certainly not
the best tool.

Regards
Saptarshi

On Tue, Jan 26, 2010 at 12:54 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 01/26/2010 09:51 AM, Norm Matloff wrote:
>> There are some very sophisticated strategies for chunking out there in
>> the parallel processing world, aimed at dealing with the load balancing
>> problem.  The idea is to use very large chunks at the beginning and
>> middle stages of the computation, to minimize communication overhead and
>> the like, but then switch to smaller ones near the end in order to keep
>> all the processes busy.  For example, the "guided" option in OpenMP does
>> this.
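
For illustration, guided scheduling hands out chunks of roughly
(remaining tasks)/p each time, so chunk sizes shrink geometrically
toward the end; a small R sketch of the schedule (exact sizes in OpenMP
are implementation-defined, and n, p here are made up):

    guided_chunks <- function(n, p, min_chunk = 1) {
      sizes <- integer(0)
      remaining <- n
      while (remaining > 0) {
        s <- min(remaining, max(min_chunk, ceiling(remaining / p)))
        sizes <- c(sizes, s)
        remaining <- remaining - s
      }
      sizes
    }
    guided_chunks(100, 4)
    ## 25 19 14 11  8  6  5  3  3  2  1  1  1  1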
>>
>> However, in my experience, it is seldom necessary to resort to this, as
>> load balancing is typically not a problem.  One can even show this
>> theoretically:  Say task times T[1], T[2], etc. are i.i.d. with mean m
>> and standard deviation s.  The standard deviation of sum(T[1:k]) is
>> s*sqrt(k) while its mean is k*m, so their ratio, (s/m)/sqrt(k), goes
>> to 0 as k goes to infinity--i.e. the sum is essentially constant in
>> relative terms.  Of course, that
>> is an idealized model, but again in practice I have found that load
>> balancing is not much of an issue.
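
The 1/sqrt(k) behaviour is easy to check numerically; a quick simulation
with made-up exponential task times:

    set.seed(1)
    cv_of_sum <- function(k, reps = 2000) {
      sums <- replicate(reps, sum(rexp(k)))  # k i.i.d. task times
      sd(sums) / mean(sums)                  # coefficient of variation
    }
    sapply(c(10, 100, 1000), cv_of_sum)
    ## roughly 0.32, 0.10, 0.03 -- shrinking like 1/sqrt(k)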
>>
>> For that reason and because of the communication overhead, in most cases
>> it is actually faster to statically assign 1/p of the tasks to each of
>> the p processes, i.e. not do chunking.
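
In R, static assignment can be done by pre-splitting the task indices
into p blocks with parallel::splitIndices; a minimal sketch with a
stand-in per-task function:

    library(parallel)
    p <- 4
    blocks <- splitIndices(20, p)   # one block of indices per process
    res <- mclapply(blocks,
                    function(ix) sapply(ix, sqrt),  # placeholder work
                    mc.cores = p)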
>
> Agreed, and for the matrix operations Mark hinted at, one likely gets
> the most benefit by using R's vectorized matrix operations (rather than
> *apply), then by using vectorized BLAS/LAPACK libraries, then multicore.
> Distributing tasks across nodes in a cluster (e.g., via MPI) for matrix
> operations is almost always a losing proposition -- communication costs
> are just too high.
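
As a small illustration of the first point, compare a row-wise *apply
with its vectorized equivalent:

    m <- matrix(rnorm(1e6), nrow = 1000)
    r1 <- apply(m, 1, sum)  # looped over rows
    r2 <- rowSums(m)        # vectorized, typically far faster
    all.equal(r1, r2)       # TRUE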
>
> Martin
>>
>> Norm Matloff
>>
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>


