[R-sig-hpc] snow's clusterApplyLB quits when a slave node stops

Stephen Weston stephen.b.weston at gmail.com
Fri May 21 15:36:50 CEST 2010


Dan,

The snow package wasn't designed to be fault tolerant, and so
I don't think it is surprising if clusterApplyLB hangs when a
job gets killed.  I don't know why you've only started seeing
this behavior lately, especially since the version of snow hasn't
changed in quite a while.

You might want to investigate the snowFT package, which is
available on CRAN.  It depends on PVM/rpvm, however, so you
can't use MPI with snowFT, for example.

You might want to think about what behavior you'd like to see
if a job is killed.  Some people want the job to be automatically
resubmitted, but maybe you just want an appropriate error
reported for that job.  Some people are happy as long as
the whole run doesn't hang forever, even if all of the results
are lost.  Once you've decided what you need, you might be able
to work out a solution yourself, and you'll be better placed to
judge whether someone else's proposed solution meets your
needs.
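
As a rough illustration of two of those policies (resubmit the job, or
report an error for just that job), here is a minimal sketch in plain R
that does not use snow at all.  It assumes a failed job shows up as an
ordinary R error.  A worker process that is killed outright raises no
such error, so detecting that case needs something else that isn't
shown here.  The names run_job, max_retries and flaky are made up for
the example.

```r
# Hypothetical wrapper: try a job up to (max_retries + 1) times and, if it
# still fails, return the error condition instead of stopping the whole run.
run_job <- function(fun, task, max_retries = 1) {
  result <- NULL
  for (attempt in seq_len(max_retries + 1)) {
    result <- tryCatch(fun(task), error = function(e) e)
    if (!inherits(result, "error")) break   # success: stop retrying
  }
  result
}

# Toy job that always fails for one particular input.
flaky <- function(x) {
  if (x == 3) stop("job ", x, " failed")
  x^2
}

# Run every task; failures come back as condition objects, not as a crash.
results <- lapply(1:4, function(x) run_job(flaky, x))

# Indices of the jobs that ended in an error.
failed <- vapply(results, inherits, logical(1), what = "error")
which(failed)
```

The remaining results are still available after a failure, and the
master can decide what to do with the failed indices.  Setting
max_retries = 0 gives the "just report the error" behavior.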

- Steve


On Thu, May 20, 2010 at 10:38 PM, Daniel Elliott
<danelliottster at gmail.com> wrote:
> Hello,
>
> I use SNOW to run a large number of processes on a large number of
> computers, many of which are in a lab.  Sometimes my jobs are killed
> for various reasons.  Lately, this has caused the clusterApplyLB
> function to stop running, which means no additional jobs are run.
>
> Is there something I can do about this?  I am pretty sure that this
> was not happening a few months ago (clusterApplyLB would keep running
> jobs even when one of the slaves went down).
>
> Thanks.
>
> - dan elliott