[R-sig-hpc] snow's clusterApplyLB quits when a slave node stops

Daniel Elliott danelliottster at gmail.com
Fri May 21 17:14:37 CEST 2010


Thanks, Steve.

I know what I want, and it would be a good starting point for the other
solutions you mentioned: just keep sending jobs to the remaining slave
nodes even when one of them dies.  For me, each job is totally
independent and its results are saved to a file, so I just want the run
to keep going.
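
To make the behavior I am after concrete, here is a rough sketch.  It is
not snow and not R: it is a language-independent illustration written in
Python, and the names JOB_CODE, run_job and run_all are made up for the
example.  It assumes each job can run in its own child process, so a job
that dies abruptly is recorded as failed while the dispatcher keeps
handing the remaining jobs to whichever worker is free.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical job script (an assumption for this sketch): job 3 exits
# abruptly to simulate a slave dying in the middle of a job.
JOB_CODE = """
import os, sys
job_id = int(sys.argv[1])
if job_id == 3:
    os._exit(9)  # abrupt death, no cleanup
print(job_id * job_id)
"""


def run_job(job_id):
    """Run one job in its own child process and never raise on job death."""
    proc = subprocess.run(
        [sys.executable, "-c", JOB_CODE, str(job_id)],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # The job died or failed: report it and let the run continue.
        return job_id, "failed", proc.returncode
    return job_id, "ok", proc.stdout.strip()


def run_all(job_ids, workers=4):
    """Dispatch jobs to free workers; results come back in job order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, job_ids))


if __name__ == "__main__":
    for job_id, status, detail in run_all(range(6)):
        print(job_id, status, detail)
```

In my real setup each job writes its results to a file rather than to
standard output, so the dispatcher would only need the ok/failed status.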

I would be happy to be a part of any solution...

- dan

On Fri, May 21, 2010 at 8:36 AM, Stephen Weston
<stephen.b.weston at gmail.com> wrote:
> Dan,
>
> The snow package wasn't designed to be fault tolerant, and so
> I don't think it is surprising if clusterApplyLB hangs when a
> job gets killed.  I don't know why you've only started seeing
> this behavior lately, especially since the version of snow hasn't
> changed in quite a while.
>
> You might want to investigate the snowFT package, which is
> available on CRAN.  It depends on PVM/rpvm, however, so you
> can't use MPI with snowFT, for example.
>
> You might want to think about what behavior you'd like to see
> if a job is killed.  Some people want the job to be automatically
> resubmitted, but maybe you just want an appropriate error
> reported for that job.  Some people are happy as long as
> the whole run doesn't hang forever, even if all of the results
> are lost.  Depending on your needs, you might be able to
> figure out a solution to the problem.  It could also help you to
> evaluate whether someone else's proposed solution meets
> your needs.
>
> - Steve
>
>
> On Thu, May 20, 2010 at 10:38 PM, Daniel Elliott
> <danelliottster at gmail.com> wrote:
>> Hello,
>>
>> I use SNOW to run a large number of processes on a large number of
>> computers, many of which are in a lab.  Sometimes my jobs are killed
>> for various reasons.  Lately, this has caused the clusterApplyLB
>> function to stop running, which means no additional jobs are run.
>>
>> Is there something I can do about this?  I am pretty sure that this
>> was not happening a few months ago (clusterApplyLB would keep running
>> jobs even when one of the slaves went down).
>>
>> Thanks.
>>
>> - dan elliott
>>
>
