[R-sig-hpc] snow's clusterApplyLB quits when a slave node stops

Stephen Weston stephen.b.weston at gmail.com
Fri May 21 17:45:42 CEST 2010


Dan,

So you don't care what is in the result list returned by clusterApplyLB?
You only care about the files created by the tasks, just as if you were
submitting a bunch of independent batch jobs?  Do you care if a result
file is partially written when it gets killed?  Or is that condition easy to
detect?
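
If partial files do matter, one common way to make them easy to detect is
to write each result under a temporary name and rename it only after the
write has finished.  What follows is a minimal sketch, not a tested recipe:
run_task, save_result_atomically, unfinished_tasks, and the file naming
scheme are all hypothetical names chosen for illustration, and it assumes
the output directory sits on a filesystem where a rename within one
directory is atomic (as on POSIX systems).

```r
# Hypothetical helper: save `value` to `path` so that `path` only ever
# exists as a complete file.
save_result_atomically <- function(value, path) {
  # Write to a temporary file in the same directory, so the rename
  # below stays on one filesystem.
  tmp <- tempfile(pattern = "partial-", tmpdir = dirname(path))
  saveRDS(value, tmp)
  # If the worker is killed before this point, only a "partial-*"
  # file is left behind and `path` does not exist.
  if (!file.rename(tmp, path)) {
    stop("could not rename ", tmp, " to ", path)
  }
  invisible(path)
}

# Hypothetical task body, of the kind that would be passed to clusterApplyLB.
run_task <- function(i, outdir) {
  result <- i^2  # stand-in for the real computation
  save_result_atomically(result, file.path(outdir, sprintf("task-%03d.rds", i)))
}

# Tasks whose final result file is missing have not completed.
unfinished_tasks <- function(ids, outdir) {
  done <- file.exists(file.path(outdir, sprintf("task-%03d.rds", ids)))
  ids[!done]
}
```

With this layout, calling unfinished_tasks(ids, outdir) after a run lists
the task ids that would need to be resubmitted, and any leftover partial-*
files can simply be deleted.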

Also, what snow transport are you using?

- Steve


On Fri, May 21, 2010 at 11:14 AM, Daniel Elliott
<danelliottster at gmail.com> wrote:
> Thanks, Steve.
>
> I know what I want, and it is a good starting point for the other
> solutions you mentioned: just keep sending jobs to the slave nodes
> even when one of them dies.  For me, each job is totally independent
> and the results are saved to a file, so I just want the run to keep
> going.
>
> I would be happy to be a part of any solution...
>
> - dan
>
> On Fri, May 21, 2010 at 8:36 AM, Stephen Weston
> <stephen.b.weston at gmail.com> wrote:
>> Dan,
>>
>> The snow package wasn't designed to be fault tolerant, and so
>> I don't think it is surprising if clusterApplyLB hangs when a
>> job gets killed.  I don't know why you've only started seeing
>> this behavior lately, especially since the version of snow hasn't
>> changed in quite a while.
>>
>> You might want to investigate the snowFT package, which is
>> available on CRAN.  It depends on PVM/rpvm, however, so you
>> can't use MPI with snowFT, for example.
>>
>> You might want to think about what behavior you'd like to see
>> if a job is killed.  Some people want the job to be automatically
>> resubmitted, but maybe you just want an appropriate error
>> reported for that job.  Some people are happy as long as
>> the whole run doesn't hang forever, even if all of the results
>> are lost.  Depending on your needs, you might be able to
>> figure out a solution to the problem.  It could also help you to
>> evaluate whether someone else's proposed solution meets
>> your needs.
>>
>> - Steve
>>
>>
>> On Thu, May 20, 2010 at 10:38 PM, Daniel Elliott
>> <danelliottster at gmail.com> wrote:
>>> Hello,
>>>
>>> I use snow to run a large number of processes on a large number of
>>> computers, many of which are in a lab.  Sometimes my jobs are killed
>>> for various reasons.  Lately, this has caused the clusterApplyLB
>>> function to stop running, which means no additional jobs are run.
>>>
>>> Is there something I can do about this?  I am pretty sure that this
>>> was not happening a few months ago (clusterApplyLB would keep running
>>> jobs even when one of the slaves went down).
>>>
>>> Thanks.
>>>
>>> - dan elliott
>>>
>>> _______________________________________________
>>> R-sig-hpc mailing list
>>> R-sig-hpc at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>>
>>
>


