[R-sig-hpc] snow's clusterApplyLB quits when a slave node stops

Daniel Elliott danelliottster at gmail.com
Fri May 21 18:19:17 CEST 2010


A (partially written) file would be somewhat annoying, but otherwise,
no, I do not care what happens to an unfinished job.  I detect these
while collecting the results from the thousands of resulting files.

To give you an idea of how crude my SNOW usage is: for years I got by
with a script that spawned more scripts on each machine (one for each
CPU I wanted to use) that read from a common file containing a list of
shell commands.  However, SNOW is much nicer to work with.
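For what it's worth, that old scheme was roughly the following.  This is
a simplified sketch, not the actual script; the command file, lock file,
and example jobs are placeholders, and flock(1) here is the util-linux
tool:

```shell
# Simplified sketch of the pre-SNOW scheme (placeholders throughout).
# Several copies of this loop run at once, one per CPU; each pops the
# next shell command off a shared list, under a lock, and runs it.
CMDS=commands.txt
LOCK=commands.lock
printf 'echo job1 > out1.txt\necho job2 > out2.txt\n' > "$CMDS"  # example job list
: > "$LOCK"
while :; do
  # Under the lock: read the first command, then drop it from the list.
  cmd=$(flock "$LOCK" sh -c "
    head -n 1 '$CMDS'
    tail -n +2 '$CMDS' > '$CMDS.tmp' && mv '$CMDS.tmp' '$CMDS'")
  [ -z "$cmd" ] && break      # empty list: this worker is done
  sh -c "$cmd"
done
```

A worker that dies mid-run just stops pulling commands; the others keep
draining the list, which is the behavior I would like from clusterApplyLB.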

Many of the people I work with use SNOW in this way.  Not so much HPC
as running repeated experiments on hundreds of multi-core
workstations.  So, while I am at it: R's limit of around 127 open
socket connections is also an irritation.

I am using sockets via makeSOCKcluster.
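Roughly, the pattern looks like this (a simplified sketch, not my
actual code; run_one, the output directory, and the host list are
placeholders).  Writing each result to a temp file and renaming it only
at the end also makes a partially written file from a killed job easy
to spot, since it never gets its final name:

```r
## Simplified sketch of the setup (placeholders throughout).  Each task
## writes its own result file; renaming the temp file last means a
## killed job leaves no finished-looking result behind.
run_one <- function(i, outdir = ".") {
  res <- sum(rnorm(10))                       # stand-in for the real experiment
  tmp <- tempfile(tmpdir = outdir)
  saveRDS(res, tmp)
  file.rename(tmp, file.path(outdir, sprintf("result_%04d.rds", i)))
}

if (requireNamespace("snow", quietly = TRUE)) {
  library(snow)
  cl <- makeSOCKcluster(rep("localhost", 2))  # one entry per CPU, per host
  clusterExport(cl, "run_one")
  clusterApplyLB(cl, 1:8, run_one, outdir = getwd())
  stopCluster(cl)
}
```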

Thank you.

- dan

On Fri, May 21, 2010 at 10:45 AM, Stephen Weston
<stephen.b.weston at gmail.com> wrote:
> Dan,
>
> So you don't care what is in the result list returned by clusterApplyLB?
> You only care about the files created by the tasks, just as if you were
> submitting a bunch of independent batch jobs?  Do you care if a result
> file is partially written when it gets killed?  Or is that condition easy to
> detect?
>
> Also, what snow transport are you using?
>
> - Steve
>
>
> On Fri, May 21, 2010 at 11:14 AM, Daniel Elliott
> <danelliottster at gmail.com> wrote:
>> Thanks, Steve.
>>
>> I know what I want and it is a good starting point for the other
>> solutions you mentioned: just keep sending jobs to the slave nodes
>> even when one of them dies.  For me, each job is totally independent
>> and the results are saved to a file so I just want the thing to keep
>> going.
>>
>> I would be happy to be a part of any solution...
>>
>> - dan
>>
>> On Fri, May 21, 2010 at 8:36 AM, Stephen Weston
>> <stephen.b.weston at gmail.com> wrote:
>>> Dan,
>>>
>>> The snow package wasn't designed to be fault tolerant, and so
>>> I don't think it is surprising if clusterApplyLB hangs when a
>>> job gets killed.  I don't know why you've only started seeing
>>> this behavior lately, especially since the version of snow hasn't
>>> changed in quite a while.
>>>
>>> You might want to investigate the snowFT package, which is
>>> available on CRAN.  It depends on PVM/rpvm, however, so you
>>> can't use MPI with snowFT, for example.
>>>
>>> You might want to think about what behavior you'd like to see
>>> if a job is killed.  Some people want the job to be automatically
>>> resubmitted, but maybe you just want an appropriate error
>>> reported for that job.  Some people are happy as long as
>>> the whole run doesn't hang forever, even if all of the results
>>> are lost.  Depending on your needs, you might be able to
>>> figure out a solution to the problem.  It could also help you to
>>> evaluate whether someone else's proposed solution meets
>>> your needs.
>>>
>>> - Steve
>>>
>>>
>>> On Thu, May 20, 2010 at 10:38 PM, Daniel Elliott
>>> <danelliottster at gmail.com> wrote:
>>>> Hello,
>>>>
>>>> I use SNOW to run a large number of processes on a large number of
>>>> computers, many of which are in a lab.  Sometimes my jobs are killed
>>>> for various reasons.  Lately, this has caused the clusterApplyLB
>>>> function to stop running which means no additional jobs are run.
>>>>
>>>> Is there something I can do about this?  I am pretty sure that this
>>>> was not happening a few months ago (clusterApplyLB would keep running
>>>> jobs even when one of the slaves went down).
>>>>
>>>> Thanks.
>>>>
>>>> - dan elliott
>>>>
>>>> _______________________________________________
>>>> R-sig-hpc mailing list
>>>> R-sig-hpc at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>>>>
>>>
>>
>


