[R] snow's makeCluster hanging (using Rmpi)

Luke Tierney luke at stat.uiowa.edu
Wed Nov 8 16:46:00 CET 2006


Looks like the daemon side is OK but not the Rmpi client side. Might
be worth checking whether the Rmpi packages were built against the
same version of the lam libraries--that might affect default settings for
RPI.  Otherwise, assuming the error message is correct, something would
seem to be causing differences in RPI settings. usysv seems an odd
choice but maybe it makes sense.  You might be able to use either
environment variables, settings in the hosts file, or arguments to
lamboot to restrict the available rpi modules and so force a common
one, but I'm not sure about this.
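
As a rough, untested sketch of the environment-variable route: it assumes
LAM's convention of reading SSI parameters from variables named
LAM_MPI_SSI_<parameter>, and it assumes the tcp module was built on every
node (neither has been checked against this setup).

```shell
# Assumption: LAM/MPI picks up SSI parameters from environment variables
# named LAM_MPI_SSI_<parameter>; "tcp" is assumed to exist on all nodes.
LAM_MPI_SSI_rpi=tcp
export LAM_MPI_SSI_rpi

# List the RPI modules this LAM installation reports, if laminfo is on the PATH.
if command -v laminfo >/dev/null 2>&1; then
    laminfo | grep -i rpi || true
fi

# Show what will be requested when R and the slaves start.
echo "rpi request: ${LAM_MPI_SSI_rpi}"
```

The variable would presumably need to be set in the shell that runs
lamboot and starts R, on each machine, so that parent and spawned slaves
see the same request. Comparing the rpi lines from laminfo across the
machines should also show whether they offer the same modules at all.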

You might also try pvm/rpvm and see if that is easier to use.


Best,

luke

On Wed, 8 Nov 2006, Ramon Diaz-Uriarte wrote:

> On Tuesday 07 November 2006 19:28, Randall C Johnson [Contr.] wrote:
>> On 11/7/06 11:28 AM, "Ramon Diaz-Uriarte" <rdiaz at cnio.es> wrote:
>>> On Tuesday 07 November 2006 15:56, Randall C Johnson [Contr.] wrote:
>>>> Hello everyone,
>>>> I've been fiddling around with the snow and Rmpi packages on my new
>>>> Intel Mac, and have run into a few problems. When I make a cluster on my
>>>> machine, both slaves start up just fine, and everything works as
>>>> expected. When I try to make a cluster including another networked
>>>> machine it hangs. I've followed the suggestions at
>>>> http://finzi.psych.upenn.edu/R/Rhelp02a/archive/83086.html and
>>>> http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html but to no avail.
>>>> Everything seems to start up fine using lamboot, but then hangs when
>>>> making the cluster in R. Making a cluster with 2 slaves seems to work
>>>> fine, but if I increase the number (to use the networked machines) it
>>>> hangs again.
>>>>
>>>> I've tried networking to another Mac, and also to a machine running Red
>>>> Hat Linux. Both machines can set up their own local clusters. Does
>>>> anyone have any ideas?
>>>
>>> Dear Randy,
>>>
>>> A few suggestions:
>>>
>>> a) make sure there are no firewalls; I assume this is actually the case,
>>> but anyway;
>>
>> I don't think I have any firewalls running. I checked and they all seem to
>> be disabled...
>>
>
> you can use (under GNU/Linux at least) the command (as root)
>
> iptables -L
>
> If there is no iptables-based firewall you should see something like:
>
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
>
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
>
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
>
> Make sure this is OK on all the machines.
>
>
>>> b) what happens if you lamboot outside R (and create a universe with a
>>> local and a networked machine) and then you do: "lamexec -np 6 hostname"?
>>
>> This prints out the host names of each machine as expected.
>>
>
> OK, so it's not lam itself (so a) is probably unneeded).
>
>
>>> c) are the Rmpi and snow installed in the same directories in the
>>> different machines? are there version differences in Rmpi (or Snow)
>>> between machines?
>>
>> I've installed the same versions, but they are in different directories...
>>
>
> I think I remember that having Rmpi and Snow in different directories tended
> to cause problems. Now, I always place them in the same directory. I think
> that some sh Rmpi script looks for other scripts, and if they are not where
> it expects them, it fails.
>
>
>> I also tried an example per Luke Tierney's suggestion using only Rmpi, and
>> I get the following error when trying to spawn the Rslaves after starting
>> up with lamboot (outside of R). I tried to use laminfo, but I'm not sure
>> what I'm looking for or how to use the information given...
>>
>>> library(Rmpi)
>>> mpi.spawn.Rslaves()
>>
>> ----------------------------------------------------------------------------
>>
>> It seems that [at least] one of the child processes that was started
>> by MPI_Comm_spawn* chose a different RPI than the parent MPI
>> application.  For example, one (of the) child process(es) that
>> differed from the parent is shown below:
>>
>>     Parent application: MPI_Comm_spawn
>>     Child MPI_COMM_WORLD rank usysv (v7.1.0): 0
>>
>> All MPI processes must choose the same RPI module and version when
>> they start.  Check your SSI settings and/or the local environment
>> variables on each node.
>> ----------------------------------------------------------------------------
>> R(26444) malloc: ***  Deallocation of a pointer not malloced: 0x16379a0;
>> This could be a double free(), or free() called with the middle of an
>> allocated block; Try setting environment variable MallocHelp to see tools
>> to help debug
>> Error in mpi.comm.spawn(slave = system.file("Rslaves.sh", package =
>> "Rmpi"),
>>
>>     MPI_Error_string: unclassified
>>
>
> Now that is way over my head. A few things I'd check: Are you mixing 32-bit
> with 64-bit machines? (I've done that in the past, x86 and x86_64, without
> apparent problems, but I've never used Macs for this). Can you try using two
> different machines with the same architecture? What about gcc compilers: are
> you using very different versions on each machine?
>
>
> Best,
>
> R.
>
>
>>> HTH,
>>>
>>> R.
>>>
>>>> Thanks,
>>>> Randy
>>>>
>>>>> sessionInfo()
>>>>
>>>> R version 2.4.0 Patched (2006-10-03 r39576)
>>>> i386-apple-darwin8.8.2
>>>>
>>>> locale:
>>>> C
>>>>
>>>> attached base packages:
>>>> [1] "methods"   "stats"     "graphics"  "grDevices" "utils"     "datasets"
>>>> [7] "base"
>>>>
>>>> other attached packages:
>>>>    Rmpi    snow
>>>> "0.5-3" "0.2-2"
>
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu


