[R-sig-hpc] I can do mpi.apply but not foreach with doMPI

Seija Sirkiä seija.sirkia at csc.fi
Thu Aug 27 12:00:50 CEST 2015


Great, thank you Stephen! Yes, we use Open MPI. The reason I used the Rprofile way is that it was the only way of doing MPI-related R work that the existing guides here mentioned. That led me to believe it was what I should do, but OK, that's what I need to understand next (and then push for better guides!)

And thank you George as well, you've answered questions I didn't yet know how to formulate :) Very much appreciated.

BR,
Seija S.

----- Original Message -----
From: "Stephen Weston" <stephen.b.weston at gmail.com>
To: "Seija Sirkiä" <seija.sirkia at csc.fi>
Cc: "r-sig-hpc" <r-sig-hpc at r-project.org>
Sent: Wednesday, 26 August, 2015 16:35:14
Subject: Re: [R-sig-hpc] I can do mpi.apply but not foreach with doMPI

Hi Seija,

To use doMPI, you shouldn't start the workers in an Rprofile as
described in the Rmpi documentation since those workers can only be
used by functions in the Rmpi package such as mpi.apply. In other
words, you don't want to see messages like:

  master (rank 0, comm 1) of size 4 is running on: c1
  slave1 (rank 1, comm 1) of size 4 is running on: c1

When using doMPI from an interactive R session, the workers shouldn't
be started until you execute "startMPIcluster()":

 > cl <- startMPIcluster(3)
         3 slaves are spawned successfully. 0 failed.
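
A full interactive session then looks something like this (a minimal
sketch; adjust the worker count to your allocation):

  library(doMPI)             # also attaches foreach
  cl <- startMPIcluster(3)   # spawn 3 workers from this session
  registerDoMPI(cl)          # make %dopar% use the doMPI backend
  foreach(i = 1:3) %dopar% sqrt(i)
  closeCluster(cl)           # shut down the spawned workers
  mpi.quit()                 # finalize MPI before exiting R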

What MPI implementation are you using?  I believe the only reason to
start the workers in an Rprofile is if your MPI implementation doesn't
have spawn support. Open MPI has great spawn support and is able to
spawn workers from an interactive R session.  MPICH2 has spawn
support, but I believe it can only spawn workers if the R session was
started via mpirun, so I think that Open MPI is preferable for use
with Rmpi.

Even if you're using an MPI implementation without spawn support,
doMPI can use workers that are all started by mpirun. However, you do
need spawn support to start workers from an interactive R session with
doMPI.
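
For example (a sketch; "script.R" is a hypothetical file name, and the
exact mpirun options depend on your installation), you could launch one
master and three workers with:

  mpirun -n 4 R --slave -f script.R

where script.R calls startMPIcluster() with no worker count, so it
detects the mpirun-started processes and uses them as the cluster
instead of spawning new ones:

  # script.R (hypothetical)
  library(doMPI)
  cl <- startMPIcluster()  # under mpirun: use the existing processes
  registerDoMPI(cl)
  a <- foreach(i = 1:3) %dopar% system.time(sort(runif(1e7)))
  print(a)
  closeCluster(cl)
  mpi.quit()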

Regards,

Steve Weston


On Wed, Aug 26, 2015 at 4:12 AM, Seija Sirkiä <seija.sirkia at csc.fi> wrote:
> Hi all,
>
> I'm trying to learn to do parallel computing with R and foreach on this cluster of ours but clearly I'm doing something wrong and I can't figure out what.
>
> Briefly, I'm sitting on a Linux cluster, about which the user guide says that the login nodes are based on RHEL 6, while the compute nodes use CentOS 6. Jobs are submitted using SLURM.
>
> So there I go, requesting a short interactive test session using:
> srun -p test -n4 -t 0:15:00 --pty Rmpi --no-save
>
> Here Rmpi is the modified R_home_dir/bin/R script mentioned in the Rprofile file that comes with Rmpi ("This R profile can be used when a cluster does not allow spawning --- Another way is to modify R_home_dir/bin/R by adding...").
>
> When my session starts, I get these messages:
> master (rank 0, comm 1) of size 4 is running on: c1
> slave1 (rank 1, comm 1) of size 4 is running on: c1
> slave2 (rank 2, comm 1) of size 4 is running on: c1
> slave3 (rank 3, comm 1) of size 4 is running on: c1
> before the prompt. Sounds good, and if I check top on the c1 node, I see 3 R processes churning away happily at 100% CPU time, and one not doing much. That's as it should be, as far as I can tell?
>
> If I then run this little test:
>
> funtorun <- function(k) {
>   system.time(sort(runif(1e7)))
> }
>
> system.time(a <- mpi.apply(1:3, funtorun))
> a
>
> b <- a
> system.time(for (i in 1:3) b[[i]] <- system.time(sort(runif(1e7))))
> b
>
> it goes through nicely: the mpi.apply part takes about 2.6 seconds in total, with each of the 3 sorts taking about that same time, while the for-loop takes about 7 seconds in total, with each of the three sorts taking about 2.3 seconds. Nice: that tells me the workers really do work simultaneously when asked correctly.
>
> But if I try this instead:
>
> library(doMPI)
> cl <- startMPIcluster()
> registerDoMPI(cl)
> system.time(a <- foreach(i = 1:3) %dopar% system.time(sort(runif(1e7))))
>
> it just hangs at the foreach line and never gets through; it is only killed at the end of the reserved 15 minutes, or when I scancel the whole job myself. None of the lines gives any errors.
>
> So what am I doing wrong? I have a hunch this has something to do with how my workers are started, since I never get to run the mpirun commands that the doMPI manual speaks of. But despite my efforts at reading the manual and the documentation of startMPIcluster, I haven't figured out what else to try.
>
> Many thanks in advance for your time!
>
> BR,
> Seija Sirkiä
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc


