[R-sig-hpc] Rmpi: mpi.close.Rslaves() 'hangs'

Marius Hofert marius.hofert at uwaterloo.ca
Thu Sep 28 08:48:24 CEST 2017


Hi Ei-ji,

Thanks for your help.

You lost me a bit... Here is what I got:

1) I can confirm that I have Open MPI 2.1.1 (mpirun --version), so
that is most likely the source of the problem (the MPI version
probably changed since I last used Rmpi, when I moved to different
hardware).

2) As I understand it, you suggest using Rmpi's mpi.comm.free(comm)
instead of mpi.comm.disconnect(comm). I thus adapted
mpi.close.Rslaves() (which 'hangs') to always call mpi.comm.free().
More precisely, I defined mpi.close.Rslave2(), which differs from
mpi.close.Rslaves() only in its last part:

    if (comm > 0) {
        ## Changed (as it 'hangs' in openmpi-2.x):
        ## if (is.loaded("mpi_comm_disconnect"))
        ##     mpi.comm.disconnect(comm)
        ## else
        mpi.comm.free(comm)
    }
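For what it's worth, always calling mpi.comm.free() and the
PMIX_NAMESPACE check from your patch (quoted below) could be combined
roughly as follows. This is only a sketch: launched_under_pmix() and
free_or_disconnect() are hypothetical helper names, not part of Rmpi,
and it assumes that a non-empty PMIX_NAMESPACE reliably indicates a
PMIx launch.

```r
## Hypothetical helper (not part of Rmpi): TRUE if the process appears to
## have been launched under PMIx, judged by a non-empty PMIX_NAMESPACE.
launched_under_pmix <- function(ns = Sys.getenv("PMIX_NAMESPACE")) {
    nzchar(ns)
}

## Hypothetical helper (not part of Rmpi): release a communicator, avoiding
## mpi.comm.disconnect() under PMIx. Rmpi must be installed when comm > 0.
free_or_disconnect <- function(comm, pmix = launched_under_pmix()) {
    if (comm > 0) {
        if (!pmix && is.loaded("mpi_comm_disconnect"))
            Rmpi::mpi.comm.disconnect(comm)
        else
            Rmpi::mpi.comm.free(comm)
    }
    invisible(NULL)
}
```

The last part of mpi.close.Rslave2() would then reduce to
free_or_disconnect(comm). This only restates the branching logic; it
has not been checked against Open MPI 2.x.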

If I execute the minimal working example with this new
mpi.close.Rslave2() at the end, something strange happens: *While*
doing the computation, 'htop' doesn't show the two cores separately,
but *after* executing it, the two cores show up and I need to manually
'kill -9 <PID>' the corresponding processes.

Any ideas?

As a user (not the maintainer), I don't think there's much I can do. I
feel this needs to be addressed by the maintainer of Rmpi (as opposed
to me hacking into mpi.close.Rslaves()). I contacted the maintainer of
Rmpi again, but still no response.

Thanks & cheers,
Marius


On Thu, Sep 28, 2017 at 6:55 AM, Ei-ji Nakama <nakama at ki.rim.or.jp> wrote:
> Hi,
>
> With openmpi-2.x, the same problem occurs on Linux.
> # There is no problem with openmpi 1.6 and openmpi 1.10
>
> $ orte-ps
> ...
> $ echo "bt" | gdb -p <PID>
> Looping in MPI_Comm_disconnect...
>
> $ mkdir -p ~/.openmpi ; echo pmix_base_verbose=100 >> ~/.openmpi/mca-params.conf
> Debug information can be obtained by setting the above and executing
> the script...
> <<snip : debug result is long>>
> I looked into it (but only a little).
>
> When PMIX is used, the following environment variables are set.
>> grep("^PMIX",names(Sys.getenv()),value=TRUE)
> [1] "PMIX_DEBUG"         "PMIX_NAMESPACE"     "PMIX_RANK"
> [4] "PMIX_SECURITY_MODE" "PMIX_SERVER_URI"
>
> Well, as an alternative, there is MPI_Comm_free, so when PMIX is in
> use it seems better to call MPI_Comm_free instead of
> MPI_Comm_disconnect.
>
> diff -ruN Rmpi.orig/R/Rparutilities.R Rmpi/R/Rparutilities.R
> --- Rmpi.orig/R/Rparutilities.R    2016-05-31 23:12:53.000000000 +0900
> +++ Rmpi/R/Rparutilities.R    2017-09-28 12:41:50.545396494 +0900
> @@ -332,8 +332,12 @@
>      }
>  #     mpi.barrier(comm)
>      if (comm >0){
> -        if (is.loaded("mpi_comm_disconnect"))
> -            mpi.comm.disconnect(comm)
> +        if (is.loaded("mpi_comm_disconnect")){
> +            if (Sys.getenv("PMIX_NAMESPACE")=="")
> +                mpi.comm.disconnect(comm)
> +            else
> +                mpi.comm.free(comm)
> +        }
>          else
>              mpi.comm.free(comm)
>      }
> diff -ruN Rmpi.orig/inst/Rslaves.sh Rmpi/inst/Rslaves.sh
> --- Rmpi.orig/inst/Rslaves.sh    2012-09-05 01:17:59.000000000 +0900
> +++ Rmpi/inst/Rslaves.sh    2017-09-27 15:07:05.205719837 +0900
> @@ -14,7 +14,7 @@
>
>  if  [ "$3" = "needlog" ]; then
>      hn=`hostname -s`
> -    $R_HOME/bin/R --no-init-file --slave --no-save -f  $1 > $hn.$2.$$.log 2>&1
> +    exec $R_HOME/bin/R --no-init-file --slave --no-save -f  $1 > $hn.$2.$$.log 2>&1
>  else
> -    $R_HOME/bin/R --no-init-file --slave --no-save -f  $1 > /dev/null 2>&1
> +    exec $R_HOME/bin/R --no-init-file --slave --no-save -f  $1 > /dev/null 2>&1
>  fi
> diff -ruN Rmpi.orig/inst/slavedaemon.R Rmpi/inst/slavedaemon.R
> --- Rmpi.orig/inst/slavedaemon.R    2013-02-23 13:07:54.000000000 +0900
> +++ Rmpi/inst/slavedaemon.R    2017-09-28 11:45:19.598288064 +0900
> @@ -16,6 +16,9 @@
>  repeat
>      try(eval(mpi.bcast.cmd(rank=0,comm=.comm, nonblock=.nonblock, sleep=.sleep),envir=.GlobalEnv),TRUE)
>  print("Done")
> -invisible(mpi.comm.disconnect(.comm))
> +if(Sys.getenv("PMIX_NAMESPACE")=="")
> +    invisible(mpi.comm.disconnect(.comm))
> +else
> +    invisible(mpi.comm.free(.comm))
>  invisible(mpi.comm.set.errhandler(0))
>  mpi.quit()
>
> Best Regards,
> --