[R-sig-hpc] simple question on R/Rmpi/snow/slurm configuration
Whit Armstrong
armstrong.whit at gmail.com
Wed Jan 7 16:30:44 CET 2009
Thanks to everyone for helping me sort out these issues. I finally
have our cluster up and running on all my nodes.
Per Dirk's suggestion, below is a short checklist for anyone setting
up a slurm/Rmpi/snow cluster.
1) ensure that UIDs and GIDs are identical across all nodes.
We are using Windows authentication on our Linux servers, so we had to
remove the local slurm and munge UIDs and GIDs from /etc/passwd and
create Windows users and groups for slurm and munge to ensure
consistency across all nodes. Alternatively, you can copy /etc/passwd
to all the remote nodes, but that is a bit of a maintenance nightmare.
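A quick way to check this is to compare the IDs from one place;
something like the loop below works (the host names are just
placeholders for your own controller and compute nodes):

    # compare the slurm and munge accounts on every node
    for host in controller node01 node02; do
        echo "== $host =="
        ssh "$host" 'id slurm; id munge'
    done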
2) make sure all your nodes have the same munge.key.
See "Creating a Secret Key" on this page:
http://home.gna.org/munge/install_guide.html
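Roughly, following that guide (the key path and ownership below assume
a stock MUNGE install and may differ on your distribution):

    # generate a 1024-byte key on one node ...
    sudo dd if=/dev/urandom bs=1 count=1024 of=/etc/munge/munge.key
    sudo chown munge:munge /etc/munge/munge.key
    sudo chmod 0400 /etc/munge/munge.key
    # ... then copy /etc/munge/munge.key to every other node, keep the
    # same ownership/permissions, and restart munged everywhere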
3) make sure all nodes have the same slurm.key and slurm.conf.
See: "Create OpenSSL keys" on this page:
https://computing.llnl.gov/linux/slurm/quickstart_admin.html
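The key pair only has to be generated once; the paths below are just
examples and have to match the JobCredentialPrivateKey and
JobCredentialPublicCertificate entries in slurm.conf:

    sudo openssl genrsa -out /etc/slurm/slurm.key 1024
    sudo openssl rsa -in /etc/slurm/slurm.key -pubout -out /etc/slurm/slurm.cert
    # then distribute slurm.key, slurm.cert and slurm.conf to every node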
4) make sure you can ssh to the compute nodes with no password.
Here is a good site:
http://wiki.freaks-unidos.net/ssh%20without%20password
Our setup has /home mounted on all nodes, so just storing the keys in
/home/username/.ssh works. If remote nodes do not have /home mounted,
then you will need a different setup. This must be done separately for
all users who will use the cluster.
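With /home shared it boils down to something like this, run once per
user (node01 is a placeholder):

    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa     # empty passphrase
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
    ssh node01 hostname                          # should not prompt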
5) try very hard to use the same Linux distribution across all nodes.
Unfortunately, that is not the case for us: our main server is RHEL5,
and all our compute nodes are Ubuntu. I had to compile and install
Open MPI by hand on the Red Hat server (as I was very unhappy with the
packaged version). My issue yesterday was that orterun was installed
in /usr/local/bin on the controller node (Red Hat) but in /usr/bin on
the compute nodes (Ubuntu). Open MPI seems to assume that orterun is
in the same location on all machines, which resulted in the following
error in slurmd.log:
[Jan 05 14:05:00] [57.0] execve(): /usr/local/bin/orterun: No such
file or directory
Recompiling Open MPI on the RHEL server so that the orterun binary
ends up in the same location as on the compute nodes finally fixed the
problem.
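Something like the following helps catch this; the host names are
placeholders, and the --prefix is only an example of pinning the
install location to match where the packaged nodes put it:

    # check that orterun resolves to the same path everywhere
    for host in controller node01 node02; do
        echo -n "$host: "; ssh "$host" 'which orterun'
    done
    # when building Open MPI from source, match the packaged location
    ./configure --prefix=/usr && make && sudo make install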
6) in addition to rebooting nodes, also use "sudo scontrol reconfigure"
to make sure that slurm.conf is reloaded on the compute nodes.
We kept getting jobs stuck in the completing state due to a UID/GID
problem, which showed the following error:
[Dec 31 12:58:22] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[Dec 31 12:58:22] debug: _rpc_terminate_job, uid = 11667
[Dec 31 12:58:22] error: Security violation: kill_job(2) from uid 11667
This problem was finally resolved by rebooting all the compute nodes
and running "sudo scontrol reconfigure" on all of them.
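For reference, the reload/verify step looks roughly like this after
pushing the new slurm.conf out to the nodes:

    sudo scontrol reconfigure     # ask slurmctld/slurmd to re-read slurm.conf
    scontrol show config | head   # confirm the running configuration
    squeue -t COMPLETING          # any jobs still stuck in CG?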
7) verify each component independently. Per Dirk: basic MPI with one
of the hello-world examples, then Rmpi, then snow, then slurm.
This allowed me to find the ssh problem with MPI, since slurm/munge
are happy to authenticate with their shared keys rather than using
ssh.
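Roughly, the sequence looks like this (hello_c.c ships with the
Open MPI examples; the process counts and the salloc invocation are
just illustrative):

    # 1. plain MPI
    mpicc hello_c.c -o hello_c && mpirun -np 4 ./hello_c
    # 2. Rmpi on its own
    Rscript -e 'library(Rmpi); mpi.spawn.Rslaves(nslaves=4); print(mpi.remote.exec(paste(Sys.info()["nodename"]))); mpi.close.Rslaves(); mpi.quit()'
    # 3. snow on top of Rmpi
    Rscript -e 'library(snow); cl <- makeMPIcluster(4); print(clusterCall(cl, function() Sys.info()["nodename"])); stopCluster(cl)'
    # 4. the same hello-world under slurm
    salloc -n 4 mpirun ./hello_c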
I hope this checklist can serve as a useful guide for anyone who faces
the harrowing task of setting up a cluster. Now that the hard part is
done, we are seeing close to linear speedups on our simulations, so the
end result is worth the pain.
The next chore for me is node maintenance. Dirk has suggested dsh
(dancer's shell):
http://www.netfort.gr.jp/~dancer/software/dsh.html.en and Moe at LLNL
has suggested pdsh: https://sourceforge.net/projects/pdsh/. If anyone
has additional suggestions, I would love to hear them.
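Both make it easy to run one command across all nodes; for example
(the host range and the dsh machine list are placeholders):

    # pdsh with an explicit host range
    pdsh -w 'node[01-08]' uname -r
    # dsh, using the machine list configured under /etc/dsh/
    dsh -a -c -- uname -r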
Cheers,
Whit