[R-sig-hpc] simple question on R/Rmpi/snow/slurm configuration
hyu at stats.uwo.ca
Wed Jan 7 21:37:30 CET 2009
Regarding 4), my slurm configuration actually disables this, so users
cannot log in to or execute commands on the remote nodes directly. It
seems slurm/munge take care of authentication and remote execution.
PS: in /etc/pam.d/common-auth, the following line was added:
account required /lib/security/pam_slurm.so
Whit Armstrong wrote:
> Thanks to everyone for helping me sort out these issues. I finally
> have our cluster up and running on all my nodes.
> Per Dirk's suggestion, below is a short checklist for anyone setting
> up a slurm/Rmpi/snow cluster.
> 1) ensure that UIDs and GIDs are identical across all nodes.
> We are using Windows authentication on our Linux servers, so we had to
> remove all the local slurm and munge UIDs and GIDs from /etc/passwd
> and create Windows users and groups for slurm and munge to ensure
> consistency across all nodes. Alternatively, you can copy
> /etc/passwd to all the remote nodes, but that is a bit of a
> maintenance nightmare.
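> A quick consistency check from the controller might look like this
> (untested sketch; the node names are only examples):
>
>   for h in node01 node02 node03; do
>     echo "== $h"; ssh $h 'id slurm; id munge'
>   done
>
> The uid/gid fields that id prints should be identical on every node.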
> 2) make sure all your nodes have the same munge.key.
> See, "Creating a Secret Key" on this page:
> 3) make sure all nodes have the same slurm.key and slurm.conf.
> See: "Create OpenSSL keys" on this page:
> 4) make sure you can ssh to the compute nodes with no password.
> Here is a good site:
> Our setup has /home mounted on all nodes, so just storing the keys in
> /home/username/.ssh works. If remote nodes do not have /home mounted,
> then you will need a different setup. This must be done separately for
> all users who will use the cluster.
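> With a shared /home it boils down to something like this (a sketch,
> assuming RSA keys):
>
>   ssh-keygen -t rsa                  # accept defaults, empty passphrase
>   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
>   chmod 600 ~/.ssh/authorized_keys
>   ssh node01 hostname                # should not prompt for a password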
> 5) try very hard to use the same Linux distribution across all nodes.
> Unfortunately, for us, this is not the case. Our main server is
> RHEL5, and all our nodes are Ubuntu. I had to compile and install
> Open MPI by hand on the Red Hat server (as I was very unhappy
> with their packaged version). My issue yesterday was due to orterun
> being installed in /usr/local/bin on the controller node (Red Hat) and
> in /usr/bin on the compute nodes (Ubuntu). Open MPI seems to
> assume that orterun is in the same location on all machines, which
> resulted in the following error in slurmd.log:
> [Jan 05 14:05:00] [57.0] execve(): /usr/local/bin/orterun: No such
> file or directory
> Recompiling Open MPI on the RHEL server and making sure the orterun
> binary is installed in the same location as on the compute nodes
> finally fixed the problem.
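> To check that the launcher lives in the same place everywhere, and to
> rebuild with a matching prefix if it does not (the prefix below is
> only an example):
>
>   for h in node01 node02; do echo "== $h"; ssh $h 'which orterun'; done
>   ./configure --prefix=/usr && make && sudo make install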
> 6) in addition to rebooting nodes, also use "sudo scontrol reconfigure"
> to make sure that the slurm.conf file is reloaded on the compute nodes.
> We kept getting jobs stuck in the completing state due to a uid/gid
> problem, which showed the following error:
> [Dec 31 12:58:22] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [Dec 31 12:58:22] debug: _rpc_terminate_job, uid = 11667
> [Dec 31 12:58:22] error: Security violation: kill_job(2) from uid 11667
> This problem was finally resolved by rebooting all the compute nodes
> and running sudo scontrol reconfigure on all the nodes.
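> For example, after editing slurm.conf on the controller (paths and
> node names are assumptions):
>
>   for h in node01 node02; do scp /etc/slurm/slurm.conf $h:/etc/slurm/; done
>   sudo scontrol reconfigure   # ask the slurm daemons to re-read slurm.conf
>   sinfo                       # nodes should return to idle, not completing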
> 7) verify each component independently. Per Dirk: test basic MPI with
> one of the hello world examples, then Rmpi, then snow, then slurm.
> This allowed me to find the ssh problem with MPI, since slurm/munge
> are happy to authenticate with their shared keys rather than using ssh.
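> A smoke test along these lines should exercise each layer in turn
> (untested sketch; mpi_hello stands in for whatever hello-world binary
> you built):
>
>   # 1. plain MPI
>   mpirun -np 4 ./mpi_hello
>   # 2. Rmpi
>   R --no-save -e 'library(Rmpi); mpi.spawn.Rslaves(nslaves=4)' \
>               -e 'mpi.remote.exec(mpi.get.processor.name())' \
>               -e 'mpi.close.Rslaves(); mpi.quit()'
>   # 3. snow on top of Rmpi
>   R --no-save -e 'library(snow); cl <- makeMPIcluster(4)' \
>               -e 'print(clusterCall(cl, function() Sys.info()["nodename"]))' \
>               -e 'stopCluster(cl)'
>   # 4. then repeat the MPI test under a slurm allocation
>   salloc -n 4 orterun ./mpi_hello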
> I hope this checklist can serve as a useful guide for anyone who faces
> the harrowing task of setting up a cluster. Now that the hard part is
> done we are seeing close to linear speedups on our simulations, so the
> end result is worth the pain.
> The next chore for me is node maintenance. Dirk has suggested dsh
> (dancer's shell):
> http://www.netfort.gr.jp/~dancer/software/dsh.html.en and Moe at LLNL
> has suggested pdsh https://sourceforge.net/projects/pdsh/. If anyone
> has any additional suggestions, I would love to hear about them.
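> For what it's worth, a typical pdsh run looks something like this (the
> node list is just an example):
>
>   pdsh -w node[01-08] uptime | dshbak -c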
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
Department of Statistics & Actuarial Sciences
The University of Western Ontario
London, Ontario N6A 5B7