[R-sig-hpc] simple question on R/Rmpi/snow/slurm configuration

Wed Jan 7 21:37:30 CET 2009

Hi Whit,

Regarding 4), my slurm setup actually disables it (ssh access to the
compute nodes), so users cannot log in remotely or execute commands on
any remote node. It seems slurm/munge take care of authentication and
remote execution.

Hao

PS: the following line was added to /etc/pam.d/common-auth:
account    required     /lib/security/pam_slurm.so

Whit Armstrong wrote:
> Thanks to everyone for helping me sort out these issues. I finally
> have our cluster up and running on all my nodes.
>
> Per Dirk's suggestion, below is a short checklist for anyone setting
> up a slurm/Rmpi/snow cluster.
>
> 1) ensure that UIDs and GIDs are identical across all nodes.
>
> We are using Windows authentication on our Linux servers, so we had to
> remove all the local slurm and munge UIDs and GIDs from /etc/passwd
> and create Windows users and groups for slurm and munge to ensure
> consistency across all nodes.  Alternatively, you can copy
> /etc/passwd to all the remote nodes, but that is a bit of a
> maintenance nightmare.
>
> 2) make sure all your nodes have the same munge.key.
>
> See "Creating a Secret Key" on this page:
> http://home.gna.org/munge/install_guide.html
>
> 3) make sure all nodes have the same slurm.key and slurm.conf.
>
> See: "Create OpenSSL keys" on this page:
> https://computing.llnl.gov/linux/slurm/quickstart_admin.html
>
> 4)  make sure you can ssh to the compute nodes with no password.
>
> Here is a good site:
> http://wiki.freaks-unidos.net/ssh%20without%20password
> Our setup has /home mounted on all nodes, so just storing the keys in
> /home/username/.ssh works.  If remote nodes do not have /home mounted,
> then you will need a different setup. This must be done separately for
> all users who will use the cluster.
>
> 5) try very hard to use the same Linux distribution across all nodes.
>
> Unfortunately, for us, this is not the case.  Our main server is
> RHEL5, and all our nodes are Ubuntu.  I had to manually
> compile/install Open MPI on the Red Hat server (as I was very unhappy
> with their packaged version).  My issue yesterday was due to orterun
> being installed in /usr/local/bin on the controller node (Red Hat) and
> in /usr/bin on the compute nodes (Ubuntu).  Open MPI seems to
> assume that orterun is in the same location on all machines, which
> resulted in the following error in slurmd.log:
> [Jan 05 14:05:00] [57.0] execve(): /usr/local/bin/orterun: No such file or directory
>
> Recompiling Open MPI on the RHEL server so that the orterun binary
> is in the same location as on the compute nodes finally fixed the
> problem.
>
> 6) in addition to rebooting nodes, also use "sudo scontrol reconfigure"
> to make sure that the slurm.conf file is reloaded on compute nodes.
>
> We kept getting jobs stuck in the completing state due to a UID/GID
> problem, which showed up as the following error:
> [Dec 31 12:58:22] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [Dec 31 12:58:22] debug:  _rpc_terminate_job, uid = 11667
> [Dec 31 12:58:22] error: Security violation: kill_job(2) from uid 11667
>
> This problem was finally resolved by rebooting all the compute nodes
> and running "sudo scontrol reconfigure" on all the nodes.
>
> 7) verify each component independently. Per Dirk: first basic MPI with
> one of the helloWorld examples, then Rmpi, then snow, then slurm.
>
> This allowed me to find the ssh problem with MPI, since slurm/munge
> are happy to authenticate with their shared keys rather than using
> ssh.
> _________________________________________________________________________________________________________________________
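
As a rough illustration of item 1) in the checklist above, here is a minimal Python sketch for spotting UID/GID mismatches. It assumes you have already collected passwd-format lines from each node (for example the output of `getent passwd slurm munge`, gathered however you like). The function names, node names and IDs are made up for illustration and are not part of slurm or munge.

```python
def parse_passwd(text):
    """Map user name -> (uid, gid) from passwd-format text."""
    ids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # not a passwd-style entry
        ids[fields[0]] = (int(fields[2]), int(fields[3]))
    return ids


def id_mismatches(passwd_by_node, users=("slurm", "munge")):
    """Return {user: {node: (uid, gid) or None}} for users whose IDs differ.

    A user missing on some (but not all) nodes is reported with None
    for those nodes.
    """
    mismatches = {}
    for user in users:
        seen = {node: parse_passwd(text).get(user)
                for node, text in passwd_by_node.items()}
        if len(set(seen.values())) > 1:
            mismatches[user] = seen
    return mismatches


# Hypothetical sample data: munge has a different UID/GID on node1.
sample = {
    "head": "slurm:x:600:600::/var/lib/slurm:/bin/false\n"
            "munge:x:601:601::/var/lib/munge:/bin/false",
    "node1": "slurm:x:600:600::/var/lib/slurm:/bin/false\n"
             "munge:x:602:602::/var/lib/munge:/bin/false",
}
print(id_mismatches(sample))
```

An empty result means the listed users resolve to the same UID and GID on every node that was checked. It says nothing about other accounts.
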
>
>
> I hope this checklist can serve as a useful guide for anyone who faces
> the harrowing task of setting up a cluster.  Now that the hard part is
> done we are seeing close to linear speedups on our simulations, so the
> end result is worth the pain.
>
> The next chore for me is node maintenance.  Dirk has suggested dsh
> (dancer's shell):
> http://www.netfort.gr.jp/~dancer/software/dsh.html.en and Moe at LLNL
> has suggested pdsh: https://sourceforge.net/projects/pdsh/.  If anyone
> has any additional suggestions, I would love to hear them.
>
> Cheers,
> Whit
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>

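Items 2) and 3) in the quoted checklist both come down to a file (munge.key, slurm.key, slurm.conf) being byte-identical on every node. One way to check this, sketched below in Python, is to compare SHA-256 digests. The sketch assumes the file contents or digests have already been gathered per node by some other means (munge.key is normally readable only by a privileged user). The helper names and node names are hypothetical.

```python
import hashlib
from collections import Counter


def sha256_hex(data):
    """Hex SHA-256 digest of a file's contents, given as bytes."""
    return hashlib.sha256(data).hexdigest()


def odd_nodes_out(digest_by_node):
    """Return the sorted nodes whose digest differs from the most common one."""
    if not digest_by_node:
        return []
    reference, _count = Counter(digest_by_node.values()).most_common(1)[0]
    return sorted(node for node, digest in digest_by_node.items()
                  if digest != reference)


# Hypothetical example: node2 carries a stale copy of the file.
digests = {
    "head": sha256_hex(b"shared-key-material"),
    "node1": sha256_hex(b"shared-key-material"),
    "node2": sha256_hex(b"old-key-material"),
}
print(odd_nodes_out(digests))
```

The majority digest is used as the reference only for convenience. With just two nodes, or an even split, the function can still tell you that a mismatch exists but not which copy is the right one.
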
-- 
Department of Statistics & Actuarial Sciences
The University of Western Ontario
London, Ontario N6A 5B7
Office Phone#:(519)-661-3622
Fax Phone#:(519)-661-3813
http://www.stats.uwo.ca/faculty/yu