[R-sig-hpc] How a Slurm bug made parallel R jobs segfault, and our workaround

Seija Sirkiä seija.sirkia at csc.fi
Wed Apr 6 11:25:19 CEST 2016


Hey all, I thought I'd share this since we spent quite some time with it and it might be relevant to some of you. Or at least amusing, possibly.

So. Our customers run their stuff on our cluster, and resources are allocated using Slurm, business as usual. Out of the blue, one customer finds that their parallel R jobs, which used to run fine, now cause a segmentation fault. Sure enough, so do my previously working test scripts. Ordinary single-core sessions work, and nothing in our R installation has changed. So something in the system environment changed, but nothing else breaks, just R. Huh?

After a few days and a lot of coffee, the root cause is found: briefly, a very recent Slurm version update introduced (or exposed) code that looks at the name of the binary being started, and there is a test of the form "if string length is greater than one" where it should be "greater than zero" ... So if the binary is called just "R", the test gives the wrong answer and madness ensues.
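
The check itself lives in Slurm's C code; purely to show the shape of the off-by-one, here is a tiny shell sketch with a made-up variable name (progname), not the actual Slurm source:

    # Hypothetical illustration only, not the real Slurm code.
    progname="R"                        # name of the binary being started
    if [ "${#progname}" -gt 1 ]; then   # buggy: should be -gt 0
        echo "looks like a real program name"
    else
        echo "treated as empty/invalid" # one-letter names like R land here
    fi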

As a workaround, a symlink called RR was created next to the R binary in lib64/R/bin/exec, pointing to that binary, and the wrapper start script was tweaked to say R_binary="${R_HOME}/bin/exec${R_ARCH}/RR" instead. Works like a charm, it seems.
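
In case it helps anyone, the workaround looks roughly like this, sketched under the assumption that R_HOME points at the installation (here lib64/R), R_ARCH is empty, and the wrapper in question is ${R_HOME}/bin/R; adjust the paths to your own layout:

    # Workaround sketch; paths are assumptions, adjust to your installation.
    cd "${R_HOME}/bin/exec"   # with an empty R_ARCH this is lib64/R/bin/exec
    ln -s R RR                # RR is just another name for the same binary

    # Then edit the start-up wrapper (presumably ${R_HOME}/bin/R) so that
    #   R_binary="${R_HOME}/bin/exec${R_ARCH}/R"
    # becomes
    #   R_binary="${R_HOME}/bin/exec${R_ARCH}/RR"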

What I found funny is that this is definitely not the first time a one-letter name has caused some kind of trouble, and it probably won't be the last. Of course the genuine bug lies elsewhere, but I burst out laughing when I heard this. *That* is what was so special about it!

BR,
Seija S.


