[R] NetWorkSpace from REvolution; Distributed Computing setup questions

Timothy Murphy timothyjosephmurphy at gmail.com
Fri Oct 29 23:20:33 CEST 2010


***Summary:***

I'm setting up a cluster using NetWorkSpaces, and I'm having issues
with the sleigh initialization. My R function to initialize the sleigh
succeeds and the sleigh appears to be ready, but I get apparently
conflicting information from "status(s)", "rankCount(s)", and "s"; and
basic sleigh functions cause the sleigh to hang indefinitely.

Also, the log file contains an error that indicates that the script is
trying to find a file in a nonexistent directory:
"/usr/local/lib/R/site-library/nws/bin/RNWSSleighWorker.sh: 37:
/Library/Frameworks/R.framework/Resources/bin/R: not found" (see
section 4).

I've spent quite a bit of time trying to debug this, and I've gathered
here all the information that I think may be pertinent to solving the
problem. The following is therefore a bit lengthy, but I think
complete (as far as I'm able to tell from the existing documentation).
It's organized into sections roughly by the topic tested.

So if you're familiar with the workings of netWorkSpaces, I would be
very grateful if you would take a look at my diagnostics below and
tell me if you can identify the problem.

***Details:***

***Section 1***
Currently my setup is:

MASTER:
MacBook Pro running OS X 10.6.4, R-2.11.1, Python 2.6.1, and NWSserver-2.0.0.

WORKER:
Optiplex GX620 running Ubuntu 10.10 (64bit), R-2.11.1, Python 2.6.6,
NWSserver-2.0.0, and NWS-2.0.0.3 (client)
R and Python are in the PATH on both machines; I can start them from
the worker's command line by typing "R" or "python".
The client is able to find the "RNWSSleighWorker.sh" file.

(Note 1: I put the server software on the client because I was getting
a message saying "No nws server found" each time I tried to install
the client software. I don't know if this is needed.)
(Note 2: I plan to set up many more machines if I can get this working)
(Note 3: Originally I was trying this on Windows machines with Cygwin,
but I encountered the same error and figured I could at least rule out
a possible cause by setting it up on a linux machine. Ultimately I
would like to get this working in Windows/Cygwin.)

***Section 2***
The function I used to start the sleigh is:

s=sleigh(
+	nwsHost="172.30.xx.xx",
+	nwsPort=8765,
+	launch=sshcmd,
+	nodeList=c("10.85.xxx.xxx"),
+	scriptExec=envcmd,
+	scriptDir="/usr/local/lib/R/site-library/nws/bin",
+	scriptName="RNWSSleighWorker.sh",
+	workingDir='~/tmp/',
+	logDir='~/tmp/',
+	outfile="outfileTest",
+	user="tj")

This function returns the message below and then a clear command prompt:

Executing command:
'/Library/Frameworks/R.framework/Resources/library/nws/bin/SleighWorkerWrapper.sh'
'ssh' '-f' '-x' '-l' 'tj' '10.85.101.109' 'env'
'RSleighName=10.85.101.109'
'RSleighNwsName=sleigh_ride_0450__nwssNGG4LF'
'RSleighUserNwsName=sleigh_user_0452__nwssNGG4LF' 'RSleighID=1'
'RSleighWorkerCount=1'
'RSleighScriptDir=/usr/local/lib/R/site-library/nws/bin'
'RSleighNwsHost=172.30.34.71' 'RSleighNwsPort=8765'
'RSleighWorkingDir=~/tmp/'
'RProg=/Library/Frameworks/R.framework/Resources/bin/R'
'RSleighWorkerOut=sleigh_ride_0450__nwssNGG4LF_0001.txt'
'RSleighLogDir=~/tmp/'
'/usr/local/lib/R/site-library/nws/bin/RNWSSleighWorker.sh'

If I type the name of the sleigh "s" as below, I get information that
makes it look like the sleigh is ready to receive commands:

> s
NWS Sleigh Object
NWS Host:	172.30.xx.xx:8765
Workspace Name:	sleigh_ride_0446__nwssNGG4LF
1 Worker Nodes:	10.85.xxx.xxx

Likewise, if I send a simple ssh command to the worker I get a response:
> system('ssh tj@10.85.101.109 date')
Fri Oct 29 15:15:10 EDT 2010

I can also communicate values between the two machines using the NWS
web server and the nwsStore() and nwsFetch() functions.

However, if I check the status of the sleigh using "status(s)" or
"rankCount(s)", I get less encouraging information:

> status(s)
$numWorkers
[1] 0
$closed
[1] 0

> rankCount(s6)
[1] 0

***Section 3***
I can access the NWS server through "localhost:8766" and see that
sleighs are being created. There are two entries: a sleigh_ride and a
sleigh_user; but the worker count in the sleigh_ride is also zero.

If I execute either of the following test sleigh functions, the sleigh
will hang indefinitely (though the R terminal will not hang, since I
used blocking=FALSE):
eachWorker(s5, Sys.info, eo=list(blocking=FALSE))
eachWorker(s, function() library(nws), eo=list(blocking=FALSE))

***Section 4***
***CRUX OF THE ISSUE (probably):***
Finally, three files get created in the "~/tmp" directory that I
specified as the logDir and workingDir, named: "outfileTest",
"RSleighSentinelLog_1000_1" , and
"sleigh_ride_0450__nwssNGG4LF_0001.txt". All three contain exactly the
same information:

"/usr/local/lib/R/site-library/nws/bin/RNWSSleighWorker.sh: 37:
/Library/Frameworks/R.framework/Resources/bin/R: not found"

What's puzzling is the "/Library/Frameworks/R.framework/Resources/bin/R"
part. That looks like an OS X-style path rather than an Ubuntu-style one. I
didn't specify that path anywhere, but it definitely exists on the
MacBook side. I notice that it occurs as the "RProg" value in the
message that's returned when I run the sleigh function, but I can't
set it as an option in the sleigh function; i.e., the following fails:

> s=sleigh(
+ 	nwsHost="172.30.xx.xx",
+ 	nwsPort=8765,
+ 	launch=sshcmd,
+ 	nodeList=c("10.85.xxx.xxx"),
+ 	scriptExec=envcmd,
+ 	scriptDir="/usr/local/lib/R/site-library/nws/bin",
+ 	scriptName="RNWSSleighWorker.sh",
+ 	workingDir='~/tmp/',
+ 	logDir='~/tmp/',
+ 	outfile="outfileTest",
+ 	user="tj",
+ 	RProg="/usr/bin/R",
+ 	verbose=TRUE)
Error in initialize(value, ...) : unused argument(s) RProg

(Note that "/usr/bin/R" is the response I get when I type "which R" on
either the master or worker machine, so I would expect it to be
a valid value for RProg.)
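
As a quick check to run on the worker, here is a minimal sketch
(assuming a POSIX shell) that tests whether a given RProg candidate is
actually an executable file there. The helper name check_rprog is
hypothetical and not part of nws; the two paths are the ones quoted
above.

```shell
#!/bin/sh
# check_rprog is a hypothetical helper, not part of nws.
# It reports whether a path is an executable file on this machine.
check_rprog() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        echo "found"
    else
        echo "not found"
    fi
}

# RProg value taken from the "Executing command" output in Section 2.
master_rprog="/Library/Frameworks/R.framework/Resources/bin/R"

echo "RProg from master: $(check_rprog "$master_rprog")"
# Candidate reported by "which R".
echo "/usr/bin/R: $(check_rprog /usr/bin/R)"
# What this machine's own PATH resolves R to, if anything.
command -v R || echo "R is not on PATH"
```

If the first line reports "not found" on the worker while the second
reports "found", that would be consistent with the error in the log
files.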

***Section 5***
Further, the first thing that the "RNWSSleighWorker.sh" script does
is: RProg=${RProg:-'R'}, which just creates an environment variable on
the worker machine with the value $RProg=R. The first time $RProg is
used is on line 36. If I execute this line on the worker machine, I
get a blank R command prompt that responds to any command with another
blank prompt.

tj@clusterWorker1:~$ $RProg --vanilla --slave <<'EOF' > ${RSleighLogFile} 2>&1 &
>
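
For reference, a minimal sketch (assuming a POSIX shell) of how the
RProg=${RProg:-'R'} line behaves. The expansion only falls back to R
when RProg is unset or empty; a value passed in through env, as in the
launch command in Section 2, is kept as-is. The function name
resolve_rprog is hypothetical and only mimics that one line of the
script.

```shell
#!/bin/sh
# resolve_rprog is a hypothetical stand-in for the first line of the
# worker script: keep RProg if it is set and non-empty, else use R.
resolve_rprog() {
    RProg=${RProg:-'R'}
    echo "$RProg"
}

# Case 1: RProg is not set, so the default applies.
unset RProg
default_case=$(resolve_rprog)
echo "unset: $default_case"

# Case 2: RProg arrives through env, as in the launch command.
env_case=$(env RProg=/Library/Frameworks/R.framework/Resources/bin/R \
    sh -c 'RProg=${RProg:-R}; echo "$RProg"')
echo "from env: $env_case"
```

Under this reading, the worker would only use its own R if the master
did not send an RProg value at all.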

I tried running the sleigh() function from the master after doing
this, thinking that it would activate R as a slave on the worker
machine and allow it to connect, but no luck.
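
One possible reading of the blank ">" prompt in Section 5: the <<'EOF'
on that line opens a shell here-document, so an interactive shell shows
its continuation prompt until a line containing only EOF is entered,
and R has not started yet at that point. The sketch below assumes a
POSIX shell and uses cat as a stand-in for "$RProg --vanilla --slave"
so that it runs without R installed.

```shell
#!/bin/sh
# cat stands in for "$RProg --vanilla --slave" (an assumption made so
# the sketch runs without R); it simply echoes its standard input.
prog=cat

# Everything between <<'EOF' and the line containing only EOF is fed
# to the program's standard input. Typed interactively, the shell
# shows its continuation prompt until that closing EOF line is entered.
heredoc_out=$($prog <<'EOF'
x <- 1 + 1
EOF
)
echo "$heredoc_out"
```

In the real script, the lines up to the closing EOF would be the R
code handed to the worker process.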

At this point, I've run out of ideas on things to test. I appreciate
it if you've read all this, and I'll appreciate it even more if you
can give me some help.

Thanks!
TJ Murphy


