[R] Snow and multi-processing
Blanchette, Marco
MAB at stowers-institute.org
Sun Nov 30 01:59:08 CET 2008
Dear R gurus,
I have an embarrassingly parallel job that I am trying to speed up with snow on our local cluster. Basically, I am running ~50,000 t-tests for a series of microarray experiments, one gene at a time. Thus, I can easily spread the load across multiple processors and nodes.
So, I have a master list object that tells me which rows to pick up for each gene to do the t-test, from a series of microarray experiments containing ~500,000 rows and x columns per experiment.
While trying to optimize my function using parLapply(), I quickly realized that I was not gaining any speed: every time a test was done on one of the items in the list, the 500,000-row by x-column matrix had to be shipped along with that item, and the transfer time was actually longer than the computing time.
However, if I first export the 500,000-row object to the spawned processes, as in this mock script
cl <- makeCluster(nnodes,method)
mArrayData <- getData(experiments)
clusterExport(cl, 'mArrayData')
Results <- parLapply(cl, theMapList, function(x) t.testFnc(x))
with a function that takes mArrayData as the default value of one of its arguments, as in
t.testFnc <- function(probeList, array=mArrayData){
    x <- array[probeList$A,]
    y <- array[probeList$B,]
    res <- doSomeTest(x,y)
    return(res)
}
Using this strategy, I was able to take full advantage of my cluster and cut the analysis time by a factor of about the number of nodes in our cluster. The large data matrix was resident in each process and didn't have to travel over the network every time an item from the list was passed to the function t.testFnc().
However, I quickly realized that this (the call to clusterExport()) works only when I run the script one line at a time. When the whole process is enclosed in a function, the object mArrayData is not exported, presumably because it is then a local variable of that function rather than a global object in the master process.
So, what is the alternative for pushing the content of an object to the slaves? The documentation of the snow package is a bit light and I couldn't find good examples out there. I don't want to have the function getData() evaluated on each node, because the arguments to that function are humongous and that would cause way too much traffic on the network. I want the result of getData(), the object mArrayData, propagated to the cluster only once and then available to downstream functions.
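For illustration, here is a minimal sketch of one possible way to do this, not a tested answer for the real data. It assumes the parallel package bundled with R (whose clusterCall() and parLapply() are derived from snow) in place of snow itself, a small random matrix in place of getData(), and base t.test() in place of doSomeTest(). The names storeOnWorker and runAnalysis are hypothetical helpers. The idea is to send the local object once per worker with clusterCall() and have each worker assign it into its own global environment.

```r
library(parallel)  # assumption: used instead of snow so the sketch is self-contained

# Hypothetical helper, defined at top level so that serializing it
# does not drag along a calling function's local variables.
storeOnWorker <- function(name, value) {
  assign(name, value, envir = .GlobalEnv)
  NULL  # return nothing, so the matrix is not sent back to the master
}

# Same shape as t.testFnc() in the post; the default argument is
# looked up lazily in the worker's global environment.
t.testFnc <- function(probeList, array = mArrayData) {
  x <- array[probeList$A, ]
  y <- array[probeList$B, ]
  t.test(as.vector(x), as.vector(y))$p.value  # stand-in for doSomeTest()
}

runAnalysis <- function(cl, theMapList) {
  # Local object: it lives only in this function's frame on the master.
  localData <- matrix(rnorm(1000 * 6), nrow = 1000)  # stand-in for getData()
  # Ship the matrix once to every worker and store it as mArrayData there.
  clusterCall(cl, storeOnWorker, "mArrayData", localData)
  parLapply(cl, theMapList, t.testFnc)
}

cl <- makeCluster(2)
theMapList <- list(list(A = 1:5, B = 6:10), list(A = 11:15, B = 16:20))
Results <- runAnalysis(cl, theMapList)
stopCluster(cl)
```

The helper is kept outside runAnalysis() on purpose: a function created inside runAnalysis() would carry that function's environment, including the large matrix, every time it is serialized.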
Hope this is clear and that a solution will be possible.
Many thanks
Marco
--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110
Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018