[R] Snow and multi-processing

Blanchette, Marco MAB at stowers-institute.org
Sun Nov 30 01:59:08 CET 2008


Dear R gurus,

I have an embarrassingly parallel job that I am trying to speed up with snow on our local cluster. Basically, I am doing ~50,000 t-tests for a series of microarray experiments, one gene at a time, so I can easily spread the load across multiple processors and nodes.

So, I have a master list object that tells me which rows to pick up for each gene in order to do the t-test, from a series of microarray experiments containing ~500,000 rows and x columns per experiment.
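
For concreteness, the master list looks something like this (the gene names and row indices here are made up for illustration; only the $A/$B structure matters, matching how t.testFnc() below indexes the matrix):

theMapList <- list(
    gene1 = list(A = c(101, 102, 103), B = c(204, 205, 206)),
    gene2 = list(A = c(310, 311),      B = c(412, 413))
    ## ... one entry per gene, ~50,000 in total
)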

While trying to optimize my function using parLapply(), I quickly realized that I was not gaining any speed: every time a test was done on one of the items in the list, the 500,000-row by x-column matrix had to be shipped along with the item, and the transfer time was actually longer than the computing time.
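
Roughly, the slow pattern was something like the following sketch (simplified), with the big matrix passed along as an argument so that it gets serialized and shipped out with the work:

## Sketch of the slow pattern: mArrayData travels to the workers as part
## of the parLapply() dispatch instead of living on them.
Results <- parLapply(cl, theMapList, t.testFnc, array = mArrayData)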

However, if I first export the 500,000-row object across the spawned processes, as in this mock script:

library(snow)

## Start the cluster and build the big data matrix once on the master
cl <- makeCluster(nnodes, method)
mArrayData <- getData(experiments)
clusterExport(cl, 'mArrayData')  # push the matrix to every worker once

Results <- parLapply(cl, theMapList, function(x) t.testFnc(x))

together with a function that defines the mArrayData argument as a default parameter, as in:

t.testFnc <- function(probeList, array = mArrayData) {
    ## 'array' defaults to the mArrayData object exported to each worker
    x <- array[probeList$A, ]
    y <- array[probeList$B, ]
    res <- doSomeTest(x, y)
    return(res)
}

Using this strategy, I was able to take full advantage of my cluster and reduce the analysis time by the number of nodes in the cluster. The large data matrix was resident in each process and didn't have to travel over the network every time an item from the list was passed to the function t.testFnc().

However, I quickly realized that this (the call to clusterExport()) only works when I run the script one line at a time. When the process is enclosed in a function, the object mArrayData is not exported, presumably because it is not a global object in the master process.
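
In other words, something like this hypothetical wrapper fails to export the matrix:

## Illustration of the failing case: mArrayData is local to runAnalysis(),
## but clusterExport() looks the name up in the master's global
## environment, so nothing reaches the workers.
runAnalysis <- function(experiments, theMapList, cl) {
    mArrayData <- getData(experiments)
    clusterExport(cl, 'mArrayData')
    parLapply(cl, theMapList, function(x) t.testFnc(x))
}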

So, what is the alternative for pushing the contents of an object to the slaves? The documentation in the snow package is a bit light, and I couldn't find good examples out there. I don't want getData() evaluated on each node, because the arguments to that function are huge and that would cause way too much network traffic. I want the result of getData(), the object mArrayData, propagated to the cluster only once and available to downstream functions.
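
Would something along these lines be a reasonable way to do it? This is just a sketch (the pushToWorkers name is made up): send the already-computed object by value with clusterCall() and assign() it into each worker's global environment, so it crosses the network only once per node:

## Hypothetical helper: bind 'value' to 'name' in the global environment
## of every worker, shipping it only once per node.
pushToWorkers <- function(cl, name, value) {
    clusterCall(cl, function(n, v) {
        assign(n, v, envir = globalenv())
        NULL  # don't send the big object back to the master
    }, name, value)
    invisible(NULL)
}

## e.g. pushToWorkers(cl, 'mArrayData', getData(experiments)) from inside
## a function, followed by parLapply(cl, theMapList, t.testFnc)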

Hope this is clear and that a solution will be possible.

Many thanks

Marco

--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.
Kansas City, MO 64110

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018


