[R] Parallel package guidance needed
Therneau, Terry M., Ph.D.
therneau at mayo.edu
Tue Jan 17 03:19:48 CET 2017
I have a process that I need to parallelize, and have a question about two
different ways to proceed. It is essentially an MCMC exploration where
the likelihood is a sum over subjects (6000 of them), and the per-subject
computation is the slow part.
Here is a rough schematic of the code using one approach:
mymc <- function(formula, data, subset, na.action, id, etc) {
    # lots of setup, long but computationally quick
    hlog <- function(thisid, param) {
        # compute the loglik for this subject
        ...
    }
    uid <- unique(id)   # multiple data rows for each subject
    for (i in 1:burnin) {
        param <- get_next_proposal()
        loglist <- mclapply(uid, hlog, param = param, mc.cores = 50)  # 50 of the 80 cores
        loglik <- sum(unlist(loglist))
        # process result
    }
    # Now the non-burnin MCMC iterations
}
The second approach is to put cluster formation outside the loop, e.g.,
...
clust <- makeForkCluster(50)   # start the 50 workers once
for (i in 1:burnin) {
    param <- get_next_proposal()
    loglist <- parLapply(clust, uid, hlog, param = param)
    loglik <- sum(unlist(loglist))
    # process result
}
# rest of the code
stopCluster(clust)
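A variant of the second approach that I have not yet timed, in case the issue is
task granularity or load balance rather than where the cluster is created:
pre-chunk the 6000 ids so each worker receives a few larger tasks per iteration,
and hand the chunks to parLapplyLB. This is a sketch only, untested; hlog_chunk
is a hypothetical wrapper around the hlog above.

nchunk <- 200   # roughly 4 tasks per worker, so slow subjects can even out
chunks <- split(uid, cut(seq_along(uid), nchunk, labels = FALSE))
hlog_chunk <- function(ids, param)   # loglik for a block of subjects
    sum(sapply(ids, hlog, param = param))

clust <- makeForkCluster(50)
for (i in 1:burnin) {
    param <- get_next_proposal()
    loglik <- sum(unlist(parLapplyLB(clust, chunks, hlog_chunk, param = param)))
    # process result
}
stopCluster(clust)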
------------------
On the face of it, the second looks like it "could" be more efficient, since it
only starts and stops the subprocesses once. A short trial on one of our
cluster servers seems to say the opposite: the load average on a quiet machine
never gets much over 5-6 using method 2, but is in the 20s for method 1
(detectCores() = 80 on the box; we used mc.cores = 50). Wall time for method 2
is looking to be several hours.
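To separate the per-iteration dispatch overhead from the real work, I can run a
stripped-down comparison along these lines (fake is a dummy stand-in for hlog;
the numbers will obviously be machine dependent):

library(parallel)
fake <- function(i, param) { Sys.sleep(0.001); i + param }   # trivial workload

system.time(for (k in 1:20)
    sum(unlist(mclapply(1:6000, fake, param = k, mc.cores = 50))))

clust <- makeForkCluster(50)
system.time(for (k in 1:20)
    sum(unlist(parLapply(clust, 1:6000, fake, param = k))))
stopCluster(clust)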
Any pointers to documentation/discussion at this level would be much appreciated. I'm going to be fitting a lot of models.
Terry T.