Dear R-users,
I am looking for a way to parallelize my PLSR predictions in order to save processing time. I tried the "foreach" construct with "doParallel" (see the second part of the code below), but I was unable to store both the predicted values and the model performance measure (RMSEP) in the output variable.
My code:
set.seed(10000) # generate some data...
mat <- replicate(100, rnorm(100))
y <- mat[, 1, drop=FALSE] # keep y as a one-column matrix
x <- mat[,2:100]
eD <- dist(x, method = "euclidean") # distance matrix to find close samples
eDm <- as.matrix(eD)
kns <- matrix(NA,nrow(x),10) # empty matrix to allocate 10 closest samples
for (i in 1:nrow(eDm)) { # identify closest samples in a loop and allocate to kns
  kns[i,] <- head(order(eDm[,i]), 11)[-1]
}
I consider the code so far to be "safe", but the next part is giving me trouble, since I have never used the "foreach" construct before:
library(pls)
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
out <- foreach(j = 1:nrow(mat), .combine="rbind", .packages="pls") %dopar% {
  pls <- plsr(y ~ x, ncomp=5, validation="CV", subset=kns[j,])
  predict(pls, ncomp=5, newdata=x[j,,drop=F])
  RMSEP(pls, estimate="CV")$val[1,1,5]
}
stopCluster(cl)
As I understand it, the third-to-last code line, starting with "RMSEP(pls,...", simply overwrites the value produced by the preceding "predict" line. Somehow I was assuming the .combine option would take care of this?
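A minimal sketch of what seems to be happening, assuming the usual R rule that a braced block evaluates to its last expression only. The helpers fake_pred and fake_rmsep are hypothetical stand-ins for the predict() and RMSEP() calls, and %do% is used instead of %dopar% so that no cluster is needed; the value returned by the loop body is handled the same way in both cases.

```r
library(foreach)

# Hypothetical stand-ins for the per-sample prediction and RMSEP (not from pls)
fake_pred  <- function(j) j * 10
fake_rmsep <- function(j) j / 10

# As in the loop above: the first value is computed but discarded,
# so only the last expression reaches .combine
last_only <- foreach(j = 1:3, .combine = rbind) %do% {
  fake_pred(j)
  fake_rmsep(j)
}

# Returning both values in one named vector gives one row per iteration
both <- foreach(j = 1:3, .combine = rbind) %do% {
  c(pred = fake_pred(j), rmsep = fake_rmsep(j))
}

print(dim(last_only)) # one column only
print(both)           # columns "pred" and "rmsep"
```

If this assumption holds, .combine only decides how the per-iteration return values are stacked; it never sees intermediate results inside the body.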
Many thanks for your help!
Best, Chega