[R] Efficiently parallelize across columns of a data.table
Rebecca Payne
rebeccapayne at gmail.com
Fri Aug 19 20:22:15 CEST 2016
I am trying to parallelize a task across the columns of a data.table using
foreach and doParallel. My data is large relative to my system memory (about
40% of it), so I want to avoid making any copies of the full table. The
function I am parallelizing is simple: it takes a fixed set of columns plus a
single column that is unique to each task. I'd like to send only the small
subset of columns actually needed to each worker node, and ideally I'd also
like the option to send only a subset of rows. My initial attempts to
parallelize did not work as expected and seem to copy the entire data.table
to every worker node.
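To make concrete the shape I'm after: per task, the worker really only needs
two column vectors. An illustrative sketch (yVec, xList and corrWithY_idea
are made-up names, and the objects come from the reproducible example below;
I don't know whether this actually avoids shipping DT to the workers, or
whether building xList itself duplicates the columns):

yVec  = DT[[anotherVar]]                       # the fixed Y column
xList = lapply(someVars, function(v) DT[[v]])  # one entry per X column
corrWithY_idea = foreach(x = xList, .combine = c) %dopar% cor(x, yVec)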
### start code ###
library(data.table)
library(foreach)
library(doParallel)
registerDoParallel()
anotherVar = "Y"
someVars = paste0("X", seq_len(20))
N = 100000000
# I've chosen N such that my R session consumes ~15GB of memory according to
# top right after DT is created
DT = as.data.table(matrix(rnorm(21*N), ncol=21))
setnames(DT, c(anotherVar, someVars))
# Correlation between two named columns of a (small) data.table
MyFun = function(inDT, inX, inY){
    cor(inDT[[inX]], inDT[[inY]])
}
# Warning: will throw an error in the Mac GUI
corrWithY_1 = foreach(i = seq_along(someVars), .combine = c) %dopar%
    MyFun(DT[, c(anotherVar, someVars[i]), with = FALSE], someVars[i],
          anotherVar)
# Watching top, all of the slave nodes also appear to consume the full ~15GB
# of system memory
gc()
# So I tried creating an entirely separate subset of DT to send to the slave
# nodes, and then removing it by hand. This task, too, appears to take ~15GB
# of memory per slave node according to top.
MyFun2 = function(DT, anotherVar, uniqueVar){
    tmpData = DT[, c(anotherVar, uniqueVar), with = FALSE]
    out = MyFun(tmpData, anotherVar, uniqueVar)
    rm(tmpData)
    return(out)
}
corrWithY_2 = foreach(i = seq_along(someVars), .combine = c) %dopar%
    MyFun2(DT, anotherVar, someVars[i])
### end code ###
Another thing I've tried is sending only the name of DT and its environment
to the slave nodes, but `get` doesn't seem to be able to retrieve just a
subset of rows from DT, which is something I would need to do frequently.
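Roughly, that attempt looked like the sketch below (rowIdx and corrWithY_3
are illustrative names; I believe the get() call only works here because the
workers are forked and inherit DT):

rowIdx = sample(N, 1e6)  # an example row subset I'd like each task to use
corrWithY_3 = foreach(i = seq_along(someVars), .combine = c) %dopar% {
    d = get("DT", envir = globalenv())  # returns the whole table, not a subset
    cor(d[[someVars[i]]][rowIdx], d[[anotherVar]][rowIdx])
}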
Questions:
1. Is top accurately reflecting my R session's memory usage?
2. If so, is there a way to parallelize over the columns of a data.table
without copying the entire table to every slave node?