[R] Efficiently parallelize across columns of a data.table

Rebecca Payne rebeccapayne at gmail.com
Fri Aug 19 20:22:15 CEST 2016


I am trying to parallelize a task across columns of a data.table using
foreach and doParallel. My data is large relative to my system memory
(about 40%) so I'm avoiding making any copies of the full table. The
function I am parallelizing is pretty simple, taking as input a fixed set
of columns and a single unique column per task. I'd like to send only the
small subset of columns actually needed to the worker nodes. I'd also like
the option to only send a subset of rows to the worker nodes. My initial
attempts to parallelize did not work as expected, and seemed to copy the
entire data.table to every worker node.
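To make the shape of a single task concrete, here is a minimal serial sketch of what each worker actually needs: just two columns, and optionally a subset of rows. (oneTask and rowIdx are illustrative names only; the real column names are set up in the code further down.)

library(data.table)

# a single task in isolation: it needs only two columns of DT, and
# optionally only a subset of their rows
oneTask = function(DT, xName, yName, rowIdx = NULL) {
  x = DT[[xName]]
  y = DT[[yName]]
  if (!is.null(rowIdx)) {
    x = x[rowIdx]
    y = y[rowIdx]
  }
  cor(x, y)
}

# e.g. oneTask(DT, "X1", "Y") or oneTask(DT, "X1", "Y", rowIdx = 1:1000)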





### start code ###

library(data.table)

library(foreach)

library(doParallel)

registerDoParallel()



anotherVar = "Y"

someVars = paste0("X", 1:20)

N = 100000000

# I've chosen N such that my R session consumes ~15 GB of memory according
# to top right after DT is created
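# (back-of-the-envelope: 21 columns * 1e8 rows * 8 bytes per double
#  is roughly 16.8 GB, i.e. ~15.6 GiB)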

DT = as.data.table(matrix(rnorm(21*N), ncol=21))

setnames(DT, c(anotherVar, someVars))



MyFun = function(inDT, inX, inY){

  cor(inDT[[inX]], inDT[[inY]])

}



# Warning: will throw an error on the Mac GUI

corrWithY_1 = foreach(i = seq_along(someVars), .combine = c) %dopar%
  MyFun(DT[, c(anotherVar, someVars[i]), with = FALSE], someVars[i], anotherVar)

# Watching top, all of the slave nodes also appear to consume the full
# ~15 GB of system memory



gc()



# So I tried creating an entirely separate subset of DT to send to the
# slave nodes, and then removing it by hand.

# This task, too, appears to take ~15 GB of memory per slave node
# according to top.



MyFun2 = function(DT, anotherVar, uniqueVar){

  tmpData = DT[, c(anotherVar, uniqueVar), with=FALSE]

  out = MyFun(tmpData, anotherVar, uniqueVar)

  rm(tmpData)

  return(out)

}



corrWithY_2 = foreach(i = seq_along(someVars), .combine = c) %dopar%
  MyFun2(DT, anotherVar, someVars[i])



### end code ###



Another thing I've tried is to send only the name of DT and its
environment to the slave nodes, but `get` doesn't seem to be able to
return only a subset of rows from DT, which I would need to do frequently.
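Roughly the shape of that attempt, as a sketch (it assumes a fork-based
backend so the workers can see DT in the global environment; dtName and
corrWithY_3 are illustrative names only):

dtName = "DT"

corrWithY_3 = foreach(i = seq_along(someVars), .combine = c) %dopar% {
  # look the table up by name on the worker instead of referencing DT
  # directly, so that foreach doesn't auto-export it
  d = get(dtName, envir = .GlobalEnv)
  # get() hands back the whole table; I don't see a way to pull back only
  # the rows I need at this point
  MyFun(d[, c(anotherVar, someVars[i]), with = FALSE], someVars[i], anotherVar)
}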



Questions:

1. Is top accurately reflecting my R session's memory usage?

2. If so, is there a way to parallelize over the columns of a data.table
without copying the entire table to every slave node?



