[Rd] document environment passing in parallel::parLapply
Gabe Newell
payday2stats at gmail.com
Mon Dec 11 18:45:33 CET 2017
The runtime of parallel::parLapply depends on variables unrelated to
the parLapply call. However, this is not clearly documented. Therefore
I would like to suggest expanding the relevant documentation to
explain this behaviour.
Consider this example:
parallel_demo <- function(random_values_count) {
  some_data <- runif(random_values_count)
  dummy_function <- function(x) {
    x
  }
  cluster <- parallel::makeCluster(3)
  start <- proc.time()
  parallel::parLapply(cluster, 1:3, dummy_function)
  runtime <- proc.time() - start
  parallel::stopCluster(cluster)
  runtime
}
parallel_demo(10)
parallel_demo(100 * 1000 * 1000)
On my machine, this results in a measured runtime of 0.01 seconds
being returned for the first call to parallel_demo, but in a runtime
of 7.04 seconds being returned for the second call.
I could not find clear documentation in either ?parallel::parLapply or
vignette("parallel", package = "parallel") - or any other obvious
place - that explains the demonstrated difference in runtime.
Based on the observations described above (and on lots of additional
tests), my _assumption_ is that parallel::parLapply passes the whole
environment of its "fun" argument to all cluster nodes, which of
course takes some time. Thus the more data there is in this
environment, the longer this takes, even though the environment data
might not be needed to execute the function "fun".
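This assumption can be checked without a cluster. As far as I
understand, socket clusters send "fun" to the nodes via serialize(),
and a closure created inside a function call is serialized together
with its enclosing environment. The following is only a minimal
sketch; the helper make_closure and the chosen sizes are hypothetical
and serve purely as illustration:

```r
# Hypothetical helper: returns a trivial function whose enclosing
# environment happens to contain a vector of n random numbers.
make_closure <- function(n) {
  some_data <- runif(n)
  function(x) {
    x
  }
}

small <- make_closure(10)
large <- make_closure(1000 * 1000)

# serialize(obj, NULL) returns the serialized object as a raw vector,
# so its length is the number of bytes that would have to be transferred.
size_small <- length(serialize(small, NULL))
size_large <- length(serialize(large, NULL))

# The two functions have identical bodies, yet the serialized sizes
# differ by roughly the size of some_data (8 bytes per double).
print(c(small = size_small, large = size_large))
```

If the assumption holds, the second size should exceed the first by
about 8 MB, although neither function ever touches some_data.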
For environments with lots of data in them, this can considerably slow
down the computation at hand. At the same time, this behaviour of
passing all data in the environment of "fun" to the cluster nodes is
not clearly documented. The only - rather vague - hint that I found
about this is in the "extended examples" section (specifically on page
13, in section 10.4) of vignette("parallel", package = "parallel").
Furthermore, in my opinion, this behaviour is not something that every
R user would readily expect. Therefore I want to suggest expanding
the documentation of parallel::parLapply so that it explicitly
states that the environment of "fun" has to be passed to
all cluster nodes, which may take some time.
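Such a note could also mention a possible workaround. One option,
sketched here under the assumption that "fun" does not need any of the
local variables, is to re-parent the function to the global
environment, which is serialized by reference only. The names
build_worker and detach_env are hypothetical and not part of the
parallel package:

```r
# Hypothetical variant of parallel_demo that returns the worker function
# instead of running it, so that its serialized size can be inspected.
build_worker <- function(random_values_count, detach_env = FALSE) {
  some_data <- runif(random_values_count)
  dummy_function <- function(x) {
    x
  }
  if (detach_env) {
    # Re-parent the closure to the global environment. This is only safe
    # because dummy_function does not refer to any local variable.
    environment(dummy_function) <- globalenv()
  }
  dummy_function
}

attached <- build_worker(1000 * 1000)
detached <- build_worker(1000 * 1000, detach_env = TRUE)

# Number of bytes needed to serialize each function.
size_attached <- length(serialize(attached, NULL))
size_detached <- length(serialize(detached, NULL))
print(c(attached = size_attached, detached = size_detached))
```

Defining "fun" at top level (or in a package) should have the same
effect, because its environment is then not the frame that holds the
large data.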
I spent a considerable amount of time on figuring out why my
parallelization code didn't really speed up my calculations, and I
would like to save others from going through this hassle again. :-)
For the sake of completeness, here is my session info:
> version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status
major          3
minor          4.3
year           2017
month          11
day            30
svn rev        73796
language       R
version.string R version 3.4.3 (2017-11-30)
nickname       Kite-Eating Tree
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.3 parallel_3.4.3 tools_3.4.3 yaml_2.1.14
Martin