[R-sig-hpc] snow and foreach memory issue?

Martin Morgan mtmorgan at fhcrc.org
Sun Dec 13 18:50:07 CET 2009


Zhang, Ivan wrote:
> Hi everyone,
> 
> I have a question regarding RAM usage of snow and foreach.
> I am running Windows XP with 4 GB RAM and an Intel 2.66 GHz quad-core.
> 
> I recently tried to implement multicore processing using 'multicore' and
> 'foreach', until I realized that multicore didn't work well on Windows
> and switched to using 'snow' and 'foreach', which works nicely.
> 
> I hashed out my own method without %dopar%; however, for some reason it
> was eating up my RAM really quickly. I wanted to see if anyone can
> figure out why; perhaps there is something I don't understand about
> snow, as I only recently started using it.
> 
> Suppose X_mat is a series of regressors for Y.
> 
> datapoints is a very large matrix of points.
> 
> The pseudo code is as follows:
> 
> someFunc = function(cl, ...) {
>   for (i in 1:n) {
>     model = generateModel(Y[,i], X_mat)
>     dp = iter(datapoints, by = 'row',
>               chunksize = floor(nrow(datapoints) / NUMCORES))
> 
>     # couldn't figure out how to get clusterExport to work within a function
>     assign("model", model, .GlobalEnv)
>     # this only reads from the global environment?
>     clusterExport(cl, "model")
> 
>     pred <- do.call(c, clusterApply(cl, as.list(dp), function(x)

as.list(iter(datapoints, <etc>)) makes a full copy of datapoints.
clusterApply creates another copy, distributed across cores. So if
they're on the same machine you now use 3 * sizeof(datapoints) memory,
even before any calculations are done on the workers. Ouch.

A first approach might be iter(datapoints, chunksize=nrow(datapoints) / N)
coupled with clusterApplyLB -- there will be N chunks (assuming N >
NUMCORES), and clusterApplyLB will only ever have in play a portion
NUMCORES / N of datapoints (clusterApply would divide the N chunks into
two groups, again forwarding the entire data to the workers!), so the
memory use will be (2 + NUMCORES / N) * sizeof(datapoints).
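
Roughly, in code (a sketch only, reusing cl, datapoints, model and
NUMCORES from the pseudo code above; the loop that drains the iterator
into a list is just one way to realize the chunks, not a snow or
iterators idiom):

  library(snow)
  library(iterators)

  N  <- 4 * NUMCORES                        # more chunks than workers
  dp <- iter(datapoints, by = "row",
             chunksize = ceiling(nrow(datapoints) / N))

  ## realize the N chunks -- still a second full copy of datapoints
  chunks <- list()
  repeat {
      x <- tryCatch(nextElem(dp), error = function(e) NULL)
      if (is.null(x)) break
      chunks[[length(chunks) + 1L]] <- x
  }

  ## clusterApplyLB hands one chunk to each idle worker, so only about
  ## NUMCORES / N of datapoints sits on the workers at any one time;
  ## passing model as an argument avoids the assign()/clusterExport() dance
  pred <- do.call(c, clusterApplyLB(cl, chunks,
                                    function(x, m) predict(m, x), model))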

A better approach is to avoid the duplication implied by
as.list(iter(<etc>)). This would require an implementation like
snow::dynamicClusterApply, where the first NUMCORES chunks of iter() are
forwarded to the workers, and then a loop is entered where the manager
receives one result and forwards the next chunk to the worker that
provided the result. Memory use would then be (1 + NUMCORES / N) *
sizeof(datapoints). Presumably this is the strategy taken by %dopar%.
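
To put rough numbers on it: with a 1 GB datapoints, NUMCORES = 4 and
N = 16 chunks, the three strategies come to about 3 GB (clusterApply
over the full list of chunks), 2.25 GB (clusterApplyLB, 2 + 4/16) and
1.25 GB (dynamic dispatch, 1 + 4/16), before counting any copies made
by the computations on the workers.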

multicore should be the winner here, though, since all workers should
have access to the data without copying -- sizeof(datapoints) memory
use. I haven't used multicore extensively, and especially not on
Windows. When you say "it didn't work well", it would be helpful to
understand why. My limited experimentation suggested no problems when
used with data sets that were not too close to the windows memory
limits. Perhaps you are really just running out of memory, and multicore
is not reporting this as nicely as it could? I'm sure the multicore
author would appreciate something more precise in terms of user experience.
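
Concretely, on a POSIX machine the multicore version might look
something like this (a sketch only, reusing datapoints, model and
NUMCORES from above; fork-based multicore will of course not help on
Windows XP):

  library(multicore)

  n   <- nrow(datapoints)
  idx <- split(seq_len(n), cut(seq_len(n), NUMCORES, labels = FALSE))

  ## each forked child sees datapoints and model from the parent via
  ## copy-on-write; only the rows it extracts are actually duplicated
  pred <- do.call(c, mclapply(idx,
      function(i) predict(model, datapoints[i, , drop = FALSE]),
      mc.cores = NUMCORES))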

A final consideration is that calculations on the workers are likely to
duplicate a subset of datapoints, so that actual memory use will include
an additional component that scales approximately linearly with
NUMCORES. If the worker computations are memory intensive, then you'll
quickly find yourself in trouble again.
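
For example, if each worker's predict() call makes even one extra copy
of the chunk it was handed, that is roughly another NUMCORES *
sizeof(chunk) in use at any one time, and more if the worker code is
less frugal.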

Hope that helps,

Martin


>       predict(model, x)))
> 
>     ...
>   }
> }
> 
> When I ran this code, my RAM usage would go up from the initial 2.5 GB
> in increments of about 500 MB after each run until it ate up the full
> 4 GB. So for n >= 3, the computer would slow to a crawl.
> 
> When I found out about the new registerDoSNOW, it improved my
> performance (props to Stephen Weston). Here's the pseudo code for the
> equivalent of the above.
> 
> someFunc = function(cl, ...) {
>   registerDoSNOW(cl)
>   for (i in 1:n) {
>     model = generateModel(Y[,i], X_mat)
>     pred <- foreach(dp = iter(datapoints, by = 'row',
>                               chunksize = floor(nrow(datapoints) / NUMCORES)),
>                     .combine = c, .verbose = TRUE) %dopar% {
>       predict(model, dp)
>     }
>   }
> }
> 
> Aside from slightly shorter lines, the performance was more stable: RAM
> usage ran from 2.5 up to roughly 3.2 GB and stayed there, and it
> performed better because it wasn't running out of cache.
> 
> However, I just want to understand what the difference between the two
> treatments is that makes such a large difference in memory use, and
> whether I am doing something wrong in the first example.
> 
> Thanks,
> 
> -Ivan
> 
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


