[R] Re: [S] scalability

Patrick Burns pburns at pburns.seanet.com
Sat Mar 27 20:24:33 CET 2004


I think this is an interesting discussion -- I've learned from both
Steve's and Brian's comments, and I'm broadening it to R-help
since I think others will be interested as well.

The problem up for comment is:

result <- apply(array.3D, 1:2, sum)

Where array.3D is 3000 by 300 by 3.

The original poster already had a perfectly good replacement for
this problem that was virtually instantaneous.  A solution for this
particular problem is not the issue; it is merely the starting point
for cases where there wouldn't be a trivial workaround.
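
The original poster's replacement is not quoted in this thread.  For
concreteness, here is one vectorized form that avoids the per-cell
function call.  It is a sketch only: the small array `a` is a made-up
stand-in for array.3D, and this is not necessarily the replacement the
original poster used.

```r
# Hypothetical small stand-in for array.3D (the real one is 3000 x 300 x 3).
set.seed(1)
a <- array(rnorm(30 * 20 * 3), c(30, 20, 3))

# One function call per cell: the slow form under discussion.
slow <- apply(a, 1:2, sum)

# Sum over the third dimension in a single vectorized call.
# rowSums() with dims = 2 treats the first two dimensions as "rows".
fast <- rowSums(a, dims = 2)

# Equivalent here because the third dimension has exactly three slices.
fast2 <- a[, , 1] + a[, , 2] + a[, , 3]

stopifnot(isTRUE(all.equal(slow, fast)), isTRUE(all.equal(slow, fast2)))
```

Both forms do the arithmetic in compiled code, so the cost no longer
scales with the number of interpreted function calls.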

Steve Karmesin wrote:
 
SK> As others have said, what apply has to do in this case is loop over
SK> the 900,000 cases and do a 'sum' over three elements each time.  In
SK> this case the overhead of calling an S+ function totally swamps the
SK> numeric operations.
SK>
SK> Doing this on smaller datasets (300x30x3) on my machine (2CPU, 3GHz
SK> Xeon running Windows 2000 and S-Plus 6.1) shows an overhead of about
SK> 100 microseconds per call to sum, so I would expect it to take
SK> 100*1e-6*9e5=90 seconds.
SK>
SK> The thing is, it is worse than this.  If I do a case with 900x90x3
SK> it takes 300 usec per 'sum'.
SK>
SK> R is fairly stable at just under 15usec per 'sum' on my machine.
SK>
SK> A little more investigation (together with office mate Tony Plate)
SK> provides some insight.
SK>
SK> Using mem.tally.reset() and mem.tally.report() shows that for this
SK> case it is allocating a whopping 1280 bytes for each call to 'sum'.
SK>
SK> Just touching that much memory is going to be slow.  So why would it
SK> do that?  Looking at the definition of the apply function shows that
SK> it is allocating a general list for the result, not a vector-based
SK> array or matrix.
SK>
SK> Why?  It has a shortcut that lets it use efficient matrices if the
SK> input is a 2D matrix, but this one is 3D, so it uses the general
SK> code, which is much, much slower and uses a lot more memory.
SK>
SK> If you collapse the first two dimensions of the array the times are
SK> stable at <80usec per call to sum and it allocates 8 bytes per call,
SK> which is just the amount of space needed.
SK>
SK> Still, the R code seems to always build a list, and it is about
SK> 15usec per call. Somehow the underlying function call and perhaps
SK> list storage mechanisms are more efficient there.
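
To make the "collapse the first two dimensions" step concrete, here is
a minimal sketch in R.  The same idea should carry over to S-PLUS, but
that is an assumption.  The array `a` and the helper `per.call.usec`
are made up for illustration, and the figure the helper returns is only
a rough per-call time that includes apply's own looping overhead.

```r
set.seed(1)
a <- array(rnorm(300 * 30 * 3), c(300, 30, 3))

# Collapse dimensions 1 and 2 into one, giving a 9000 x 3 matrix.
# Arrays are stored with the first index varying fastest, so row k of
# 'm' holds the three values for cell (i, j) with k = i + (j - 1) * 300.
m <- a
dim(m) <- c(prod(dim(a)[1:2]), dim(a)[3])

# apply() over the rows of a plain matrix, then restore the 2D shape.
res <- apply(m, 1, sum)
dim(res) <- dim(a)[1:2]

stopifnot(isTRUE(all.equal(res, apply(a, 1:2, sum))))

# Rough elapsed microseconds per call of 'f' when applied over 'margin'.
per.call.usec <- function(x, margin, f) {
  n.calls <- prod(dim(x)[margin])
  elapsed <- system.time(apply(x, margin, f))[3]
  unname(1e6 * elapsed / n.calls)
}
per.call.usec(m, 1, sum)
```

Dividing the elapsed time by the number of calls is the same arithmetic
as the 100*1e-6*9e5 estimate above, run in the other direction.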

Prof Brian Ripley wrote:

BR> There are almost always pros and cons with these issues.  S's sum() is an 
BR> S4 generic whereas R's is internal *unless* you define an S4 method for 
BR> it (which S-PLUS has already done).  S needs to create several frames for 
BR> what is a nested set of function calls -- 1280b looks modest for that.
BR> 
BR> Also, S has an ability to back out calculations that R does not, and that 
BR> costs memory (and can have benefits).
BR> 
BR> We know there are overheads in making functions generic, especially 
BR> S4-generic, but then there are benefits too.  I am not sure designers who 
BR> add features take enough account of the costs.

Using R 1.8.1 (precompiled) on SuSE Linux with a 2.4GHz Xeon and 1GB of
memory:

set.seed(2)
jja <- array(rnorm(3000*300*3), c(3000, 300, 3))
gc()
system.time(jjsa <- apply(jja, 1:2, sum)) # takes 30 seconds

sumS3 <- function(x, ...) UseMethod("sumS3")
sumS3.default <- function(x, ...) sum(x, ...)
gc()
system.time(jjsa3 <- apply(jja, 1:2, sumS3)) # takes 65 seconds

setGeneric("sumS4", function(x, ...) standardGeneric("sumS4"))
setMethod("sumS4", signature(x="numeric"), function(x, ...) sum(x, ...))
gc()
system.time(jjsa4 <- apply(jja, 1:2, sumS4)) # takes 58 seconds

Questions:

It looks to me like the penalty for making the functions generic is
similar to one extra function call.  Does the penalty grow as there
are more methods?  Are there other types of penalties for making
a function generic?
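
The first question can be probed empirically.  A minimal sketch for
the S3 case follows; the generic `gen`, the `dummy` classes and the
helper `time.calls` are all made up for illustration, and no outcome is
claimed here.  An analogous experiment for S4 would use setGeneric and
setMethod instead.

```r
# Hypothetical S3 generic used only for this timing experiment.
gen <- function(x, ...) UseMethod("gen")
gen.default <- function(x, ...) sum(x, ...)

x <- c(1, 2, 3)
n <- 20000

# Elapsed seconds for n dispatches of gen() on the same small vector.
time.calls <- function() {
  unname(system.time(for (i in seq_len(n)) gen(x))[3])
}

before <- time.calls()

# Define 50 additional methods for made-up classes dummy1 ... dummy50.
for (k in 1:50) {
  assign(paste("gen.dummy", k, sep = ""),
         function(x, ...) stop("not used in this experiment"))
}

after <- time.calls()

# Compare the two timings; repeat several times before drawing conclusions.
c(before = before, after = after)
```

A single pair of timings is noisy, so the comparison only means
something if it holds up over repeated runs.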

Is the test with sumS4 still an unfair comparison with S-PLUS?

Are things better with S-PLUS 6.2?

Patrick Burns

Burns Statistics
patrick at burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and "A Guide for the Unwilling S User")



