[Rd] efficiency and memory use of S4 data objects
John Chambers
jmc at research.bell-labs.com
Thu Aug 21 12:05:02 MEST 2003
The general question is certainly worth discussing, but I'd be surprised
if your example is measuring what you think it is.
The numeric computations are almost the only thing NOT radically changed
between your two examples. In the first, you are applying a
"primitive" set of functions ("+", "$", and "$<-") to a basic vector.
These functions go directly to C code, without creating a context (aka
frame) as would a call to an S-language function. In the second
example, the "+" will still be done in C code, with essentially no
change since the arguments will still be basic vectors.
Just about everything else, however, will be different. If you really
wanted to focus on the numeric computation, your second example would be
more relevant with the loop being
for(i in 1:iter) object@x <- object@x + 1
In this case, the difference likely will be mainly the overhead for
"@<-", which is not a primitive. The example as written is adding a
layer of functions and method dispatch.
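A minimal sketch of that more focused comparison (using the class definition from the quoted message below; the function name addSlot is just for illustration) might be:

    setClass("MyClass", representation(x = "numeric"))

    addSlot <- function(len, iter) {
      object <- new("MyClass", x = rnorm(len))
      ## direct slot access: only "@" and "@<-" are involved here;
      ## the "+" itself still runs on a basic vector in C code
      for (i in 1:iter) object@x <- object@x + 1
      object
    }

    system.time(a <- addSlot(10^6, 10))

Comparing this timing against the list version isolates the cost of "@<-" from the cost of generic dispatch.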
But the lesson we've learned over many years (and with S generally, not
specifically with methods and classes) is that empirical inference about
efficiency is a subtle thing (somehow as statisticians you'd think we
would expect that). Artificial examples have to be very carefully
designed and analysed before being taken at face value.
R has some useful tools, especially Rprof, to look at examples in the
hope of finding "hot spots". It would be good to see some results,
especially for realistic examples.
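As a sketch of how such data might be gathered (assuming the addS4 function and class definitions from the message below), Rprof writes a sampling profile to a file which summaryRprof then tabulates by function:

    Rprof("addS4.out")          # start sampling the call stack
    a <- addS4(10^6, 10)
    Rprof(NULL)                 # stop profiling
    summaryRprof("addS4.out")   # time per function; dispatch overhead
                                # would show up here if it dominates

The per-function breakdown is what distinguishes genuine dispatch overhead from, say, copying induced by the replacement function.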
Anyway, on the general question.
1. Yes, there are lots of possibilities for speeding up method dispatch
& hopefully these will get a chance to be tried out, after 1.8. But I
would caution people expecting that method dispatch is the hot spot
_generally_. On a couple of occasions, it was because of introduced
glitches, and then the effect was obvious. There are some indirect
costs, such as creating a separate context for the method, and if these
are shown to be an issue, something might be done.
2. Memory use and the effect on garbage collection: Not too much has
been studied here & some good data would be helpful. (Especially if
some experts on storage management in R could offer advice.)
3. It might be more effective (and certainly more fun) to think of
changes in the context of "modernizing" some of the computations in R
generally. There have been several suggestions discussed that in
principle could speed up method/class computations, along with providing
other new features.
4. Meanwhile, the traditional S style that has worked well probably
applies. First, try out a variety of analyses taking advantage of
high-level concepts to program quickly. Then, when it's clear that
something needs to be applied extensively, try to identify critical
computations that could be mapped into lower-level versions (maybe even
C code), getting efficiency by giving up flexibility.
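A sketch of that style applied to the example below (the name addS4fast is hypothetical): extract the slot once, do the repeated numeric work on the basic vector, and assign it back once, so the accessor overhead is paid a constant number of times rather than once per iteration.

    addS4fast <- function(len, iter) {
      object <- new("MyClass", x = rnorm(len))
      xvec <- object@x                      # extract the basic vector once
      for (i in 1:iter) xvec <- xvec + 1    # primitive "+" only, all in C
      object@x <- xvec                      # put it back once
      object
    }

The object keeps its formal class for the rest of the analysis; only the inner loop gives up the accessor abstraction.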
Regards,
John
Gordon Smyth wrote:
>
> I do lots of analyses on large microarray data sets so memory use and speed
> are both important issues for me. I have been trying to estimate the
> overheads associated with using formal S4 data objects instead of ordinary
> lists for large data objects. In some simple experiments (using R 1.7.1 in
> Windows 2000) with large but simple objects it seems that giving a data
> object a formal class definition and using extractor and assignment
> functions may increase both memory usage and the time taken by simple
> numeric operations by several fold.
>
> Here is a test function which uses a list representation to add 1 to the
> elements of a long numeric vector:
>
> addlist <- function(len, iter) {
>   object <- list(x = rnorm(len))
>   for (i in 1:iter) object$x <- object$x + 1
>   object
> }
>
> Typical times on my machine are:
>
> > system.time(a <- addlist(10^6,10))
> [1] 2.91 0.00 2.96 NA NA
> > system.time(addlist(10^7,10))
> [1] 28.03 0.44 28.65 NA NA
>
> Here is a test function doing the same operation with a formal S4 data
> representation:
>
> addS4 <- function(len, iter) {
>   object <- new("MyClass", x = rnorm(len))
>   for (i in 1:iter) x(object) <- x(object) + 1
>   object
> }
>
> The timing with len=10^6 increases to
>
> > system.time(a <- addS4(10^6,10))
> [1] 6.79 0.06 6.90 NA NA
>
> With len=10^7 the operation fails altogether due to insufficient memory
> after thrashing around with virtual memory for a very long time.
>
> I guess I'm not surprised by the performance penalty with S4. My question
> is: is the performance penalty likely to be an ongoing feature of using S4
> methods or will it likely go away in future versions of R?
>
> Thanks
> Gordon
>
> Here are my S4 definitions:
>
> setClass("MyClass", representation(x = "numeric"))
> setGeneric("x", function(object) standardGeneric("x"))
> setMethod("x", "MyClass", function(object) object@x)
> setGeneric("x<-", function(object, value) standardGeneric("x<-"))
> setReplaceMethod("x", "MyClass", function(object, value) {
>   object@x <- value
>   return(object)
> })
>
> > version
> _
> platform i386-pc-mingw32
> arch i386
> os mingw32
> system i386, mingw32
> status
> major 1
> minor 7.1
> year 2003
> month 06
> day 16
> language R
>
> ______________________________________________
> R-devel at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
--
John M. Chambers jmc at bell-labs.com
Bell Labs, Lucent Technologies office: (908)582-2681
700 Mountain Avenue, Room 2C-282 fax: (908)582-3340
Murray Hill, NJ 07974 web: http://www.cs.bell-labs.com/~jmc