[Rd] Lightweight data frame class
vograno at evafunds.com
Fri Nov 26 00:31:07 CET 2004
As far as I can tell, the data.frame class adds two features to those of a list:
* matrix structure via [,] and [,]<- operators (well, I know these are
actually "["(i, j, ...), not "[,]").
* row names attribute.
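
A small illustration of these two features, using a made-up two-column data frame (the name df and its values are only for this example):

```r
# A data frame is stored as a list of equal-length columns plus attributes.
df <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5))

is.list(df)             # the underlying storage is a list
names(attributes(df))   # includes "row.names" alongside "names" and "class"
df[2, "b"]              # matrix-style subscripting, i.e. "["(df, 2, "b")
```
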
It seems that the overhead of supporting row names, both in computation
and in RAM, is non-trivial. I frequently subscript data frames, i.e. use
[,] on them, and my timing shows that the equivalent list operation is
about 7 times faster; see the P.S. below.
On the other hand, at least in my usage pattern, I rarely benefit from
the row names attribute, so as far as I am concerned row names are just
overhead. (Of course the speed difference may be due to other factors;
the only thing I can tell is that subscripting is very slow in data
frames relative to lists.)
I thought of writing a new class, say lightweight.data.frame, that would
be polymorphic with the existing data.frame class. The class would
inherit from "list" and implement [,], [,]<- operators. It would also
implement the "rownames" function that would return seq(nrow(x)), etc.
It should also implement as.data.frame to avoid the overhead of
conversion to a full-blown data.frame in calls like lm(y ~ x,
Has anyone thought of this? Can you see any potential problems?
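
To make the idea concrete, here is a minimal sketch of such a class using S3 methods. It is only an illustration under several assumptions: the constructor and all method bodies are hypothetical, only the two-index form x[i, j] is handled (x[i] with a single index is treated as row subscripting, unlike data.frame), the [,]<- operator is left out, and row names are exposed through the row.names generic rather than stored.

```r
# Hypothetical sketch: a list-backed table that stores no row names.
lightweight.data.frame <- function(...) {
  cols <- list(...)
  n <- unique(vapply(cols, length, integer(1)))
  if (length(n) > 1L) stop("all columns must have the same length")
  structure(cols, class = c("lightweight.data.frame", "list"))
}

# dim() method, so that nrow() and ncol() work on the object.
dim.lightweight.data.frame <- function(x) {
  cols <- unclass(x)
  c(if (length(cols)) length(cols[[1L]]) else 0L, length(cols))
}

# "["(x, i, j): select columns with j, then rows with i, column by column.
"[.lightweight.data.frame" <- function(x, i, j, ...) {
  cols <- unclass(x)
  if (!missing(j)) cols <- cols[j]
  if (!missing(i)) cols <- lapply(cols, function(col) col[i])
  structure(cols, class = class(x))
}

# Row names are computed on demand instead of being stored.
row.names.lightweight.data.frame <- function(x) seq_len(nrow(x))

# Conversion to a full data.frame happens only when explicitly requested.
as.data.frame.lightweight.data.frame <- function(x, ...) {
  as.data.frame(unclass(x), ...)
}

# Usage with made-up values
ldf <- lightweight.data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
sub <- ldf[-1, ]     # drop the first row
dim(sub)
row.names(sub)
```

Row subscripting here is just lapply() over the columns, which is the list operation timed in the P.S.; whether the remaining data.frame behaviour can be matched this cheaply is exactly the open question.
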
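P.S. These are the timing results comparing data.frame operations to
those on lists. Each system.time() line reports user CPU, system CPU,
and elapsed seconds, followed by the child-process user and system times.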
# make a 1e6 * 5 list
> system.time(x <- lapply(seq(5), function(x) rnorm(1e6)))
 4.46 0.10 4.57 0.00 0.00
# convert it to a data.frame
> system.time(y <- as.data.frame(x))
 49.17 1.25 50.61 0.00 0.00
# do an equivalent of x[-1,] on the list
> i <- seq(2, nrow(y)); system.time(x.sub <- lapply(x, function(x) x[i]))
 0.19 0.15 0.35 0.00 0.00
# do an equivalent of x[-1,] on the data.frame
> i <- seq(2, nrow(y)); system.time(y.sub <- y[i,])
 2.08 0.56 2.64 0.00 0.00