[Rd] speeding up perception
Tim Hesterberg
timhesterberg at gmail.com
Mon Jul 4 19:38:31 CEST 2011
I've written a "dataframe" package that replaces existing methods for
data frame creation and subscripting with versions that use less
memory. For example, calling as.data.frame() on a vector makes 4 copies
of the data in R 2.9.2, but only 1 copy with the package. There is also
a small speed gain.
I and others have been using it at Google for some years, and it is time
to either put it on CRAN, or move it into R.
R core folks - would you prefer that this be released to CRAN, or
would you like to consider merging it directly into R?
I took the existing functions and hacked them to reduce the number of
times R copies objects. Some of it is ugly. This could be done more
efficiently, and with cleaner code, with some changes or hooks in R's
internal code, but I'm not prepared to do that.
I often use lists instead of data frames. In another package I have a
'subscriptRows' function that subscripts a list as if it were
a data frame. I could merge that into the dataframe package.
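To illustrate the idea of subscripting a list by rows, here is a minimal
sketch. It is not the 'subscriptRows' function mentioned above, whose
code is not shown in this post; the name subscript_rows_sketch is
hypothetical, and the sketch assumes every list element is a plain
vector of the same length.

```r
# Hypothetical stand-in, NOT the subscriptRows function from the post.
# Assumes each element of x is an atomic vector of equal length.
subscript_rows_sketch <- function(x, i) {
  stopifnot(is.list(x))
  # Apply the same row index to every column; lapply keeps the names.
  lapply(x, function(column) column[i])
}

l <- list(y = c(10, 20, 30, 40), z = c("a", "b", "c", "d"))
sub <- subscript_rows_sketch(l, c(2, 4))
```

Because the result stays a plain list, none of the data frame machinery
(row names, class attributes) is involved.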
Memory use - number of copies made
#                           R 2.9.2    library(dataframe)
# as.data.frame(y)          4          1
# data.frame(y)             8          3
# data.frame(y, z)          8          3
# as.data.frame(l)          10         3
# data.frame(l)             15         5
# d$z <- z                  3,2        1,1
# d[["z"]] <- z             4,3        2,1
# d[, "z"] <- z             6,4,2      2,2,1
# d["z"] <- z               6,5,2      2,2,1
# d["z"] <- list(z=z)       6,3,2      2,2,1
# d["z"] <- Z #list(z=z)    6,2,2      2,1,1
# a <- d["y"]               2          1
# a <- d[, "y", drop=F]     2          1
#
# y and z are vectors, Z and l are lists, and d a data frame.
# Where two numbers are given, they refer to:
#   (copies of the old data frame),
#   (copies of the new column)
# A third number refers to numbers of
#   (copies made of an integer vector of row names)
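
One way to observe duplications like those counted above is base R's
tracemem(), sketched below. This is an assumption about how one might
check, not necessarily how the numbers in the table were obtained.
tracemem() only reports duplications of the traced object, it requires
an R build with memory profiling, and current versions of R may print
fewer lines (possibly none) than R 2.9.2 would have.

```r
y <- rnorm(1e6)

# tracemem() is only available if R was built with memory profiling.
can_trace <- capabilities("profmem")

if (can_trace) invisible(tracemem(y))  # start reporting duplications of y
d <- as.data.frame(y)                  # each "tracemem[...]" line is one copy
if (can_trace) untracemem(y)           # stop tracing
```

The same pattern can be applied to the other rows of the table, for
example by tracing z before running d$z <- z.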
#                      ------- seconds (multiple repetitions) -------
#                      creation/column subscripting    row subscripting
# R 2.9.2            : 34.2 43.9 43.3                  10.6 13.0
# library(dataframe) : 22.5 21.8 21.8                  9.7 9.5 9.5
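
A comparable timing can be sketched with system.time(). The vector
sizes, the repetition count, and the particular operations timed here
are arbitrary assumptions, not the benchmark that produced the numbers
above.

```r
y <- rnorm(1e5)
z <- rnorm(1e5)
reps <- 20  # arbitrary repetition count

# Time repeated data frame creation.
creation <- system.time(for (i in seq_len(reps)) d <- data.frame(y, z))

# Time repeated row subscripting on the resulting data frame.
rows <- sample(length(y), 1000)
row_sub <- system.time(for (i in seq_len(reps)) a <- d[rows, ])

elapsed <- c(creation = creation[["elapsed"]],
             row_subscripting = row_sub[["elapsed"]])
print(elapsed)
```

Running the same script with and without library(dataframe) loaded
would give a before/after comparison on a given version of R.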
I reported one of the simpler hacks to this list earlier, and it
was included in some version of R after 2.9.2, so the current version
of R isn't as bad as 2.9.2.