[Rd] speeding up perception

Tim Hesterberg timhesterberg at gmail.com
Mon Jul 4 19:38:31 CEST 2011


I've written a "dataframe" package that replaces the existing methods for
data frame creation and subscripting with versions that use less
memory.  For example, as.data.frame() applied to a vector makes 4 copies
of the data in R 2.9.2, and 1 copy with the package.  There is also a
small speed gain.

Others and I have been using it at Google for some years, and it is time
to either put it on CRAN or move it into R.

R core folks - would you prefer that this be released to CRAN, or
would you like to consider merging it directly into R?

I took the existing functions and applied some hacks to reduce the number
of times R copies objects.  Some of it is ugly.  This could be done more
efficiently, and with cleaner code, with some changes or hooks in R's
internal code, but I'm not prepared to do that.

I often use lists instead of data frames.  In another package I have a
'subscriptRows' function that subscripts a list as if it were
a data frame.  I could merge that into the dataframe package.
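
To illustrate the idea of subscripting a list as if it were a data frame,
here is a minimal sketch.  It is not the actual 'subscriptRows' code, which
is not shown in this message; the name 'subscriptRowsSketch' and its
arguments are hypothetical, and it assumes every component is either a
vector or a matrix with the same number of rows.

```r
# Hypothetical sketch, not the real 'subscriptRows' implementation.
# Applies the same row index to every component of a list.
subscriptRowsSketch <- function(x, i) {
  stopifnot(is.list(x))
  lapply(x, function(col) {
    # Matrices are subscripted by row; plain vectors by element.
    if (is.matrix(col)) col[i, , drop = FALSE] else col[i]
  })
}

l <- list(y = c(10, 20, 30, 40), z = c("a", "b", "c", "d"))
sub <- subscriptRowsSketch(l, c(2, 4))
```

Because the result stays a plain list, no data frame attributes (such as
row names) need to be created or copied.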

Memory use - number of copies made
#                               R 2.9.2                 library(dataframe)
#       as.data.frame(y)        4                       1
#       data.frame(y)           8                       3
#       data.frame(y, z)        8                       3
#       as.data.frame(l)        10                      3
#       data.frame(l)           15                      5
#       d$z <- z                3,2                     1,1
#       d[["z"]] <- z           4,3                     2,1
#       d[, "z"] <- z           6,4,2                   2,2,1
#       d["z"] <- z             6,5,2                   2,2,1
#       d["z"] <- list(z=z)     6,3,2                   2,2,1
#       d["z"] <- Z #list(z=z)  6,2,2                   2,1,1
#       a <- d["y"]             2                       1
#       a <- d[, "y", drop=F]   2                       1
# y and z are vectors, Z and l are lists, and d is a data frame.
# Where two numbers are given, they are:
#   (copies of the old data frame),
#   (copies of the new column)
# A third number, where given, is the number of
#   (copies made of an integer vector of row names)
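
One way to observe copies like those counted above is tracemem().  This is
only a sketch: the message does not say how the counts in the table were
obtained, tracemem() needs an R build with memory profiling, and the number
of copies reported will depend on the R version.

```r
# Sketch only: assumes tracemem() is how one chooses to observe copies.
y <- rnorm(1e5)
traced <- capabilities("profmem")   # TRUE if R was built with memory profiling

if (traced) invisible(tracemem(y))  # start reporting duplications of y
d <- as.data.frame(y)               # each "tracemem[...]" line printed is one copy
if (traced) untracemem(y)           # stop tracing
```

The same pattern can be used for the assignment cases by tracing d or z
before running, for example, d$z <- z.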

#                      -------  seconds (multiple repetitions) -------
#                      creation/column subscripting     row subscripting
# R 2.9.2            : 34.2 43.9 43.3                   10.6 13.0
# library(dataframe) : 22.5 21.8 21.8                    9.7  9.5  9.5
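
Timings of this kind can be collected with system.time() over repeated
calls.  The sketch below is an assumption about the general approach, not
the benchmark that produced the numbers above; the vector length and the
repetition count 'reps' are arbitrary.

```r
# Sketch of a timing harness; sizes and repetition count are arbitrary.
y <- rnorm(1e5)
z <- rnorm(1e5)
reps <- 20

# Data frame creation, repeated.
timeCreate <- system.time(for (k in seq_len(reps)) d <- data.frame(y, z))

# Row subscripting (every other row), repeated.
rows <- seq(1, nrow(d), by = 2)
timeRows <- system.time(for (k in seq_len(reps)) a <- d[rows, ])

elapsed <- c(create = timeCreate[["elapsed"]], rows = timeRows[["elapsed"]])
```

Running the same script with and without library(dataframe) loaded would
give a before/after comparison on a given R version.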

I reported one of the simpler hacks to this list earlier, and it
was included in a version of R released after 2.9.2, so the current
version of R isn't as bad as 2.9.2.


