[Rd] data frame subset patch, take 2
Martin Maechler
maechler at stat.math.ethz.ch
Tue Dec 12 18:08:01 CET 2006
>>>>> "Marcus" == Marcus G Daniels <mgd at santafe.edu>
>>>>> on Tue, 12 Dec 2006 09:05:15 -0700 writes:
Marcus> Vladimir Dergachev wrote:
>> Here is the second iteration of data frame subset patch.
>> It now passes make check on both 2.4.0 and 2.5.0 (svn as
>> of a few days ago). Same speedup as before.
>>
Marcus> Hi,
Marcus> I was wondering if this patch would make it into the
Marcus> next release. I don't see it in SVN, but it's hard
Marcus> to be sure because the mailing list apparently
Marcus> strips attachments. If it isn't in, or going to be
Marcus> in, is this patch available somewhere else?
I was wondering too.
http://www.r-project.org/mail.html
explains what kind of attachments are allowed on R-devel.
I'm particularly interested, since during the last several days
I've made (somewhat experimental) changes to R-devel
which make some operations on large data frames that have
"trivial rownames" (those represented as 1:nrow(.))
much more efficient.
Notably, as.matrix() of such data frames now no longer produces
huge row names, and e.g. dim(.) of such data frames has become
lightning fast [compared to what it was].
Some measurements:
N <- 1e6
set.seed(1)
## we round (for later dump().. reasons)
x <- round(rnorm(N),2)
y <- round(rnorm(N),2)
mOrig <- cbind(x = x, y = y)
df <- data.frame(x = x, y = y)
mNew <- as.matrix(df)
(sizes <- sapply(list(mOrig=mOrig, df=df, mNew=mNew), object.size))
## R-2.4.0 (64-bit):
##    mOrig       df     mNew
## 16000520 16000776 72000560
## R-2.4.1 beta (32-bit):
##    mOrig       df     mNew
## 16000296 16000448 52000320
## R-pre-2.5.0 (32-bit):
##    mOrig       df     mNew
## 16000296 16000448 16000296
##------------------------------------
N <- 1e6
df <- data.frame(x = 0+ 1:N, y = 1+ 1:N)
system.time(for(i in 1:1000) d <- dim(df))
## R-2.4.1 beta (32-bit) [deb1]:
## [1] 1.920 3.748 7.810 0.000 0.000
## R-pre-2.5.0 (32-bit) [deb1]:
##    user  system elapsed
##   0.012   0.000   0.011
--- --- --- --- --- --- --- --- --- ---
However, currently
df[2,] ## still internally produces the character(1e6) row names!
something I think we should eliminate as well,
i.e., at least make sure that only seq_len(1e6) is internally
produced and not the character vector.
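For illustration, here is a minimal sketch of how the internal
representation of row names can be inspected. It assumes an R version
that provides .row_names_info() (R >= 2.5.0), a helper that is not
mentioned in this post; the smaller N is only there to keep the
example quick.

```r
N <- 1e5
df <- data.frame(x = 0 + 1:N, y = 1 + 1:N)

## a negative value signals "automatic" (compact) row names
info <- .row_names_info(df)

## the internal attribute itself: a length-2 integer vector
## (NA plus the row count) instead of N strings
rn_internal <- .row_names_info(df, 0L)

## attr() expands the compact form to integers, not to characters
rn_attr <- attr(df, "row.names")

## row.names() is what materializes the large character vector
rn_chr <- row.names(df)

print(info)
print(rn_internal)
print(c(integer   = as.numeric(object.size(rn_attr)),
        character = as.numeric(object.size(rn_chr))))
```

Comparing the two sizes printed at the end gives a rough idea of what
is saved whenever the character vector is not created internally.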
Note, however, that some of these changes are backward
incompatible. I do hope that the efficiency gains
for such large data frames are worth some adaptation of
current/old R source code.
Feedback on this topic is very welcome!
Martin