[Rd] data frame subset patch, take 2
Robert Gentleman
rgentlem at fhcrc.org
Tue Dec 12 18:41:50 CET 2006
Hi,
I tried take 1, and it failed. I have been traveling (and, with
Martin's changes, also waiting for things to stabilize), so I have not
yet tried take 2. I will probably do so later this week, and I will
send an email if it goes in.
Anyone wanting to try it and run R through make check and make check-all
is welcome to do so and report success or failure.
best wishes
Robert
Martin Maechler wrote:
>>>>>> "Marcus" == Marcus G Daniels <mgd at santafe.edu>
>>>>>> on Tue, 12 Dec 2006 09:05:15 -0700 writes:
>
> Marcus> Vladimir Dergachev wrote:
> >> Here is the second iteration of data frame subset patch.
> >> It now passes make check on both 2.4.0 and 2.5.0 (svn as
> >> of a few days ago). Same speedup as before.
> >>
> Marcus> Hi,
>
> Marcus> I was wondering if this patch would make it into the
> Marcus> next release. I don't see it in SVN, but it's hard
> Marcus> to be sure because the mailing list apparently
> Marcus> strips attachments. If it isn't in, or going to be
> Marcus> in, is this patch available somewhere else?
>
> I was wondering too.
> http://www.r-project.org/mail.html
> explains what kind of attachments are allowed on R-devel.
>
> I'm particularly interested, since during the last several days
> I've made (somewhat experimental) changes to R-devel,
> which makes some dealings with large data frames that have
> "trivial rownames" (those represented as 1:nrow(.))
> much more efficient.
>
> Notably, as.matrix() of such data frames now no longer produces
> huge row names, and e.g. dim(.) of such data frames has become
> lightning fast [compared to what it was].
>
> Some measurements:
>
> N <- 1e6
> set.seed(1)
> ## we round (for later dump().. reasons)
> x <- round(rnorm(N),2)
> y <- round(rnorm(N),2)
> mOrig <- cbind(x = x, y = y)
> df <- data.frame(x = x, y = y)
> mNew <- as.matrix(df)
> (sizes <- sapply(list(mOrig=mOrig, df=df, mNew=mNew), object.size))
> ## R-2.4.0 (64-bit):
> ##    mOrig       df     mNew
> ## 16000520 16000776 72000560
>
> ## R-2.4.1 beta (32-bit):
> ##    mOrig       df     mNew
> ## 16000296 16000448 52000320
>
> ## R-pre-2.5.0 (32-bit):
> ##    mOrig       df     mNew
> ## 16000296 16000448 16000296
>
> ##------------------------------------
>
> N <- 1e6
> df <- data.frame(x = 0+ 1:N, y = 1+ 1:N)
> system.time(for(i in 1:1000) d <- dim(df))
>
> ## R-2.4.1 beta (32-bit) [deb1]:
> ## [1] 1.920 3.748 7.810 0.000 0.000
>
> ## R-pre-2.5.0 (32-bit) [deb1]:
> ##    user  system elapsed
> ##   0.012   0.000   0.011
>
>
> --- --- --- --- --- --- --- --- --- ---
>
> However, currently
>
> df[2,] ## still internally produces the character(1e6) row names!
>
> something I think we should eliminate as well,
> i.e., at least make sure that only seq_len(1e6) is internally
> produced and not the character vector.
>
> Note however that some of these changes are backward
> incompatible. I do hope that the changes gaining efficiency
> for such large data frames are worth some adaptation of
> current/old R source code.
>
> Feedback on this topic is very welcome!
>
> Martin
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
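For anyone who wants to check the "trivial rownames" behaviour described in the quoted message, here is a minimal sketch. It assumes a recent R release in which .row_names_info() is available and automatic row names are stored compactly; the variable names are only illustrative, and the development version discussed in this thread may have behaved differently.

```r
# Sketch: inspect how a data frame with automatic ("trivial") row names
# stores them. Assumes a recent R release; names here are illustrative.
N <- 1e5
df <- data.frame(x = 0 + 1:N, y = 1 + 1:N)

# Internal representation of the row names (type = 0L returns it as stored).
rn_internal <- .row_names_info(df, type = 0L)

# With type = 1L the count is negative when the row names are automatic.
rn_count <- .row_names_info(df, type = 1L)

# as.matrix() on such a data frame does not attach row names.
m <- as.matrix(df)
matrix_has_rownames <- !is.null(rownames(m))

# Compare memory use of the data frame and the matrix built from it.
sizes <- sapply(list(df = df, m = m), object.size)

print(rn_internal)
print(rn_count)
print(matrix_has_rownames)
print(sizes)
```

If the compact form is in use, rn_internal has length 2 and starts with NA, rn_count is negative, and the matrix carries no row names, so its size stays close to that of the data frame.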
--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org