[Rd] speeding up [.data.frame
Prof Brian Ripley
Mon, 7 Jan 2002 23:06:21 +0000 (GMT)
On Mon, 7 Jan 2002, Warnes, Gregory R wrote:
> > -----Original Message-----
> > From: Prof Brian Ripley [mailto:firstname.lastname@example.org]
> > Subject: Re: [Rd] speeding up [.data.frame
> > On Sat, 5 Jan 2002, Warnes, Gregory R wrote:
> > >
> > > (I'm up too late so this might come through garbled...)
> > >
> > > I've just been doing some bootstrapping on data frames
> > > and I discovered that S-plus 6.0r1 was a *lot* faster
> > > than R 1.3.1 at the task. Splus was completing 100 bootstrap
> > > iterations in about 4 seconds while R was taking
> > > about 15 seconds. However, doing bootstrapping on
> > > equivalent *matrices* R was slightly faster, 1.5 seconds versus 1.86.
> > >
> > > Now, since I'm doing glm's inside the bootstrap, I really
> > > need to use data frames...
> > Why? Surely you should be working at the design matrix
> > level and calling glm.fit directly? Otherwise you are repeating a lot
> > of work for every bootstrap fit.
> Ahh, well, I usually start doing things the easy way and then work towards
> harder (and potentially more efficient) ones. I was just surprised about
> the overhead of bootstrapping a data frame when the operation inside the
> bootstrap was essentially a NOOP.
> Is there example code of calling glm.fit somewhere?
glm! Take a look at package sm too, which has bootstrapping code for
People often don't realise how much work gets done creating model matrices
etc: profiling is often a very sobering exercise.
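
A minimal sketch of the design-matrix approach, assuming simulated
logistic-regression data; the names dat, X, y, B and boot_coef are
illustrative and not taken from the original bootstrap code. The model
matrix is built once, and only glm.fit is called inside the loop:

```r
# Sketch only: simulated data, hypothetical variable names.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 + dat$x1 - 0.5 * dat$x2))

# Build the design matrix and response once, outside the bootstrap loop.
X <- model.matrix(y ~ x1 + x2, data = dat)
y <- dat$y

B <- 100
boot_coef <- matrix(NA_real_, nrow = B, ncol = ncol(X),
                    dimnames = list(NULL, colnames(X)))
for (b in seq_len(B)) {
  idx <- sample.int(n, n, replace = TRUE)     # resample row indices
  # Index the matrix and response directly; no data frame subsetting.
  fit <- glm.fit(X[idx, , drop = FALSE], y[idx], family = binomial())
  boot_coef[b, ] <- fit$coefficients
}

# Bootstrap standard errors of the coefficients
boot_se <- apply(boot_coef, 2, sd)
```

Wrapping the loop in system.time(), or running it under Rprof(), would
show how much of the elapsed time goes to the fit itself rather than to
subsetting and model-frame construction.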
> > BTW, is 11 seconds worth saving? It sounds trivial to me.
> > But if it is,
> > moving to glm.fit looks to me to be the best optimization.
> No, 11 seconds is not worth much work. However, when I do 1000 bootstrap
> samples it just might be 110 seconds and if we throw this inside another
> loop to search over one of the parameters (which it looks like we may have
> to do) this gets multiplied again..
> > > It turns out that one of the reasons S-plus is faster on
> > > data frames is that S-PLUS allows you to turn off checking
> > > for/resolution of duplicate row names in "[.data.frame" by
> > > setting an attribute 'dup.row.names' to any non-NULL value.
> > > Adding an additional argument to R's "[.data.frame" (patch
> > > below) to permit the same optimization and using the argument in my
> > > bootstrap function reduced the elapsed time for R to 8.6 seconds.
> > >
> > > Still, I'm wondering if there are other 'reasonable' changes to
> > > "[.data.frame" that could narrow the gap further...
> > That one is not reasonable in my opinion. It should not be in S-PLUS
> > (the advisory board has discussed its removal, as I recall). Having
> > row names is a fundamental property of data frames. What you and they
> > seem to want is another class which is like data frames but does not
> > require row names, from which data.frame could inherit.
> If there was another class that acted like a data frame but was 'lighter
> weight' that would do for my purpose.
> BTW, are the 'fundamental properties' of various S language objects defined
> anywhere? Surely, some properties are implementation accidents, others are
> implementation choices, and others are 'fundamental' to the behavior of the
No, although the blue and white books are the closest we have to a
definition. In this case, the first line of page 57 of the white book
says row names must be unique. It's exactly like relations (tables) in an
RDBMS: data frames correspond to relations with a primary key, and
database theory explains why that is a key (sorry) property. So I am sure
this was no accident.
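
A small illustration of the uniqueness rule, assuming a current version
of R; the objects df, idx, res and m are hypothetical. Resampling rows
with replacement forces "[.data.frame" to make the repeated row names
unique, which is the work the proposed patch would skip, whereas a
matrix simply keeps the duplicates:

```r
# Sketch only: a toy data frame with explicit row names.
df  <- data.frame(a = 1:3, row.names = c("r1", "r2", "r3"))
idx <- c(1, 1, 2, 3, 3)            # a bootstrap-style sample with repeats

res <- df[idx, , drop = FALSE]
rownames(res)                      # repeated names have been made unique
anyDuplicated(rownames(res)) == 0  # TRUE: the primary key is preserved

m <- as.matrix(df)
rownames(m[idx, , drop = FALSE])   # duplicates are kept for a matrix
```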
Brian D. Ripley, email@example.com
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595