[Rd] speeding up [.data.frame

Warnes, Gregory R gregory_r_warnes@groton.pfizer.com
Mon, 7 Jan 2002 17:01:00 -0500

 >  -----Original Message-----
 >  From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]
 >  Subject: Re: [Rd] speeding up [.data.frame
 >  On Sat, 5 Jan 2002, Warnes, Gregory R wrote:
 >  >
 >  > (I'm up too late so this might come through garbled...)
 >  >
 >  > I've just been doing some bootstrapping on data frames 
 >  > and I discovered that S-PLUS 6.0r1 was a *lot* faster 
 >  > than R 1.3.1 at the task. S-PLUS was completing 100 bootstrap 
 >  > iterations in about 4 seconds while R was taking
 >  > about 15 seconds. However, when bootstrapping 
 >  > equivalent *matrices*, R was slightly faster: 1.5 seconds versus 1.86.
 >  >
 >  > Now, since I'm doing glm's inside the bootstrap, I really 
 >  > need to use data frames...
 >  Why?  Surely you should be working at the design matrix 
 >  level and calling glm.fit directly?   Otherwise you are repeating a lot of
 >  work for every bootstrap fit.

Ahh, well, I usually start doing things the easy way and then work towards
harder (and potentially more efficient) ones.  I was just surprised by
the overhead of bootstrapping a data frame when the operation inside the
bootstrap was essentially a no-op.

Is there example code for calling glm.fit somewhere?
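
For reference, a minimal sketch of the design-matrix approach suggested above might look like the following. The data, the variable names (dat, x1, x2, y), the binomial family, and the number of replicates are all assumptions made up for illustration; nothing here comes from the actual analysis. The design matrix is built once with model.matrix() and only matrix rows are resampled, so "[.data.frame" is never called inside the loop.

```r
# Hypothetical illustration data; none of these names come from the thread.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 + dat$x1 - 0.5 * dat$x2))

# Build the design matrix and the response once, outside the loop.
X <- model.matrix(y ~ x1 + x2, data = dat)
y <- dat$y

B <- 100  # number of bootstrap replicates (assumed)
boot_coef <- matrix(NA_real_, nrow = B, ncol = ncol(X),
                    dimnames = list(NULL, colnames(X)))

for (b in seq_len(B)) {
  idx <- sample.int(n, n, replace = TRUE)
  # Plain matrix and vector indexing: no data frame subsetting involved.
  fit <- glm.fit(X[idx, , drop = FALSE], y[idx], family = binomial())
  boot_coef[b, ] <- fit$coefficients
}

# Sanity check: glm.fit() on the full data should agree with glm().
full_fit <- glm.fit(X, y, family = binomial())
ref_fit <- glm(y ~ x1 + x2, data = dat, family = binomial())
```

Each row of boot_coef holds the coefficients from one bootstrap fit. Whether this closes the timing gap in a given case would have to be measured.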

 >  BTW, is 11 seconds worth saving?  It sounds trivial to me.  
 >  But if it is,
 >  moving to glm.fit looks to me to be the best optimization.

No, 11 seconds is not worth much work.  However, when I do 1000 bootstrap
samples it just might be 110 seconds, and if we throw this inside another
loop to search over one of the parameters (which it looks like we may have
to do) this gets multiplied again.

 >  > It turns out that one of the reasons S-PLUS is faster on 
 >  > data frames is that S-PLUS allows you to turn off checking 
 >  > for/resolution of duplicate row names in "[.data.frame" by 
 >  > setting an attribute 'dup.row.names' to any non-NULL value.  
 >  > Adding an additional argument to R's "[.data.frame"  (patch
 >  > below) to permit the same optimization and using the argument in my
 >  > bootstrap function reduced the elapsed time for R to 8.6 seconds.
 >  >
 >  > Still, I'm wondering if there are other 'reasonable' changes to
 >  > "[.data.frame" that could narrow the gap further...
 >  That one is not reasonable in my opinion.  It should not be in S-PLUS
 >  (the advisory board has discussed its removal, as I recall). Having
 >  row names is a fundamental property of data frames.  What you and they
 >  seem to want is another class which is like data frames but does not
 >  require row names, from which data.frame could inherit.

If there were another class that acted like a data frame but was 'lighter
weight', that would do for my purpose.
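
To make the cost concrete, here is a small sketch of what the row-name handling does and of one possible way around it. The helper resample_columns() is hypothetical and is not an existing R function or a proposal for the new class; it simply treats the data frame as a bare list of columns, which assumes every column is a plain vector or factor (matrix columns would not survive this).

```r
# Toy data frame, made up for illustration.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
idx <- c(1, 1, 3)  # a bootstrap-style index with a repeated row

# "[.data.frame" has to make the repeated row name unique.
dup_names <- rownames(df[idx, ])
print(dup_names)

# Hypothetical helper: resample each column directly and return a plain
# list, so no row names are created or checked.
resample_columns <- function(df, idx) {
  lapply(df, function(col) col[idx])
}

cols <- resample_columns(df, idx)
print(cols$a)
```

The result is a named list rather than a data frame, so it only helps when the code inside the bootstrap can work from columns directly.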

BTW, are the 'fundamental properties' of various S language objects defined
anywhere?  Surely, some properties are implementation accidents, others are
implementation choices, and others are 'fundamental' to the behavior of the

 >  -- 
 >  Brian D. Ripley,                  ripley@stats.ox.ac.uk
 >  Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 >  University of Oxford,             Tel:  +44 1865 272861 (self)
 >  1 South Parks Road,                     +44 1865 272860 (secr)
 >  Oxford OX1 3TG, UK                Fax:  +44 1865 272595
