[Rd] speeding up [.data.frame

Mon, 7 Jan 2002 23:06:21 +0000 (GMT)

On Mon, 7 Jan 2002, Warnes, Gregory R wrote:

>
>  >  -----Original Message-----
>  >  From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]
>  >  Subject: Re: [Rd] speeding up [.data.frame
>  >
>  >
>  >  On Sat, 5 Jan 2002, Warnes, Gregory R wrote:
>  >
>  >  >
>  >  > (I'm up too late so this might come through garbled...)
>  >  >
>  >  > I've just been doing some bootstrapping on data frames
>  >  > and I discovered that S-plus 6.0r1 was a *lot* faster
>  >  > than R 1.3.1 at the task. Splus was completing 100 bootstrap
>  >  > iterations in about 4 seconds while R was taking
>  >  > about 15 seconds. However, doing bootstrapping on
>  >  > equivalent *matrices* R was slightly faster, 1.5 seconds versus 1.86.
>  >  >
>  >  > Now, since I'm doing glm's inside the bootstrap, I really
>  >  > need to use data frames...
>  >
>  >  Why?  Surely you should be working at the design matrix
>  >  level and calling glm.fit directly?  Otherwise you are repeating a lot
>  >  of work for every bootstrap fit.
>
> Ahh, well, I usually start doing things the easy way and then work towards
> harder (and potentially more efficient) ones.  I was just surprised about
> the overhead of bootstrapping a data frame when the operation inside the
> bootstrap was essentially a no-op.
>
> Is there example code of calling glm.fit somewhere?

glm itself!  Take a look at package sm too, which has bootstrapping code
for glms.

People often don't realise how much work gets done creating model matrices
etc: profiling is often a very sobering exercise.
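
A minimal sketch of the design-matrix approach, for illustration only: the
simulated data, the variable names (dat, X, y, coefs) and the logistic model
are assumptions and are not taken from the original bootstrap code.  The
model matrix is built once, and each resample indexes the matrix and calls
glm.fit directly.

```r
set.seed(1)
n <- 200
dat <- data.frame(x = rnorm(n))                      # hypothetical data
dat$y <- rbinom(n, 1, plogis(0.5 + dat$x))

# Build the design matrix and response once, outside the bootstrap loop.
X <- model.matrix(y ~ x, data = dat)
y <- dat$y

B <- 100
coefs <- matrix(NA_real_, nrow = B, ncol = ncol(X),
                dimnames = list(NULL, colnames(X)))
for (b in seq_len(B)) {
  idx <- sample.int(n, n, replace = TRUE)            # resample row indices
  fit <- glm.fit(X[idx, , drop = FALSE], y[idx], family = binomial())
  coefs[b, ] <- fit$coefficients
}

apply(coefs, 2, sd)                                  # bootstrap standard errors
```

Wrapping the loop in Rprof() / summaryRprof() and comparing it with the same
loop written as glm(y ~ x, data = dat[idx, ]) should show where the time goes.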

>  >  BTW, is 11 seconds worth saving?  It sounds trivial to me.
>  >  But if it is,
>  >  moving to glm.fit looks to me to be the best optimization.
>
> No, 11 seconds is not worth much work.  However, when I do 1000 bootstrap
> samples it just might be 110 seconds and if we throw this inside another
> loop to search over one of the parameters (which it looks like we may have
> to do) this gets multiplied again.
>
>  >
>  >  > It turns out that one of the reasons S-plus is faster on
>  >  > data frames is that S-Plus allows you to turn off checking
>  >  > for/resolution of duplicate row names in "[.data.frame" by
>  >  > setting an attribute 'dup.row.names' to any non-NULL value.
>  >  > Adding an additional argument to R's "[.data.frame"  (patch
>  >  > below) to permit the same optimization and using the argument in my
>  >  > bootstrap function reduced the elapsed time for R to 8.6 seconds.
>  >  >
>  >  > Still, I'm wondering if there are other 'reasonable' changes to
>  >  > "[.data.frame" that could narrow the gap further...
>  >
>  >  That one is not reasonable in my opinion.  It should not be in S-PLUS
>  >  (and the advisory board has discussed its removal, as I recall).
>  >  Having unique row names is a fundamental property of data frames.
>  >  What you and they seem to want is another class which is like data
>  >  frames but does not require row names, from which data.frame could
>  >  inherit.
>
> If there was another class that acted like a data frame but was 'lighter
> weight' that would do for my purpose.
>
> BTW, are the 'fundamental properties' of various S language objects defined
> anywhere?  Surely, some properties are implementation accidents, others are
> implementation choices, and others are 'fundamental' to the behavior of the
> object.

No, although the blue and white books are the closest we have to a
definition.  In this case, the first line of page 57 of the white book
says row names must be unique.  It's exactly like relations (tables) in an
RDBMS: data frames correspond to relations with a primary key, and
database theory explains why that is a key (sorry) property.  So I am sure
this was no accident.
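
A small illustration of that property, as a sketch only (the objects df,
boot, m and mb are made up for the example): indexing a data frame with
repeated row indices forces the row names to be made unique, which is the
work a bootstrap resample pays for, while a matrix carries no such
requirement.

```r
df <- data.frame(x = c(10, 20, 30))

# Resampling with replacement repeats row 1; "[.data.frame" makes the
# resulting row names unique rather than keeping the duplicate.
boot <- df[c(1, 1, 2), , drop = FALSE]
rownames(boot)
# "1" "1.1" "2"

# A matrix built without row names stays without them after indexing.
m  <- cbind(x = c(10, 20, 30))
mb <- m[c(1, 1, 2), , drop = FALSE]
rownames(mb)
# NULL
```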

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
