[R] aggregate function - na.action

Matthew Dowle mdowle at mdowle.plus.com
Mon Feb 7 16:22:06 CET 2011

Hi Hadley,

Does FAQ 1.8 answer that OK?
   "Ok, I'm starting to see what data.table is about, but why didn't you
enhance data.frame in R? Why does it have to be a new package?"


"Hadley Wickham" <hadley at rice.edu> wrote in message 
news:AANLkTik180p4YmBtR3QUCW7r=FdeFXZBxSy3zWTiKNNM at mail.gmail.com...
On Mon, Feb 7, 2011 at 5:54 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Looking at the timings by each stage may help:
>
> > system.time(dt <- data.table(dat))
>    user  system elapsed
>    1.20    0.28    1.48
> > # sort by the 8 columns (one-off)
> > system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))
>    user  system elapsed
>    4.72    0.94    5.67
> > system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)),
> +                       by = 'x1, x2, x3, x4, x5, x6, x7, x8'])
>    user  system elapsed
>    2.00    0.21    2.20   # compared to 11.07s
>
> data.table doesn't have a custom data structure, so it can't be that.
> data.table's structure is the same as data.frame, i.e. a list of vectors.
> data.table inherits from data.frame. It *is* a data.frame, too.
>
> The reasons it is faster in this example include:
> 1. Memory is only allocated for the largest group.
> 2. That memory is re-used for each group.
> 3. Since the data is ordered contiguously in RAM, the memory is copied
>    over in bulk for each group using memcpy in C, which is faster than
>    a for loop in C. Page fetches are expensive; they are minimised.
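
The three points above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not data.table's actual source: the function name grouped_sum, the starts offset array, and the choice of summing doubles are all hypothetical. It assumes the column has already been sorted by group (as setkey would do), so each group is one contiguous run, and it does not handle NA values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * grouped_sum: hypothetical helper, NOT taken from data.table.
 *
 * y       : column values, already sorted so each group is contiguous
 * starts  : ngroups + 1 offsets; group g occupies y[starts[g] .. starts[g+1])
 * ngroups : number of groups
 * out     : receives one sum per group
 *
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated.
 */
int grouped_sum(const double *y, const size_t *starts, size_t ngroups,
                double *out)
{
    size_t max_len = 0;

    /* Point 1: size the scratch buffer for the largest group only. */
    for (size_t g = 0; g < ngroups; g++) {
        assert(starts[g] <= starts[g + 1]); /* offsets must not decrease */
        size_t len = starts[g + 1] - starts[g];
        if (len > max_len)
            max_len = len;
    }

    /* No data at all: every (empty) group sums to zero. */
    if (max_len == 0) {
        for (size_t g = 0; g < ngroups; g++)
            out[g] = 0.0;
        return 0;
    }

    double *buf = malloc(max_len * sizeof *buf);
    if (buf == NULL)
        return -1;

    for (size_t g = 0; g < ngroups; g++) {
        size_t len = starts[g + 1] - starts[g];

        /* Points 2 and 3: reuse the same buffer for every group and fill
           it with one bulk memcpy, which is possible because the group's
           rows are contiguous in memory. */
        memcpy(buf, y + starts[g], len * sizeof *buf);

        /* Evaluate the per-group expression on the scratch copy. */
        double total = 0.0;
        for (size_t i = 0; i < len; i++)
            total += buf[i];
        out[g] = total;
    }

    free(buf);
    return 0;
}
```

For a plain sum the copy is of course unnecessary; the scratch buffer stands in for the general case where an arbitrary expression is evaluated on each group's subset of the data.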

But this is exactly what I mean by a custom data structure - you're
not using the usual data frame API.

Wouldn't it be better to implement these changes in data frame so that
everyone can benefit? Or is it just too specialised to this particular
case (where I guess you're relying on the return data structure of the
summary function being consistent)?


Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
