[R] aggregate function - na.action

Hadley Wickham hadley at rice.edu
Mon Feb 7 14:55:52 CET 2011


On Mon, Feb 7, 2011 at 5:54 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Looking at the timings by each stage may help :
>
>>   system.time(dt <- data.table(dat))
>   user  system elapsed
>   1.20    0.28    1.48
>>   system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))   # sort by the
>> 8 columns (one-off)
>   user  system elapsed
>   4.72    0.94    5.67
>>   system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2,
>> x3, x4, x5, x6, x7, x8'])
>   user  system elapsed
>   2.00    0.21    2.20     # compared to 11.07s
>>
>
> data.table doesn't have a custom data structure, so it can't be that.
> data.table's structure is the same as data.frame i.e. a list of vectors.
> data.table inherits from data.frame.  It *is* a data.frame, too.
>
> The reasons it is faster in this example include :
> 1. Memory is only allocated for the largest group.
> 2. That memory is re-used for each group.
> 3. Since the data is ordered contiguously in RAM, the memory is copied over
> in bulk for each group using
> memcpy in C, which is faster than a for loop in C. Page fetches are
> expensive; they are minimised.

But this is exactly what I mean by a custom data structure - you're
not using the usual data frame API.

Wouldn't it be better to implement these changes in data frame so that
everyone can benefit? Or is it just too specialised to this particular
case (where I guess you're relying on the return data structure of the
summary function being consistent)?

Hadley


-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
