[R] aggregate function - na.action

Matthew Dowle mdowle at mdowle.plus.com
Mon Feb 7 12:54:26 CET 2011


Looking at the timings for each stage may help:

>   system.time(dt <- data.table(dat))
   user  system elapsed
   1.20    0.28    1.48
>   system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))   # sort by the 8 columns (one-off)
   user  system elapsed
   4.72    0.94    5.67
>   system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2, x3, x4, x5, x6, x7, x8'])
   user  system elapsed
   2.00    0.21    2.20     # compared to 11.07s
>

data.table doesn't have a custom data structure, so it can't be that.
data.table's structure is the same as data.frame i.e. a list of vectors.
data.table inherits from data.frame.  It *is* a data.frame, too.
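
A quick check in an R session makes this concrete (a minimal sketch; the column names are made up for illustration):

  library(data.table)
  dt <- data.table(a = 1:5, b = letters[1:5])
  class(dt)              # "data.table" "data.frame"  -- it inherits from data.frame
  is.data.frame(dt)      # TRUE
  is.list(dt)            # TRUE -- underneath, just a list of column vectors
  sapply(dt, is.vector)  # each column is an ordinary R vector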

The reasons it is faster in this example include:
1. Memory is only allocated for the largest group.
2. That memory is re-used for each group.
3. Since the data is ordered contiguously in RAM, the memory is copied over in bulk for each group using memcpy in C, which is faster than a for loop in C. Page fetches are expensive; they are minimised. (A small R sketch of the idea follows.)
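
To give a feel for point 3 (this is only an R-level illustration of the idea; the real work happens in data.table's C code), once the table is keyed the rows of each group are adjacent, so a group can be lifted out as one contiguous block:

  library(data.table)
  dt <- data.table(g = c("b", "a", "c", "a", "b", "a"), v = 1:6)
  setkey(dt, g)          # physically reorders the rows by g
  which(dt$g == "a")     # 1 2 3 -- one unbroken run of row numbers
  which(dt$g == "b")     # 4 5   -- again contiguous
  # In C, each such run can be copied with a single memcpy into the working
  # memory for the group, instead of gathering scattered rows in a loop.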

This is explained in the documentation, in particular the FAQs.  This example is quite small, but the concept scales to larger sizes, i.e. the difference widens further as n increases.
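
A minimal sketch of how that scaling could be checked (the data generation and size here are assumptions, not the original poster's data; only the shape of the comparison matters):

  library(data.table)
  n   <- 1e6                                   # increase n to watch the gap widen
  dat <- data.frame(x1 = sample(letters, n, replace = TRUE),
                    x2 = sample(letters, n, replace = TRUE),
                    y  = rnorm(n))
  system.time(aggregate(y ~ x1 + x2, data = dat, FUN = sum))        # base R
  dt <- data.table(dat)
  setkey(dt, x1, x2)
  system.time(dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2'])  # data.table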

http://datatable.r-forge.r-project.org/

Matthew


"Hadley Wickham" <hadley at rice.edu> wrote in message 
news:AANLkTim6DRfJxQRSqLXof1uT6xr_BSHqdbGpktMEDiC- at mail.gmail.com...
>> There's definitely something amiss with aggregate() here since similar
>> functions from other packages can reproduce your 'control' sum. I expect
>> ddply() will have some timing issues because of all the subgrouping in your
>> data frame, but data.table did very well and the summaryBy() function in the
>> doBy package did OK:
>
> Well, if you use the right plyr function, it works just fine:
>
> system.time(count(dat, c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"), "y"))
> #   user  system elapsed
> #  9.754   1.314  11.073
>
> Which illustrates something that I've believed for a while about
> data.table - it's not the indexing that speeds things up, it's the
> custom data structure.  If you use ddply with data frames, it's slow
> because data frames are slow.  I think the right way to resolve this
> is to make data frames more efficient, perhaps using some kind of
> mutable interface where necessary for high-performance operations.
>
> Hadley
>
> -- 
> Assistant Professor / Dobelman Family Junior Chair
> Department of Statistics / Rice University
> http://had.co.nz/
>


