[R] aggregate function - na.action
David Winsemius
dwinsemius at comcast.net
Mon Feb 7 02:43:01 CET 2011
On Feb 6, 2011, at 7:41 PM, Hadley Wickham wrote:
>> There's definitely something amiss with aggregate() here since
>> similar
>> functions from other packages can reproduce your 'control' sum. I
>> expect
>> ddply() will have some timing issues because of all the subgrouping
>> in your
>> data frame, but data.table did very well and the summaryBy()
>> function in the
>> doBy package did OK:
>
> Well, if you use the right plyr function, it works just fine:
>
> system.time(count(dat, c("x1", "x2", "x3", "x4", "x5", "x6",
> "x7", "x8"), "y"))
> # user system elapsed
> # 9.754 1.314 11.073
>
> Which illustrates something that I've believed for a while about
> data.table - it's not the indexing that speeds things up, it's the
> custom data structure. If you use ddply with data frames, it's slow
> because data frames are slow. I think the right way to resolve this
> is to make data frames more efficient, perhaps using some kind of
> mutable interface where necessary for high-performance operations.
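As a small sketch of the comparison (on synthetic data, assuming the data.table package is installed): both calls below compute the same grouped sums, but data.table works on its own structure rather than repeatedly subsetting a data.frame, which is where aggregate() and ddply() lose time.

```r
library(data.table)

set.seed(1)
dat <- data.frame(x1 = sample(letters[1:3], 1e4, replace = TRUE),
                  x2 = sample(letters[1:3], 1e4, replace = TRUE),
                  y  = runif(1e4))

# base R: splits into a per-group data.frame before summing
res1 <- aggregate(y ~ x1 + x2, data = dat, FUN = sum)

# data.table: grouped sum on its own custom structure
dt   <- as.data.table(dat)
res2 <- dt[, .(y = sum(y)), by = .(x1, x2)]
```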
Data frames are also "fat". Simply adding a single new column to a
dataset bordering on "large" (5 million rows by 200 columns) requires
more than twice the memory of the full data frame. (Paging ensues on
a Mac with 24 GB.) Unless, of course, there is a more memory-efficient
strategy than:
dfrm$newcol <- with(dfrm, func(variables))
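One such strategy (again assuming data.table) is the `:=` operator, which adds the column by reference instead of copying the whole object, as the base-R assignment above can:

```r
library(data.table)

dt <- data.table(a = 1:5, b = 6:10)

# ':=' modifies dt in place -- no copy of the full table is made,
# unlike dfrm$newcol <- ... on a large data.frame
dt[, newcol := a + b]
```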
The table() operation, on the other hand, is blazingly fast and
requires practically no memory.
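For instance, a cross-tabulation of a million values returns almost instantly, since table() only has to allocate the (small) contingency table itself:

```r
set.seed(1)
x <- sample(c("a", "b", "c"), 1e6, replace = TRUE)

# result is a tiny named vector of counts, not a copy of the data
tab <- table(x)
```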
--
David Winsemius, MD
West Hartford, CT