[R] aggregate function - na.action
David Winsemius
dwinsemius at comcast.net
Mon Feb 7 02:43:01 CET 2011
On Feb 6, 2011, at 7:41 PM, Hadley Wickham wrote:
>> There's definitely something amiss with aggregate() here since
>> similar
>> functions from other packages can reproduce your 'control' sum. I
>> expect
>> ddply() will have some timing issues because of all the subgrouping
>> in your
>> data frame, but data.table did very well and the summaryBy()
>> function in the
>> doBy package did OK:
>
> Well, if you use the right plyr function, it works just fine:
>
> system.time(count(dat, c("x1", "x2", "x3", "x4", "x5", "x6",
> "x7", "x8"), "y"))
> # user system elapsed
> # 9.754 1.314 11.073
>
> Which illustrates something that I've believed for a while about
> data.table - it's not the indexing that speeds things up, it's the
> custom data structure. If you use ddply with data frames, it's slow
> because data frames are slow. I think the right way to resolve this
> is to make data frames more efficient, perhaps using some kind of
> mutable interface where necessary for high-performance operations.
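As a small sketch of the comparison (on synthetic data, assuming the data.table package is installed): both calls below compute the same grouped sums, but data.table works on its own structure rather than repeatedly subsetting a data.frame, which is where aggregate() and ddply() lose time.

```r
library(data.table)

set.seed(1)
dat <- data.frame(x1 = sample(letters[1:3], 1e4, replace = TRUE),
                  x2 = sample(letters[1:3], 1e4, replace = TRUE),
                  y  = runif(1e4))

# base R: splits into a per-group data.frame before summing
res1 <- aggregate(y ~ x1 + x2, data = dat, FUN = sum)

# data.table: grouped sum on its own custom structure
dt   <- as.data.table(dat)
res2 <- dt[, .(y = sum(y)), by = .(x1, x2)]
```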
Data frames are also "fat". Simply adding a single new column to a
dataset bordering on "large" (5 million rows by 200 columns) requires
more than twice the memory of the full data frame. (Paging ensues on
a Mac with 24 GB.) Unless, of course, there is a more memory-efficient
strategy than:
dfrm$newcol <- with(dfrm, func(variables))
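One such strategy (again assuming data.table) is the `:=` operator, which adds the column by reference instead of copying the whole object, as the base-R assignment above can:

```r
library(data.table)

dt <- data.table(a = 1:5, b = 6:10)

# ':=' modifies dt in place -- no copy of the full table is made,
# unlike dfrm$newcol <- ... on a large data.frame
dt[, newcol := a + b]
```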
The table() operation, on the other hand, is blazingly fast and
requires practically no memory.
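For instance, a cross-tabulation of a million values returns almost instantly, since table() only has to allocate the (small) contingency table itself:

```r
set.seed(1)
x <- sample(c("a", "b", "c"), 1e6, replace = TRUE)

# result is a tiny named vector of counts, not a copy of the data
tab <- table(x)
```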
--
David Winsemius, MD
West Hartford, CT