[R] aggregate() runs out of memory

Mon Nov 26 23:59:27 CET 2012

Hi,

> * Steve Lianoglou <znvyvatyvfg.ubarlcbg at tznvy.pbz> [2012-11-26 17:32:21 -0500]:
>
>> --8<---------------cut here---------------start------------->8---
>>> f <- data.frame(id=rep(1:3,4),country=rep(6:8,4),delay=1:12)
>>> f
>>    id country delay
>> 1   1       6     1
>> 2   2       7     2
>> 3   3       8     3
>> 4   1       6     4
>> 5   2       7     5
>> 6   3       8     6
>> 7   1       6     7
>> 8   2       7     8
>> 9   3       8     9
>> 10  1       6    10
>> 11  2       7    11
>> 12  3       8    12
>>> f <- as.data.table(f)
>>> setkey(f,id)
>>> delays <- f[,list(min=min(delay),max=max(delay),count=.N,country=unique(country)),by="id"]
>>> delays
>>    id min max count country
>> 1:  1   1  10     4       6
>> 2:  2   2  11     4       7
>> 3:  3   3  12     4       8
>> --8<---------------cut here---------------end--------------->8---
>>
>> this is still too slow, apparently because of unique.
>> how do I speed it up?
>
> I think I'm missing something.
>
> Your call to `min(delay)` and `max(delay)` will return the minimum and
> maximum delays within the particular "id" you are grouping by. I guess
> there must be several values for "country" within each "id" group --
> do you really want the same min and max values to be replicated as
> many times as there are unique "country"s?

There is precisely one country for each id,
i.e., unique(country) is the same as country[1].
Thanks a lot for the suggestion!

> R> result <- f[, list(min=min(delay), max=max(delay),
> count=.N,country=country[1L]), by="share.id"]
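
For the archives, here is a minimal sketch of the two variants on the toy data from earlier in the thread. It assumes the data.table package is installed and uses the toy column "id" rather than "share.id"; the names `slow` and `fast` are just illustrative. Taking country[1L] is only valid because country is constant within each id, so the sketch checks that assumption first.

```r
library(data.table)

# Toy data from the example above: exactly one country per id.
f <- data.table(id = rep(1:3, 4), country = rep(6:8, 4), delay = 1:12)
setkey(f, id)

# Sanity check of the assumption: every id maps to a single country.
stopifnot(all(f[, length(unique(country)), by = "id"]$V1 == 1L))

# Original version: unique() is evaluated once per group.
slow <- f[, list(min = min(delay), max = max(delay), count = .N,
                 country = unique(country)), by = "id"]

# Suggested version: take the first element of each group instead.
fast <- f[, list(min = min(delay), max = max(delay), count = .N,
                 country = country[1L]), by = "id"]

# Both variants give the same table on this data.
stopifnot(isTRUE(all.equal(slow, fast)))
```

If the sanity check fails on real data, the two variants are no longer interchangeable: unique(country) would return several rows per id, while country[1L] would silently keep only the first.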

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000