[R] aggregate() runs out of memory

Mon Nov 26 23:32:21 CET 2012

Hi,

On Mon, Nov 26, 2012 at 4:57 PM, Sam Steingold <sds at gnu.org> wrote:
[snip]
>> Could you please copy paste the output of `(head(infl, 20))` as
>> well as an approximation of what the result is that you want.

I don't know how "dput" got clipped from the quoted text in your
reply, but I actually asked for `dput(head(infl, 20))`.

The dput makes a world of difference because I can easily copy/paste
the output into R and get a working table.
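
To illustrate why that helps: dput output is plain R code that
rebuilds the object when evaluated. A minimal sketch, using a made-up
data frame `d` as a stand-in for your data:

```r
# Hypothetical toy data standing in for head(infl, 20)
d <- data.frame(id = 1:3, country = factor(c("US", "DE", "US")))

# dput() writes an R expression that recreates the object
txt <- capture.output(dput(d))

# Pasting that text into another session amounts to evaluating it
d2 <- eval(parse(text = txt))
all.equal(d, d2)
```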

> this prints all the levels for all the factor columns and takes
> megabytes.

Try using droplevels, e.g.:

R> dput(droplevels(head(infl, 20)))
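
A small sketch of the effect, with a hypothetical data frame `x`
(the names and sizes are made up for illustration): head() keeps
every level of the original factor, while droplevels() keeps only
those that actually occur in the subset.

```r
# Hypothetical data: a factor with many levels, few of them in the first rows
x <- data.frame(g = factor(sprintf("level%03d", 1:500)), v = 1:500)
h <- head(x, 20)

nlevels(h$g)               # head() keeps every level of the original factor
nlevels(droplevels(h)$g)   # only the levels present in these rows remain
```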

> --8<---------------cut here---------------start------------->8---
>> f <- data.frame(id=rep(1:3,4),country=rep(6:8,4),delay=1:12)
>> f
>    id country delay
> 1   1       6     1
> 2   2       7     2
> 3   3       8     3
> 4   1       6     4
> 5   2       7     5
> 6   3       8     6
> 7   1       6     7
> 8   2       7     8
> 9   3       8     9
> 10  1       6    10
> 11  2       7    11
> 12  3       8    12
>> f <- as.data.table(f)
>> setkey(f,id)
>> delays <- f[,list(min=min(delay),max=max(delay),count=.N,country=unique(country)),by="id"]
>> delays
>    id min max count country
> 1:  1   1  10     4       6
> 2:  2   2  11     4       7
> 3:  3   3  12     4       8
> --8<---------------cut here---------------end--------------->8---
>
> this is still too slow, apparently because of unique.
> how do I speed it up?

I think I'm missing something.

Your call to `min(delay)` and `max(delay)` will return the minimum and
maximum delays within the particular "id" you are grouping by. I guess
there must be several values for "country" within each "id" group --
do you really want the same min and max values to be replicated as
many times as there are unique "country"s?

Do you perhaps want to iterate over a combo of id and country?
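
To make the difference concrete, here is a sketch on a made-up table
`g` (hypothetical data, not yours) in which id 1 has two countries.
Using `unique(country)` inside the per-id summary repeats that id's
min/max/count once per country, whereas grouping by both columns
computes the statistics per (id, country) pair:

```r
library(data.table)

# Hypothetical data: id 1 appears with two different countries
g <- data.table(id      = c(1L, 1L, 1L, 2L, 2L),
                country = c(6L, 7L, 6L, 8L, 8L),
                delay   = 1:5)

# Per-id statistics, replicated for every distinct country of that id
by_id <- g[, list(min = min(delay), max = max(delay), count = .N,
                  country = unique(country)), by = "id"]

# Statistics computed separately for each (id, country) combination
by_pair <- g[, list(min = min(delay), max = max(delay), count = .N),
             by = c("id", "country")]

by_id
by_pair
```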

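Anyway: if you take the first country in each group instead of calling
`unique` inside your calculation, I guess it goes significantly faster
(this only gives the same answer when each id maps to a single
country), like so:

R> result <- f[, list(min=min(delay), max=max(delay), count=.N,
                      country=country[1L]), by="id"]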
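
As a sanity check, here is a sketch on the toy table from your
message, where every id has exactly one country. Under that
assumption the `unique(country)` and `country[1L]` versions should
agree; the column name `id` is taken from the toy example rather than
from your real data.

```r
library(data.table)

# The toy table from the quoted example
f <- data.table(id = rep(1:3, 4), country = rep(6:8, 4), delay = 1:12)

# Original version: unique() evaluated inside every group
with_unique <- f[, list(min = min(delay), max = max(delay), count = .N,
                        country = unique(country)), by = "id"]

# Cheaper version: just take the first country of each group
with_first <- f[, list(min = min(delay), max = max(delay), count = .N,
                       country = country[1L]), by = "id"]

all.equal(as.data.frame(with_unique), as.data.frame(with_first))
```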

If that isn't acceptable, and you really do want one row per distinct
country within each id (or at least, that's how I'm reading your
code), I wonder if this two-step approach would be faster?

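R> setkeyv(f, c('id', 'country'))
R> r1 <- f[, list(min=min(delay), max=max(delay), count=.N), by='id']
R> ## join on "id" only: r1 has no key, so an unqualified join would
R> ## match its leading columns against both key columns of f
R> result <- unique(f, by=key(f))[r1, on='id']  ## I think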
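
A sketch of how one could check that the two-step route gives the
same rows as `unique` inside the grouping, on a made-up table `g`
(hypothetical data). It deduplicates only the (id, country) columns
and joins with an explicit `on=`, which assumes a data.table version
recent enough to support `on=`. Whether it is actually faster on your
data is something you would have to time yourself.

```r
library(data.table)

# Hypothetical data: id 1 appears with two different countries
g <- data.table(id      = c(1L, 1L, 1L, 2L, 2L),
                country = c(6L, 7L, 6L, 8L, 8L),
                delay   = 1:5)

# One step: unique() evaluated inside every group
one_step <- g[, list(min = min(delay), max = max(delay), count = .N,
                     country = unique(country)), by = "id"]

# Two steps: per-id statistics first, then attach the distinct pairs
r1       <- g[, list(min = min(delay), max = max(delay), count = .N),
              by = "id"]
pairs    <- unique(g[, list(id, country)])
two_step <- pairs[r1, on = "id"]

# Same column order as the one-step result, for comparison
setcolorder(two_step, names(one_step))
all.equal(as.data.frame(one_step), as.data.frame(two_step))
```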

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact