[R] aggregate() runs out of memory
Steve Lianoglou
mailinglist.honeypot at gmail.com
Mon Nov 26 23:32:21 CET 2012
Hi,
On Mon, Nov 26, 2012 at 4:57 PM, Sam Steingold <sds at gnu.org> wrote:
[snip]
>> Could you please copy paste the output of `(head(infl, 20))` as
>> well as an approximation of what the result is that you want.
I don't know how "dput" got clipped from the quoted text in your reply,
but I actually asked for `dput(head(infl, 20))`.
The dput makes a world of difference because I can easily copy/paste
the output into R and get a working table.
> this prints all the levels for all the factor columns and takes
> megabytes.
Try using droplevels, e.g.:
R> dput(droplevels(head(infl, 20)))
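To illustrate why droplevels helps, here is a minimal sketch. The table `infl_demo` and its columns are made up for illustration (I don't have your `infl`); the point is only that head() keeps every factor level, while droplevels() keeps just the ones still in use:

```r
# Hypothetical stand-in for `infl`: one factor column with many levels.
infl_demo <- data.frame(
  id    = factor(sprintf("id%05d", 1:10000)),
  delay = seq_len(10000)
)

small <- head(infl_demo, 20)

# head() subsets the rows but keeps all of the original levels,
# which is what bloats the dput() output.
nlevels(small$id)

# droplevels() discards the levels that no longer occur in the subset.
nlevels(droplevels(small)$id)

# This is now short enough to paste into an email.
dput(droplevels(small))
```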
> --8<---------------cut here---------------start------------->8---
>> f <- data.frame(id=rep(1:3,4),country=rep(6:8,4),delay=1:12)
>> f
>    id country delay
> 1   1       6     1
> 2   2       7     2
> 3   3       8     3
> 4   1       6     4
> 5   2       7     5
> 6   3       8     6
> 7   1       6     7
> 8   2       7     8
> 9   3       8     9
> 10  1       6    10
> 11  2       7    11
> 12  3       8    12
>> f <- as.data.table(f)
>> setkey(f,id)
>> delays <- f[,list(min=min(delay),max=max(delay),count=.N,country=unique(country)),by="id"]
>> delays
>    id min max count country
> 1:  1   1  10     4       6
> 2:  2   2  11     4       7
> 3:  3   3  12     4       8
> --8<---------------cut here---------------end--------------->8---
>
> this is still too slow, apparently because of unique.
> how do I speed it up?
I think I'm missing something.
Your call to `min(delay)` and `max(delay)` will return the minimum and
maximum delays within the particular "id" you are grouping by. I guess
there must be several values for "country" within each "id" group --
do you really want the same min and max values to be replicated as
many times as there are unique "country"s?
Do you perhaps want to iterate over a combo of id and country?
Anyway: if you don't use `unique` inside your calculation, I guess it
goes significantly faster, like so:
R> result <- f[, list(min=min(delay), max=max(delay),
                      count=.N, country=country[1L]), by="share.id"]
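As a sanity check, here is a sketch of the same idea on the toy table from your message, where the grouping column is called `id` (I'm assuming the real data uses `share.id` in its place). It only holds when each group has exactly one country, as it does in the toy data:

```r
library(data.table)

# Toy table from the quoted example.
f <- data.table(id = rep(1:3, 4), country = rep(6:8, 4), delay = 1:12)
setkey(f, id)

# Original version: unique() is evaluated once per group.
with_unique <- f[, list(min = min(delay), max = max(delay),
                        count = .N, country = unique(country)),
                 by = "id"]

# Cheaper version: just take the first country seen in each group.
with_first <- f[, list(min = min(delay), max = max(delay),
                       count = .N, country = country[1L]),
                by = "id"]

# The two agree here because every id maps to a single country.
all.equal(with_unique, with_first)
```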
If that speed is bearable, and you really do want the result the way
you suggest (or, at least, the way I'm interpreting it), I wonder
whether this two-step approach would be faster:
R> setkeyv(f, c('share.id', 'country'))
R> r1 <- f[, list(min=min(delay), max=max(delay), count=.N), by='share.id']
R> result <- unique(f)[r1] ## I think
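Spelled out on the toy table, the two-step idea might look like the sketch below. I'm assuming a reasonably recent data.table here: the `on=` argument is used so the join doesn't depend on keys, and unique() is applied to just the two columns of interest so it doesn't depend on how unique() treats keyed tables. The column is `id` in the toy data rather than `share.id`.

```r
library(data.table)

# Toy table from the quoted example.
f <- data.table(id = rep(1:3, 4), country = rep(6:8, 4), delay = 1:12)

# Step 1: per-id summary, with no unique() inside the grouped call.
r1 <- f[, list(min = min(delay), max = max(delay), count = .N), by = "id"]

# Step 2: the distinct (id, country) pairs, computed once over the table.
pairs <- unique(f[, list(id, country)])

# Join the summary onto the pairs; an id with several countries would
# get its min/max/count repeated once per country.
result <- pairs[r1, on = "id"]
result
```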
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact