[R] aggregate() runs out of memory

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri Sep 14 23:30:59 CEST 2012


Hi,

On Fri, Sep 14, 2012 at 4:26 PM, Dennis Murphy <djmuser at gmail.com> wrote:
> Hi:
>
> This should give you some idea of what Steve is talking about:
>
> library(data.table)
> dt <- data.table(x = sample(100000, 10000000, replace = TRUE),
>                   y = rnorm(10000000), key = "x")
> dt[, .N, by = x]
> system.time(dt[, .N, by = x])
>
> ...on my system, dual core 8 GB RAM running Win7 64-bit,
>> system.time(dt[, .N, by = x])
>    user  system elapsed
>    0.12    0.02    0.14
>
> .N is an optimized function to find the number of rows of each data subset.
> Much faster than aggregate(). It might take a little longer because you
> have more columns that suck up space, but you get the idea. It's also about
> 5-6 times faster if you set a key variable in the data table than if you
> don't.

Well done, sir! (One slight critique: .N isn't a function, it's just an
integer variable holding the number of rows in the current group, and it
is reset for each by-subset/group.)
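
To make that concrete, here is a minimal sketch on a tiny made-up table
(the data and the column names x and y are only for illustration). It
shows .N being used like any other integer inside j:

```r
library(data.table)

# Small made-up table, purely for illustration
dt <- data.table(x = c("a", "a", "b", "c", "c", "c"), y = 1:6)

# .N holds the number of rows in the current group, so a bare .N in j
# gives one count per value of x
counts <- dt[, .N, by = x]
print(counts)

# Because it is an ordinary integer, it can be mixed into other
# expressions, e.g. to pick the last y of each group
summary_dt <- dt[, list(n = .N, mean_y = mean(y), last_y = y[.N]), by = x]
print(summary_dt)
```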

Also, don't forget to use the .SDcols parameter in [.data.table if you
plan on using only a subset of the columns inside your "by" stuff (it
limits which columns end up in .SD for each group).
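
A rough sketch of what that looks like, again on made-up data (the
columns g, v1, v2 and other are hypothetical names, not from the
original poster's data):

```r
library(data.table)

# Made-up table: one grouping column, two numeric columns, one extra column
dt <- data.table(g = rep(c("a", "b"), each = 3),
                 v1 = 1:6, v2 = 7:12, other = letters[1:6])

# Only v1 and v2 are gathered into .SD for each group;
# 'other' is never copied into the per-group subset
sums <- dt[, lapply(.SD, sum), by = g, .SDcols = c("v1", "v2")]
print(sums)
```

With many wide columns, restricting .SD this way should cut down on the
per-group work, though how much depends on the data.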

There's lots of documentation in the package (`?data.table`) and the
vignettes/FAQ to help you tweak your usage, if you decide to take the
data.table route.
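
For the original question, the aggregate() call might translate to
something like the sketch below. The small Z here is only a stand-in for
the real 10-million-row data.frame; it assumes, as in the original post,
that V2 is a character id column.

```r
library(data.table)

# Tiny stand-in for the real Z (made-up values)
Z <- data.frame(V1 = c("p", "q", "r", "s", "t", "u"),
                V2 = c("id1", "id1", "id2", "id3", "id3", "id3"),
                stringsAsFactors = FALSE)

ZT <- as.data.table(Z)

# Number of rows per id, then the distribution of those counts;
# intended to match table(aggregate(Z$V1, FUN = length, by = list(id = Z$V2))$x)
per_id <- ZT[, .N, by = V2]
freq <- table(per_id$N)
print(freq)

# Base R cross-check: counting the counts directly
base_freq <- table(table(Z$V2))
```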

HTH,
-steve

>
> Dennis
>
> On Fri, Sep 14, 2012 at 12:26 PM, Sam Steingold <sds at gnu.org> wrote:
>
>> I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17
>> columns).
>> I want to get the result of
>> table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
>> alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is
>> 24.3G, and no end in sight.
>> both V1 and V2 are characters (not factors).
>> Is there anything I could do to speed this up?
>> Thanks.
>>
>> --
>> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X
>> 11.0.11103000
>> http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/
>> http://dhimmi.com http://think-israel.org http://iris.org.il
>> WinWord 6.0 UNinstall: Not enough disk space to uninstall WinWord
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



