[R] Using plyr::dply more (memory) efficiently?

Thu Apr 29 17:12:22 CEST 2010

Hi Matthew,

On Thu, Apr 29, 2010 at 9:52 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> I don't know about that,  but try this :
>
> install.packages("data.table", repos="http://R-Forge.R-project.org")
> require(data.table)
> summaries = data.table(summaries)
> summaries[,sum(counts),by=symbol]
>
> Please let us know if that returns the correct result,  and if its
> memory/speed is ok ?

Thanks for directing me to the data.table package. I read through some
of the vignettes, and it looks quite nice.

While your sample code would provide the answer if I just wanted to
compute a single summary statistic/function over groups of my data.frame
(using `by=symbol`), what's the best way to produce several pieces of
info per subset?

For instance, I see that I can do something like this:

summaries[, list(counts=sum(counts), width=sum(exon.width)), by=symbol]
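
To make that concrete, here is a minimal sketch of the same call on toy data. The column names follow the ones used in this thread, but the values are invented purely for illustration:

```r
library(data.table)

# Hypothetical toy data: column names as in the thread, values made up.
summaries <- data.table(
  symbol     = c("A", "A", "B", "B", "B"),
  counts     = c(1, 2, 3, 4, 5),
  exon.width = c(10, 20, 30, 40, 50)
)

# Each named element of the list becomes one output column;
# the expression is evaluated once per group defined by `by`.
res <- summaries[, list(counts = sum(counts), width = sum(exon.width)), by = symbol]
print(res)
```

This gives one row per symbol, with the grouping column first, followed by the named list elements.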

But what if I need to do some more complex processing within the
subsets defined in `by=symbol` -- say, several lines of programming
logic to produce one result?

I guess I can open a new block that just returns a data.table? Like:

summaries[, {
  cnts <- sum(counts)
  ew <- sum(exon.width)
  # ... some complex things
  complex <- NA  # placeholder for the result of the complex things
  data.table(counts=cnts, width=ew, cplx=complex)
}, by=symbol]

Is that right? (I mean, it looks like it's working, but maybe there's
a more idiomatic way(?))
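
For comparison, a variant of the block that returns a plain list() instead of a data.table also seems to be accepted in `j`, and it avoids constructing a data.table for every group. This is only a sketch on invented toy data, and the `cplx` computation is a hypothetical stand-in for the "complex things":

```r
library(data.table)

# Hypothetical toy data: column names as in the thread, values made up.
summaries <- data.table(
  symbol     = c("A", "A", "B", "B", "B"),
  counts     = c(1, 2, 3, 4, 5),
  exon.width = c(10, 20, 20, 20, 20)
)

res <- summaries[, {
  cnts <- sum(counts)
  ew   <- sum(exon.width)
  # Stand-in for the multi-line logic: counts per unit of exon width.
  cplx <- cnts / ew
  # The last expression in the block is the per-group result;
  # a named list yields one output column per element.
  list(counts = cnts, width = ew, cplx = cplx)
}, by = symbol]
print(res)
```

The intermediate variables (`cnts`, `ew`, `cplx`) exist only while `j` is being evaluated; only the elements of the final list end up as columns in the result.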

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact