[BioC] tapply for enormous (>2^31 row) matrices
Steve Lianoglou
mailinglist.honeypot at gmail.com
Wed Feb 22 01:10:02 CET 2012
Hi,
On Tue, Feb 21, 2012 at 6:11 PM, Matthew Keller <mckellercran at gmail.com> wrote:
> Hello all,
>
> I just sent this to the main R forum, but realized this audience might
> have more familiarity with this type of problem...
If you're determined to do this in R, I'd split your file into a few
smaller ones (you can even use the *nix `split` command), do your
group-by-and-summarize on the smaller files and in different R
processes, then summarize your summaries (sounds like a job for
hadoop, no?)
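If it helps, here's a minimal sketch of that split / group-by / summarize-the-summaries workflow in plain shell. The file name `pairs.txt`, the four-column whitespace layout (ID1 ID2 XX score), and the tiny 2-line chunk size are all assumptions for illustration -- adjust to your actual data:

```shell
# Toy input, assumed layout: ID1 ID2 XX score
printf 'a b x 1\na b x 2\nc d x 5\n' > pairs.txt

# Split into fixed-size chunks (use something like -l 1000000 for real data):
split -l 2 pairs.txt chunk_

# Group by (ID1, ID2) and sum score within each chunk:
for f in chunk_*; do
  awk '{ s[$1 FS $2] += $4 } END { for (k in s) print k, s[k] }' "$f" > "$f.sum"
done

# Summarize the summaries: the same aggregation over the partial sums.
# This is safe even if a (ID1, ID2) group was split across chunks,
# because summing is associative.
cat chunk_*.sum \
  | awk '{ s[$1 FS $2] += $3 } END { for (k in s) print k, s[k] }' > total.sum
cat total.sum
```

In a real run you'd do the per-chunk step in separate R (or awk) processes in parallel, then only the small `.sum` files ever need to live in one place.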
For your `tapply` functionality, I'd look to the data.table package --
it has super-fast group-by mojo, and tries to be as memory efficient
as possible.
Assuming you can get (a subset of) your data into a data.frame `df`,
and that your column names are something like c("ID1", "ID2", "XX",
"score"), you'd then:
R> library(data.table)
R> df <- as.data.table(df) ## makes a copy
R> setkeyv(df, c("ID1", "ID2")) ## no copy
R> ans <- df[, list(shared=sum(score)), by=key(df)] ## sum of score per (ID1, ID2) group
Summarizing the results from separate processes will be trivial.
Loading your data into a data.frame to start with, however, will
likely take painfully long.
HTH,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact