[BioC] tapply for enormous (>2^31 row) matrices

Wed Feb 22 00:11:09 CET 2012

Hello all,

I just sent this to the main R forum, but realized this audience might
have more familiarity with this type of problem...

---------- Forwarded message ----------
From: Matthew Keller <mckellercran at gmail.com>
Date: Tue, Feb 21, 2012 at 4:04 PM
Subject: tapply for enormous (>2^31 row) matrices
To: r help <r-help at r-project.org>

Hi all,

SETUP:
I have pairwise data on 22 chromosomes. Data matrix X for a given
chromosome looks like this:

1 13 58 1.12
6 142 56 1.11
18 307 64 3.13
22 320 58 0.72

Where column 1 is person ID 1, column 2 is person ID 2, column 3 can
be ignored, and column 4 is how much chromosomal sharing those two
individuals have in some small portion of the chromosome. There are
9000 individual people, and therefore ~ (9000^2)/2 pairwise matches at
each small location on the chromosome, so across an entire chromosome,
these matrices are VERY large (e.g., 3 billion rows, which exceeds R's
2^31 - 1 limit on vector length). I have access to a server with 64-bit
R, 1TB RAM and 80 processors.

PROBLEM:
A pair of individuals (e.g., persons 1 and 13 from the first row above)
will show up multiple times in a given file. I want to sum column 4
across each pair of individuals. If I could bring the matrix into R, I
could use tapply() to accomplish this by indexing on
paste(X[,1],X[,2]), but the matrix doesn't fit into R. I have been
trying to use bigmemory and bigtabulate packages in R, but when I try
to use the bigsplit function, R never completes the operation (after a
day, I killed the process). In particular, I did this:

library(bigmemory)    # provides read.big.matrix()
library(bigtabulate)  # provides bigsplit()

X <- read.big.matrix("file.loc.X", sep = " ", type = "double")
# This runs for too long to be useful on these matrices:
hap.indices <- bigsplit(X, 1:2)
# I was then going to use a foreach loop to sum across the splits
# identified by bigsplit
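
To make the goal concrete, here is a minimal sketch of the tapply()
approach on a toy matrix that does fit in memory. The first four rows
are the ones shown above; the last two are made-up repeats added only
for illustration.

```r
# Toy stand-in for X (the real matrix is far too large for this)
X <- rbind(
  c(1,  13,  58, 1.12),
  c(6,  142, 56, 1.11),
  c(18, 307, 64, 3.13),
  c(22, 320, 58, 0.72),
  c(1,  13,  60, 0.50),  # hypothetical repeat of pair 1-13
  c(6,  142, 61, 0.25)   # hypothetical repeat of pair 6-142
)

# One key per pair of individuals, built from columns 1 and 2
pair.key <- paste(X[, 1], X[, 2])

# Sum column 4 within each pair; the result is named by pair.key
pair.sums <- tapply(X[, 4], pair.key, sum)

pair.sums[["1 13"]]   # 1.12 + 0.50
pair.sums[["6 142"]]  # 1.11 + 0.25
```

This is the result I am after, just for ~3 billion rows rather than six.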

SO - does anyone have ideas on how to deal with this problem - i.e.,
how to use a tapply() like function on an enormous matrix? This isn't
necessarily a bigtabulate question (although if I screwed up using
bigsplit, let me know). If another package (e.g., an SQL package) can
do something like this efficiently, I'd like to hear about it and your
experiences using it.
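
One direction that might sidestep bigsplit, offered only as an untested
sketch: with 9000 people, the per-pair sums would fit in a 9000 x 9000
matrix (roughly 650MB of doubles), so the file could be read in chunks
and accumulated with base R. The helper name pair.sums.chunked and the
chunk size are hypothetical, and the code assumes the IDs are integers
running from 1 to n.people, which I have not stated above.

```r
# Hypothetical helper: accumulate column-4 sums per (ID1, ID2) pair while
# reading the file in chunks, so the full table is never in memory at once.
# ASSUMPTION: IDs are integers in 1..n.people; columns are space-separated.
pair.sums.chunked <- function(path, n.people, chunk.lines = 1e6) {
  sums <- matrix(0, n.people, n.people)  # sums[i, j] is the total for pair i-j
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    # scan() on an open connection continues from the current position
    chunk <- scan(con, what = list(id1 = 0, id2 = 0, skip = 0, share = 0),
                  nlines = chunk.lines, quiet = TRUE)
    if (length(chunk$id1) == 0) break  # end of file
    # Collapse each pair to a single linear (column-major) index into sums
    key <- (chunk$id2 - 1) * n.people + chunk$id1
    # Sum within the chunk first, since a pair can repeat inside one chunk
    agg <- rowsum(chunk$share, key)
    idx <- as.numeric(rownames(agg))
    sums[idx] <- sums[idx] + agg[, 1]
  }
  sums
}

# Intended use on the real file (not run here):
# sums <- pair.sums.chunked("file.loc.X", n.people = 9000)
# sums[1, 13]  # total sharing for persons 1 and 13
```

Whether this is fast enough on 3 billion rows is something I cannot say
without trying it.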

Thank you in advance,

Matt

--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com