[BioC] tapply for enormous (>2^31 row) matrices

Wed Feb 22 21:47:31 CET 2012

Hi Matthew,

You don't need a supercomputer for that.

If you know in advance the nb of individual people, you can
preallocate the final result. Use a square numeric matrix for
the final result, where the rows and cols correspond to the
individuals. Conceptually, the matrix will be symmetric so you
just need to fill half of it. Then just read your big file by
chunks and skip the cols you don't need.

computeAllPairSums <- function(filename, nbindiv)
{
     con <- file(filename, open="r")
     on.exit(close(con))
     ans <- matrix(numeric(nbindiv * nbindiv), nrow=nbindiv)
     CHUNK_SIZE <- 5000000L
     chunk <- 0L
     while (TRUE) {
         df0 <- read.table(con,
                  col.names=c("ID1", "ID2", "ignored", "sharing"),
                  colClasses=c("integer", "integer", "NULL", "numeric"),
                  nrows=CHUNK_SIZE)
         if (nrow(df0) == 0L)
             break

         chunk <- chunk + 1L
         cat("Processing chunk", chunk, "... ")
         ## Swap ids just in case but maybe you can assume
         ## ID1 is always <= ID2 in your file.
         swap_idx <- df0[["ID2"]] < df0[["ID1"]]
         tmp <- df0[["ID1"]][swap_idx]
         df0[["ID1"]][swap_idx] <- df0[["ID2"]][swap_idx]
         df0[["ID2"]][swap_idx] <- tmp

         df <- aggregate(df0[["sharing"]], df0[c("ID1", "ID2")], sum)
         idx <- as.matrix(df[1:2])
         ans[idx] <- ans[idx] + df[[3L]]
         cat("OK\n")
     }
     ans
}

Takes 4 min 11 s and 1.5GB of RAM on my laptop to process a 80M
line file (with 'nbindiv=9000L').

Having the result in this square matrix might or might not be what
you wanted but note that it will allow fast lookup of any given pair
(or set of pairs) which is not the case with a 3 col data.frame.

Cheers,
H.

On 02/21/2012 03:11 PM, Matthew Keller wrote:
> Hello all,
>
> I just sent this to the main R forum, but realized this audience might
> have more familiarity with this type of problem...
>
>
> ---------- Forwarded message ----------
> From: Matthew Keller<mckellercran at gmail.com>
> Date: Tue, Feb 21, 2012 at 4:04 PM
> Subject: tapply for enormous (>2^31 row) matrices
> To: r help<r-help at r-project.org>
>
>
> Hi all,
>
> SETUP:
> I have pairwise data on 22 chromosomes. Data matrix X for a given
> chromosome looks like this:
>
> 1 13 58 1.12
> 6 142 56 1.11
> 18 307 64 3.13
> 22 320 58 0.72
>
> Where column 1 is person ID 1, column 2 is person ID 2, column 3 can
> be ignored, and column 4 is how much chromosomal sharing those two
> individuals have in some small portion of the chromosome. There are
> 9000 individual people, and therefore ~ (9000^2)/2 pairwise matches at
> each small location on the chromosome, so across an entire chromosome,
> these matrices are VERY large (e.g., 3 billion rows, which is>  the
> 2^31 vector size limitation in R). I have access to a server with 64
> bit R, 1TB RAM and 80 processors.
>
> PROBLEM:
> A pair of individuals (e.g., person 1 and 13 from the first row above)
> will show up multiple times in a given file. I want to sum column 4
> across each pair of individuals. If I could bring the matrix into R, I
> could use tapply() to accomplish this by indexing on
> paste(X[,1],X[,2]), but the matrix doesn't fit into R. I have been
> trying to use bigmemory and bigtabulate packages in R, but when I try
> to use the bigsplit function, R never completes the operation (after a
> day, I killed the process). In particular, I did this:
>
> X<- read.big.matrix("file.loc.X",sep=" ",type="double")
> hap.indices<- bigsplit(X,1:2) #this runs for too long to be useful on
> these matrices
> #I was then going to use foreach loop to sum across the splits
> identified by bigsplit
>
> SO - does anyone have ideas on how to deal with this problem - i.e.,
> how to use a tapply() like function on an enormous matrix? This isn't
> necessarily a bigtabulate question (although if I screwed up using
> bigsplit, let me know). If another package (e.g., an SQL package) can
> do something like this efficiently, I'd like to hear about it and your
> experiences using it.
>
> Thank you in advance,
>
> Matt
>
>
>
> --
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319