[BioC] [R] tapply for enormous (>2^31 row) matrices
Matthew Keller
mckellercran at gmail.com
Thu Feb 23 17:39:40 CET 2012
Thank you all very much for your help (on both the R-help and the
Bioconductor mailing lists).

Benilton - I couldn't get sqldf to install on the server I'm using
(the error is: Error : package 'gsubfn' does not have a name space). I
think this was a problem with R 2.13, and I'm trying to get the admins
to install a more up-to-date version. I know I probably need to learn
a modicum of SQL given the sizes of the datasets I'm working with now.
I ended up using a modified version of Hervé Pagès' excellent code
(thank you!). I got a huge (40-fold) speed bump by using the
data.table package for the indexing/aggregation steps, turning an
hours-long job into a minutes-long one. SO - data.table is hugely
useful if you're running indexing/apply-family operations on huge
datasets. By the way, I'm not sure why, but read.table was a bit
faster than scan for this problem. Here is the code for others:
require(data.table)

computeAllPairSums <- function(filename, nbindiv, nrows.to.read)
{
    con <- file(filename, open="r")
    on.exit(close(con))
    ans <- matrix(numeric(nbindiv * nbindiv), nrow=nbindiv)
    chunk <- 0L
    while (TRUE) {
        ## read.table was a bit faster than scan here; on a connection
        ## it raises an error at end-of-file, so treat that as an
        ## empty chunk and stop
        df0 <- tryCatch(
            read.table(con,
                       col.names=c("ID1", "ID2", "ignored", "sharing"),
                       colClasses=c("integer", "integer", "NULL", "numeric"),
                       nrows=nrows.to.read, comment.char=""),
            error=function(e) data.frame())
        if (nrow(df0) == 0L)
            break
        chunk <- chunk + 1L
        cat("Processing chunk", chunk, "... ")
        DT <- data.table(df0)
        setkey(DT, ID1, ID2)
        ss <- DT[, sum(sharing), by="ID1,ID2"]    # per-pair sums, this chunk
        idd <- as.matrix(subset(ss, select=1:2))  # (ID1, ID2) index pairs
        newvec <- as.vector(as.matrix(subset(ss, select=3)))
        ans[idd] <- ans[idd] + newvec             # accumulate across chunks
        cat("OK\n")
    }
    ans
}
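The accumulation step relies on R's matrix indexing: subscripting ans
with a two-column matrix of (row, col) pairs addresses individual
cells, so each chunk's per-pair sums are added in one vectorized step.
A tiny self-contained illustration (toy values, not from the real data):

```r
## Matrix indexing: a two-column matrix of (row, col) pairs addresses
## individual cells, so per-pair sums accumulate in a single step.
ans <- matrix(0, nrow=3, ncol=3)
idd <- rbind(c(1, 2), c(2, 3))   # (ID1, ID2) index pairs from one chunk
newvec <- c(0.75, 1.0)           # the corresponding per-pair sums
ans[idd] <- ans[idd] + newvec
ans[1, 2]   # 0.75
ans[2, 3]   # 1.0
```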
On Wed, Feb 22, 2012 at 3:20 PM, ilai <keren at math.montana.edu> wrote:
> On Tue, Feb 21, 2012 at 4:04 PM, Matthew Keller <mckellercran at gmail.com> wrote:
>
>> X <- read.big.matrix("file.loc.X",sep=" ",type="double")
>> hap.indices <- bigsplit(X,1:2) #this runs for too long to be useful on
>> these matrices
>> #I was then going to use foreach loop to sum across the splits
>> identified by bigsplit
>
> How about just using foreach earlier in the process ? e.g. split
> file.loc.X to (80) sub files and then run
> read.big.matrix/bigsplit/sum inside %dopar%
>
> If splitting X beforehand is a problem, you could also use ?scan to
> read in different chunks of the file, something like (untested
> obviously):
> # for X a matrix 800x4
> lineind <- seq(1,800,100) # create an index vec for the lines to read
> ReducedX <- foreach(i = 1:8) %dopar% {
> x <- scan('file.loc.X',list(double(0),double(0),double(0),double(0)),skip=lineind[i]-1,nlines=100) # skip counts lines, so -1 off the 1-based start
> ... do your thing on x (aggregate/tapply etc.)
> }
>
> Hope this helped
> Elai.
>
>
>
>>
>> SO - does anyone have ideas on how to deal with this problem - i.e.,
>> how to use a tapply() like function on an enormous matrix? This isn't
>> necessarily a bigtabulate question (although if I screwed up using
>> bigsplit, let me know). If another package (e.g., an SQL package) can
>> do something like this efficiently, I'd like to hear about it and your
>> experiences using it.
>>
>> Thank you in advance,
>>
>> Matt
>>
>>
>>
>> --
>> Matthew C Keller
>> Asst. Professor of Psychology
>> University of Colorado at Boulder
>> www.matthewckeller.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
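
[For archive readers: Elai's chunked-scan suggestion above can be
sketched sequentially as below; wrapping the loop body in
foreach(...) %dopar% would parallelize it as he describes. A tiny
synthetic temp file stands in for 'file.loc.X', and the chunk size and
aggregation columns are illustrative assumptions.]

```r
## Chunked scan + per-chunk aggregation, then a final combine.
## A small synthetic 4-column file stands in for 'file.loc.X'.
f <- tempfile()
writeLines(c("1 2 0 0.5", "1 2 0 0.25", "2 3 0 1.0", "2 3 0 2.0"), f)
chunk.size <- 2
n.lines <- 4                                # total lines (assumed known)
starts <- seq(1, n.lines, chunk.size)       # 1-based start line per chunk
partial <- vector("list", length(starts))
for (i in seq_along(starts)) {
    ## skip counts lines already consumed, hence starts[i] - 1
    x <- scan(f, what=list(double(0), double(0), double(0), double(0)),
              skip=starts[i] - 1, nlines=chunk.size, quiet=TRUE)
    ## within-chunk aggregation: sum column 4 by the (ID1, ID2) pair
    partial[[i]] <- aggregate(x[[4]], by=list(ID1=x[[1]], ID2=x[[2]]),
                              FUN=sum)
}
## combine the per-chunk results and sum again across chunks
all.sums <- aggregate(x ~ ID1 + ID2, data=do.call(rbind, partial), FUN=sum)
```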
--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com