[R] tapply for enormous (>2^31 row) matrices

Wed Feb 22 23:20:15 CET 2012

On Tue, Feb 21, 2012 at 4:04 PM, Matthew Keller <mckellercran at gmail.com> wrote:

> X <- read.big.matrix("file.loc.X",sep=" ",type="double")
> hap.indices <- bigsplit(X,1:2) # this runs for too long to be useful on these matrices
> # I was then going to use a foreach loop to sum across the splits identified by bigsplit

How about using foreach earlier in the process? E.g., split
file.loc.X into (80) sub-files and then run
read.big.matrix/bigsplit/sum on each sub-file inside %dopar%.
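
A rough sketch of the splitting step, in base R only, assuming a
whitespace-separated text file with no header line. The helper name
split_file, the chunk size and the ".partNNN" naming scheme are made up
for illustration; each resulting sub-file could then be handed to
read.big.matrix/bigsplit inside %dopar%.

```r
# Hypothetical helper: split a large text file into sub-files of at most
# `lines_per_file` lines each, without loading the whole file into memory.
split_file <- function(path, lines_per_file, prefix = paste0(path, ".part")) {
  con <- file(path, open = "r")
  on.exit(close(con))
  out_files <- character(0)
  repeat {
    # readLines() on an open connection continues where the last call stopped
    chunk <- readLines(con, n = lines_per_file)
    if (length(chunk) == 0) break
    out <- sprintf("%s%03d", prefix, length(out_files) + 1)
    writeLines(chunk, out)
    out_files <- c(out_files, out)
  }
  out_files
}

# Small demonstration on a temporary 10-line file (made-up data)
demo <- tempfile()
writeLines(paste(1:10, 1:10, 1:10, 1:10), demo)
parts <- split_file(demo, lines_per_file = 4)
```

For a file of the size discussed here, a command-line tool such as the
Unix split utility would probably be faster; the R version is only
meant to show the idea.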

If splitting X beforehand is a problem, you could also use scan (see
?scan) to read in different chunks of the file, something like
(untested, obviously):
# for X an 800x4 matrix
library(foreach)  # %dopar% also needs a registered backend, e.g. via doParallel
lineind <- seq(0, 700, 100)  # number of lines to skip before each 100-line chunk
ReducedX <- foreach(i = 1:8) %dopar% {
  x <- scan('file.loc.X', list(double(0), double(0), double(0), double(0)),
            skip = lineind[i], nlines = 100)
  # ... do your thing on x (aggregate/tapply etc.)
}
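
To make the "do your thing on x" step a bit more concrete: since a sum
can be computed per chunk and the partial sums added up afterwards,
tapply() can be run on each chunk and the results merged by group
label. The sketch below is base R and has not been tried on anything
large. It assumes that columns 1 and 2 identify the group and that
column 3 holds the values to sum (the original post does not say which
columns are summed, so that part is a guess), and the names chunk_sums
and combine_sums are hypothetical. lapply() could be swapped for
foreach(...) %dopar% once a parallel backend is registered.

```r
# Per-chunk group sums: read `nlines` lines after skipping `skip` lines,
# then sum column 3 within each combination of columns 1 and 2.
chunk_sums <- function(path, skip, nlines) {
  x <- scan(path, list(double(0), double(0), double(0), double(0)),
            skip = skip, nlines = nlines, quiet = TRUE)
  key <- paste(x[[1]], x[[2]], sep = "_")
  tapply(x[[3]], key, sum)
}

# Merge partial sums: concatenate all chunks, then sum again by group label.
combine_sums <- function(parts) {
  vals <- unlist(lapply(parts, as.vector))
  labs <- unlist(lapply(parts, names))
  tapply(vals, labs, sum)
}

# Demonstration on a made-up 8-line file, read in two chunks of 4 lines
f <- tempfile()
writeLines(c("1 1 2 0", "1 2 3 0", "1 1 4 0", "2 1 5 0",
             "1 2 1 0", "2 1 1 0", "1 1 1 0", "2 2 7 0"), f)
skips <- seq(0, 4, by = 4)
parts <- lapply(skips, chunk_sums, path = f, nlines = 4)
total <- combine_sums(parts)
```

This only works directly for statistics that combine across chunks the
way sums and counts do; a mean would need the per-chunk sums and counts
kept separately and divided at the end.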

Hope this helped
Elai.

>
> SO - does anyone have ideas on how to deal with this problem - i.e.,
> how to use a tapply() like function on an enormous matrix? This isn't
> necessarily a bigtabulate question (although if I screwed up using
> bigsplit, let me know). If another package (e.g., an SQL package) can
> do something like this efficiently, I'd like to hear about it and your
> experiences using it.
>
> Thank you in advance,
>
> Matt
>
>
>
> --
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com