[R] memory-efficient column aggregation of a sparse matrix

Thu Feb 1 02:59:30 CET 2007

I need to sum the columns of a sparse matrix according to a factor,
i.e. given a sparse matrix X and a factor fac of length ncol(X), sum
the elements of each row within each factor level and return the
sparse matrix Y of size nrow(X) by nlevels(fac).  The appended code
does the job, but is unacceptably memory-bound because tapply()
returns a dense (non-sparse) result.  Can anyone suggest a more
memory- and CPU-efficient approach, e.g. a tapply method for sparse
matrices?  Thanks.
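
To make the intended result concrete, here is a small dense illustration in base R only. The values of X and fac are made up for the example, and the indicator matrix G is just one way of writing the operation down, not a proposed solution to the memory problem:

```r
# Toy data (hypothetical): 2 rows, 3 columns, columns 1 and 3 share level "a"
X <- matrix(c(1, 0, 2,
              0, 3, 0), nrow = 2, byrow = TRUE)
fac <- factor(c("a", "b", "a"))

# G[j, k] is 1 when column j of X belongs to level k of fac, else 0
G <- outer(as.integer(fac), seq_len(nlevels(fac)), "==") * 1

# Summing columns by factor level is then a matrix product
Y <- X %*% G
colnames(Y) <- levels(fac)
Y
```

Here Y has nrow(X) rows and nlevels(fac) columns; row 1 sums to 3 under "a" and 0 under "b", and row 2 to 0 under "a" and 3 under "b".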

-- 
+--------------------------------------------------------------+
| Jon Stearley                  (505) 845-7571  (FAX 844-9297) |
| Sandia National Laboratories  Scalable Systems Integration   |
+--------------------------------------------------------------+

# x and y are of SparseM class matrix.csr
"aggregate.csr" <-
function(x, fac) {
         # make a vector indicating the row of each nonzero; diff(x@ia) is
         # the per-row nonzero count, so empty rows are handled correctly
         rows <- rep(seq_len(nrow(x)), diff(x@ia))
         # keep every row as a level so tapply() does not drop empty rows
         rows <- factor(rows, levels = seq_len(nrow(x)))

         # make a vector indicating the column factor of each nonzero
         f <- fac[x@ja]

         # aggregate by row,f
         y <- tapply(x@ra, list(rows, f), sum)

         # sparsify it
         y[is.na(y)] <- 0  # change tapply NAs to as.matrix.csr 0s
         y <- as.matrix.csr(y)

         y
}