[R] memory-efficient column aggregation of a sparse matrix
roger koenker
rkoenker at uiuc.edu
Thu Feb 1 14:30:48 CET 2007
Doug is right, I think, that this would be easier with full indexing
using the matrix.coo classe, if you want to use SparseM. But
then the tapply seems to be the way to go.
url: www.econ.uiuc.edu/~roger Roger Koenker
email rkoenker at uiuc.edu Department of Economics
vox: 217-333-4558 University of Illinois
fax: 217-244-6678 Champaign, IL 61820
On Feb 1, 2007, at 7:22 AM, Douglas Bates wrote:
> On 1/31/07, Jon Stearley <jrstear at sandia.gov> wrote:
>> I need to sum the columns of a sparse matrix according to a factor -
>> ie given a sparse matrix X and a factor fac of length ncol(X), sum
>> the elements by column factors and return the sparse matrix Y of size
>> nrow(X) by nlevels(f). The appended code does the job, but is
>> unacceptably memory-bound because tapply() uses a non-sparse
>> representation. Can anyone suggest a more memory and cpu efficient
>> approach? Eg, a sparse matrix tapply method? Thanks.
>
> This is the sort of operation that is much more easily performed in
> the triplet representation of a sparse matrix where each nonzero
> element is represented by its row index, column index and value.
> Using that representation you could map the column indices according
> to the factor then convert back to one of the other representations.
> The only question would be what to do about nonzeros in different
> columns of the original matrix that get mapped to the same element in
> the result. It turns out that in the sparse matrix code used by the
> Matrix package the triplet representation allows for duplicate index
> positions with the convention that the resulting value at a position
> is the sum of the values of any triplets with that index pair.
>
> If you decide to use this approach please be aware that the indices
> for the triplet representation in the Matrix package are 0-based (as
> in C code) not 1-based (as in R code). (I imagine that Martin is
> thinking "we really should change that" as he reads this part.)
>
>>
>> --
>> +--------------------------------------------------------------+
>> | Jon Stearley (505) 845-7571 (FAX 844-9297) |
>> | Sandia National Laboratories Scalable Systems Integration |
>> +--------------------------------------------------------------+
>>
>>
>> # x and y are of SparseM class matrix.csr
>> "aggregate.csr" <-
>> function(x, fac) {
>> # make a vector indicating the row of each nonzero
>> rows <- integer(length=length(x at ra))
>> rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each row
>> rows <- as.integer(cumsum(rows)) # and finish with a cumsum
>>
>> # make a vector indicating the column factor of each nonzero
>> f <- fac[x at ja]
>>
>> # aggregate by row,f
>> y <- tapply(x at ra, list(rows,f), sum)
>>
>> # sparsify it
>> y[is.na(y)] <- 0 # change tapply NAs to as.matrix.csr 0s
>> y <- as.matrix.csr(y)
>>
>> y
>> }
>>
More information about the R-help
mailing list