[R] using apply with sparse matrix from package Matrix

Wed Sep 5 01:57:02 CEST 2012

On Tue, Sep 4, 2012 at 10:58 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
>>>>>> Jennifer Lyon <jennifer.s.lyon at gmail.com>
>>>>>>     on Fri, 31 Aug 2012 17:22:57 -0600 writes:
>
>     > Hi:
>     > I was trying to use apply on a sparse matrix from package Matrix,
>     > and I get the error:
>
>     > Error in asMethod(object) :
>     > Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 106
>
>     > Is there a way to apply a function to all the rows without bumping
>     > into this problem?
>
>     > Here is a simplified example:
>
>     >> dim(sm)
>     > [1] 72913 43052
>
>     >> class(sm)
>     > [1] "dgCMatrix"
>     > attr(,"package")
>     > [1] "Matrix"
>
>     >> str(sm)
>     > Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>     > ..@ i       : int [1:6590004] 789 801 802 1231 1236 11739 17817 17943 18148 18676 ...
>     > ..@ p       : int [1:43053] 0 147 303 450 596 751 908 1053 1188 1347 ...
>     > ..@ Dim     : int [1:2] 72913 43052
>     > ..@ Dimnames:List of 2
>     > .. ..$ : NULL
>     > .. ..$ : NULL
>     > ..@ x       : num [1:6590004] 0.601 0.527 0.562 0.641 0.684 ...
>     > ..@ factors : list()
>
>     >> my.sum<-apply(sm, 1, sum)
>     > Error in asMethod(object) :
>     > Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 106
>
> So, actually it would have worked (though not efficiently) if
> your sm matrix had been much smaller.
>
> However,  we provide  rowSums(), rowMeans(), colSums(), colMeans()
> for all of our matrices, including the sparse ones.
>
> So your present problem can be solved using
>
> my.sum <- rowSums(sm)
>
> Best regards,
> Martin Maechler, ETH Zurich

Thank you for letting me know about rowSums(). Two points. First,
sadly, I was unclear in my posting: using "sum" was just an example.
In the real case I am applying my own function to each row. I guess
the answer for this problem is that iteration is my friend. Good to
know.
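
For the custom-function case, here is a minimal sketch of what such
iteration might look like. It is not a method provided by the Matrix
package: the helper name apply_rows_chunked and its chunk_size
argument are made up for illustration. The idea is to densify only a
small block of rows at a time, so that the full 72913 x 43052 dense
matrix (about 3.1e9 entries, more than the 2^31 - 1 elements a vector
could hold in R 2.15) is never created. rsparsematrix() is used only
to build a toy matrix and assumes a reasonably recent Matrix version.

```r
library(Matrix)

# Hypothetical helper (not part of Matrix): apply f to each full row of a
# sparse matrix, converting only chunk_size rows to dense form at a time.
apply_rows_chunked <- function(m, f, chunk_size = 100L) {
  n <- nrow(m)
  if (n == 0L) return(list())
  out <- vector("list", n)
  for (s in seq.int(1L, n, by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1L, n)
    # Only this block of rows is ever held as a dense matrix
    dense <- as.matrix(m[idx, , drop = FALSE])
    for (k in seq_along(idx)) {
      out[[idx[k]]] <- f(dense[k, ])
    }
  }
  out
}

# Toy example; sum stands in for the user's own row function
set.seed(1)
small <- rsparsematrix(50, 20, density = 0.1)
res <- unlist(apply_rows_chunked(small, sum, chunk_size = 7L))
```

Row-slicing a column-compressed matrix is not cheap, so a larger
chunk_size trades memory for fewer slicing operations. If the function
only needs the non-zero entries of each row, working from the
(i, j, x) triplets returned by summary(sm) may be a better fit.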

Second, I'm embarrassed to say I hadn't remembered rowSums(), so in
cases where I needed the row sums I had just been postmultiplying by a
vector of 1's. Just FYI, I thought I should try rowSums(), so I ran a
small timing trial, and it appears that postmultiplying is faster than
rowSums(). The run is as follows:

> str(sm)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:6590004] 721 926 1275 1791 2370 2755 3393 4638 5363 5566 ...
  ..@ p       : int [1:43053] 0 147 303 450 596 751 908 1053 1188 1347 ...
  ..@ Dim     : int [1:2] 72913 43052
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:6590004] 0.0735 0.3206 0.1861 0.1604 0.197 ...
  ..@ factors : list()

> library(rbenchmark)

# Just checking how expensive building a vector of 1's is: not very,
# at least for a matrix of the size I'm interested in
> benchmark(i1<-rep(1, ncol(sm)))
                    test replications elapsed relative user.self sys.self user.child sys.child
1 i1 <- rep(1, ncol(sm))          100   0.119        1      0.12        0          0         0

#Postmultiplying by 1's timing
> benchmark(la<-sm %*% i1)
             test replications elapsed relative user.self sys.self user.child sys.child
1 la <- sm %*% i1          100   5.993        1     5.993        0          0         0

#rowSums timing
> benchmark(la1<-rowSums(sm))
                test replications elapsed relative user.self sys.self user.child sys.child
1 la1 <- rowSums(sm)          100  28.117        1    28.114    0.004          0         0

#Make sure the results are the same
>  all(la==la1)
[1] TRUE
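
The comparison above can be reproduced on a small matrix with the
sketch below. The toy matrix and variable names are only for
illustration, and rsparsematrix() assumes a reasonably recent Matrix
version. The product comes back as a one-column Matrix object rather
than a plain numeric vector, so it is converted before comparing, and
all.equal() is used instead of == because the two routes may add the
entries in a different order and so differ by floating-point rounding.

```r
library(Matrix)

set.seed(42)
m <- rsparsematrix(200, 50, density = 0.05)  # toy sparse matrix

ones <- rep(1, ncol(m))

# Postmultiplying by a vector of 1's gives a 200 x 1 Matrix object;
# take its single column as an ordinary numeric vector
by_product <- as.matrix(m %*% ones)[, 1]

# The dedicated method for sparse matrices
by_rowsums <- rowSums(m)

# Tolerant comparison of the two results
same <- isTRUE(all.equal(by_product, by_rowsums))
```

No timings are given for this toy case; the benchmark figures above
apply only to the original 72913 x 43052 matrix and that
R/Matrix version.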

The Matrix package is awesome, and I appreciate you taking the
time to answer my questions.

Jen

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rbenchmark_0.3.1 Matrix_1.0-6     lattice_0.20-6

loaded via a namespace (and not attached):
[1] grid_2.15.1