[R] Slow indexing access for Matrix

Peter Dalgaard p.dalgaard at biostat.ku.dk
Mon Feb 23 23:13:51 CET 2009


Niels Richard Hansen wrote:
> Consider the following little "benchmark"
> 
>  > require(Matrix)
>  > tmp <- Matrix(c(rep(1,1000),rep(0,9000)),ncol=1)
>  > ind <- sample(1:10000,10000)
>  > system.time(tmp[ind,])
>    user  system elapsed
>   0.004   0.001   0.005
> 
>  > ind <- sample(1:1000,10000,replace=TRUE)
>  > system.time(tmp[ind,])
>    user  system elapsed
>   0.654   0.006   0.703
> 
>  > system.time(Matrix(as(tmp,"matrix")[ind,]))
>    user  system elapsed
>   0.005   0.000   0.006
> 
> First I access all 10000 rows in a random order, which is fast,
> but when I access the first 1000 rows 10000 times there is a
> considerable slowdown. Last I convert back and forth
> between matrix and Matrix and get a serious speedup. Am I missing
> a point here? Should I not use indexing with "[" for the
> sparse matrices if I have repeated indices?
> 
> I'm running Mac OS X, version 10.5.6, with Matrix package
> version 0.999375-21.
> 
> I hope that somebody can enlighten me on this issue.
> 
> Thanks, Niels

The sources have the answer, but I'm as reluctant to read them as you 
are. ;-)

The repeated indices are certainly an important part of it. Notice also 
that you'll have timings like

 > ind <- sample(1:10000,10000,replace=TRUE)
 > system.time(tmp[ind,])
    user  system elapsed
   0.884   0.000   1.302
 > ind <- sample(1:1000,10000,replace=TRUE)
 > system.time(tmp[ind,])
    user  system elapsed
   2.053   0.009   2.268
 > ind <- sample(1:10000,10000,replace=FALSE)
 > system.time(tmp[ind,])
    user  system elapsed
    0.01    0.00    0.01

It is, however, apparently unrelated to the sparseness of the result 
(sampling from 1001:2000 gives the same result).

Also
 > ind <- sample(1:5000,5000,replace=FALSE)
 > ind <- c(ind,ind)
 > system.time(tmp[ind,])
    user  system elapsed
   1.204   0.001   1.331

has a considerable part of the slowdown, as does

ind <- c(1:5000,1:5000)

Presumably the issue is that calculations on sparseness patterns are 
harder when there are repeated indices.

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907




More information about the R-help mailing list