[R] Speed Advice for R --- avoid data frames

Uwe Ligges ligges at statistik.tu-dortmund.de
Sat Jul 2 20:58:45 CEST 2011

Some comments:

the comparison of matrix rows vs. matrix columns is flawed: R has lazy 
evaluation, so the matrix passed as an argument is only constructed 
during the first timing (the rows) and already exists during the second 
timing (the columns). To time both fairly, construct the objects first:

  M <- matrix( rnorm(C*R), nrow=R )
  D <- as.data.frame(matrix( rnorm(C*R), nrow=R ) )

Further on, you are correct in your statement that data.frame indexing 
is much slower; but if you can store your data in matrix form, simply 
keep it as a matrix and carry on.

I doubt anybody really performs the indexing operation you cited inside 
a loop. And with a data.frame, vectorized replacements are still 
acceptably fast:

 > system.time(D[,20] <- sqrt(abs(D[,20])) + rnorm(1000))
    user  system elapsed
    0.01    0.00    0.01

 > system.time(D[20,] <- sqrt(abs(D[20,])) + rnorm(1000))
    user  system elapsed
    0.51    0.00    0.52
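When many such row-wise replacements really are needed, one workaround 
(copying into a matrix, operating there, and copying the result back, as 
the quoted post below also suggests) can be sketched as follows; the two 
round-trip copies are the price you pay:

```r
# Sketch, assuming D is an all-numeric data.frame as defined above:
m <- as.matrix(D)            # one copy out of the data.frame
for (r in 1:nrow(m))         # cheap row access on a matrix
  m[r, 20] <- sqrt(abs(m[r, 20])) + rnorm(1)
D[, 20] <- m[, 20]           # one vectorized copy back
```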

OK, it would be nice to do that faster, but this is not easy. I think R 
Core would be happy to see contributions that make it faster without 
breaking existing features.

Best wishes,

On 02.07.2011 20:35, ivo welch wrote:
> This email is intended for R users that are not that familiar with R
> internals and are searching google about how to speed up R.
> Despite common misperception, R is not slow when it comes to iterative
> access.  R is fast when it comes to matrices.  R is very slow when it
> comes to iterative access into data frames.  Such access occurs when a
> user uses "data$varname[index]", which is a very common operation.  To
> illustrate, run the following program:
> R <- 1000; C <- 1000
> example <- function(m) {
>   cat("rows: ")
>   cat(system.time( for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1) ), "\n")
>   cat("columns: ")
>   cat(system.time( for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1) ), "\n")
>   if (is.data.frame(m)) {
>     cat("df: columns as names: ")
>     cat(system.time( for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1) ), "\n")
>   }
> }
> cat("\n**** Now as matrix\n")
> example( matrix( rnorm(C*R), nrow=R ) )
> cat("\n**** Now as data frame\n")
> example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )
> The following are the reported timing under R 2.12.0 on a Mac Pro 3,1
> with ample RAM:
> matrix, columns: 0.01s
> matrix, rows: 0.175s
> data frame, columns: 53s
> data frame, rows: 56s
> data frame, names: 58s
> Data frame access is about 5,000 times slower than matrix column
> access, and 300 times slower than matrix row access.  R's data frame
> operational speed is an amazing 40 data accesses per second.  I have
> not seen access numbers this low for decades.
> How to avoid it?  Not easy.  One way is to create multiple matrices,
> and group them as an object.  Of course, this loses a lot of features
> of R.  Another way is to copy all data used in calculations out of the
> data frame into a matrix, do the operations, and then copy them back.
> Not ideal, either.
> In my opinion, this is an R design flaw.  Data frames are the
> fundamental unit of much statistical analysis, and should be fast.  I
> think R lacks any indexing into data frames.  Turning on indexing of
> data frames should at least be an optional feature.
> I hope this message post helps others.
> /iaw
> ----
> Ivo Welch (ivo.welch at gmail.com)
> http://www.ivo-welch.info/
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
