[R] Speed Advice for R --- avoid data frames
Uwe Ligges
ligges at statistik.tu-dortmund.de
Sat Jul 2 20:58:45 CEST 2011
Some comments:
The comparison of matrix rows vs. matrix columns is incorrect. R
evaluates function arguments lazily, so the matrix is constructed inside
the timing for the rows and is already available in the timing for the
columns. To keep the construction out of both timings, build the objects
first:
M <- matrix( rnorm(C*R), nrow=R )
D <- as.data.frame(matrix( rnorm(C*R), nrow=R ) )
example(M)
example(D)
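
To illustrate the lazy evaluation point, here is a minimal sketch. The
helpers note(), make_matrix() and timed() are hypothetical names made up
for this illustration; they only record the order in which things happen.

```r
events <- character(0)
note <- function(msg) events <<- c(events, msg)

# Hypothetical constructor that records when it actually runs.
make_matrix <- function() {
  note("matrix built")
  matrix(rnorm(4), nrow = 2)
}

# Hypothetical stand-in for example(): it uses its argument twice.
timed <- function(m) {
  note("function entered")
  m[1, 1]                 # first use forces the promise, so the matrix is built here
  note("after first use")
  m[2, 2]                 # the value is already available here
  invisible(NULL)
}

# Construction passed as an argument: it happens inside the function.
timed(make_matrix())
lazy_order <- events

# Construction done beforehand: it happens before the function is entered.
events <- character(0)
M <- make_matrix()
timed(M)
eager_order <- events

print(lazy_order)
print(eager_order)
```

In the first call the matrix is only built at the first use of m, which
is why that cost would land in whichever timing comes first.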
Further on, you are correct with your statement that data.frame indexing
is much slower, but if you can store your data in matrix form, just keep
using a matrix.
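
As a rough sketch of what that can look like when the data start out as
an all-numeric data frame (the sizes and the choice of column 2 are
arbitrary assumptions for the illustration): copy out to a matrix once,
do the element-wise work there, and convert back once.

```r
set.seed(42)
D <- as.data.frame(matrix(rnorm(50 * 4), nrow = 50))

M <- as.matrix(D)                  # one copy out; all columns are numeric
for (r in seq_len(nrow(M))) {
  M[r, 2] <- sqrt(abs(M[r, 2]))    # element access on a matrix is cheap
}
D2 <- as.data.frame(M)             # one copy back; column names are kept

print(head(D2, 3))
```

This only makes sense if all columns have the same type, since
as.matrix() would otherwise coerce everything to a common type.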
I doubt anybody is really going to perform the indexing operation you
cited inside a loop. With a data.frame, I can live with vectorized
replacements, by column and by row:
> system.time(D[,20] <- sqrt(abs(D[,20])) + rnorm(1000))
   user  system elapsed
   0.01    0.00    0.01
> system.time(D[20,] <- sqrt(abs(D[20,])) + rnorm(1000))
   user  system elapsed
   0.51    0.00    0.52
OK, it would be nice to do that faster, but this is not easy. I think R
Core is happy to see contributions to make it faster without breaking
existing features.
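
For readers who want to convince themselves that the vectorized column
replacement does the same work as the element-wise loop, here is a small
sketch. The data frame size and the pre-drawn noise vector are
assumptions chosen so that both variants are directly comparable.

```r
set.seed(1)
n <- 200
D <- as.data.frame(matrix(rnorm(n * 5), nrow = n))
noise <- rnorm(n)                  # drawn once so both variants use the same values

# Element-wise loop: one data frame assignment per row.
D_loop <- D
for (r in seq_len(n)) {
  D_loop[r, 2] <- sqrt(abs(D_loop[r, 2])) + noise[r]
}

# Vectorized: a single data frame assignment for the whole column.
D_vec <- D
D_vec[, 2] <- sqrt(abs(D_vec[, 2])) + noise

same <- isTRUE(all.equal(D_loop, D_vec))
print(same)
```

Wrapping each variant in system.time() shows the difference in cost;
the numbers depend on the machine and the R version.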
Best wishes,
Uwe
On 02.07.2011 20:35, ivo welch wrote:
> This email is intended for R users that are not that familiar with R
> internals and are searching google about how to speed up R.
>
> Despite common misperception, R is not slow when it comes to iterative
> access. R is fast when it comes to matrices. R is very slow when it
> comes to iterative access into data frames. Such access occurs when a
> user uses "data$varname[index]", which is a very common operation. To
> illustrate, run the following program:
>
> R<- 1000; C<- 1000
>
> example <- function(m) {
>   cat("rows: ")
>   cat(system.time(for (r in 1:R)
>     m[r, 20] <- sqrt(abs(m[r, 20])) + rnorm(1)), "\n")
>   cat("columns: ")
>   cat(system.time(for (c in 1:C)
>     m[20, c] <- sqrt(abs(m[20, c])) + rnorm(1)), "\n")
>   if (is.data.frame(m)) {
>     cat("df: columns as names: ")
>     cat(system.time(for (c in 1:C)
>       m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
>   }
> }
>
> cat("\n**** Now as matrix\n")
> example( matrix( rnorm(C*R), nrow=R ) )
>
> cat("\n**** Now as data frame\n")
> example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )
>
>
> The following are the reported timings under R 2.12.0 on a Mac Pro 3,1
> with ample RAM:
>
> matrix, columns: 0.01s
> matrix, rows: 0.175s
> data frame, columns: 53s
> data frame, rows: 56s
> data frame, names: 58s
>
> Data frame access is about 5,000 times slower than matrix column
> access, and 300 times slower than matrix row access. R's data frame
> operational speed is an amazing 40 data accesses per second. I have
> not seen access numbers this low for decades.
>
>
> How to avoid it? Not easy. One way is to create multiple matrices
> and group them as an object. Of course, this loses a lot of features
> of R. Another way is to copy all data used in calculations out of the
> data frame into a matrix, do the operations, and then copy them back.
> Not ideal, either.
>
> In my opinion, this is an R design flaw. Data frames are the
> fundamental unit of much statistical analysis, and should be fast. I
> think R lacks any indexing into data frames. Turning on indexing of
> data frames should at least be an optional feature.
>
>
> I hope this post helps others.
>
> /iaw
>
> ----
> Ivo Welch (ivo.welch at gmail.com)
> http://www.ivo-welch.info/