[R] Speed Advice for R --- avoid data frames

ivo welch ivo.welch at gmail.com
Sat Jul 2 20:35:08 CEST 2011

This email is intended for R users who are not very familiar with R
internals and are searching Google for ways to speed up R.

Despite a common misperception, R is not slow when it comes to iterative
access.  R is fast when it comes to matrices.  R is very slow when it
comes to iterative access into data frames.  Such access occurs when a
user uses "data$varname[index]" inside a loop, which is a very common
operation.  To illustrate, run the following program:

R <- 1000; C <- 1000

# Time element-by-element updates on m, which may be a matrix or a data frame.
example <- function(m) {
  cat("rows: ")
  cat(system.time(for (r in 1:R) m[r, 20] <- sqrt(abs(m[r, 20])) + rnorm(1)), "\n")
  cat("columns: ")
  cat(system.time(for (c in 1:C) m[20, c] <- sqrt(abs(m[20, c])) + rnorm(1)), "\n")
  if (is.data.frame(m)) {
    cat("df: columns as names: ")
    cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
  }
}

cat("\n**** Now as matrix\n")
example(matrix(rnorm(C * R), nrow = R))

cat("\n**** Now as data frame\n")
example(as.data.frame(matrix(rnorm(C * R), nrow = R)))

The following are the reported timings under R 2.12.0 on a Mac Pro 3,1
with ample RAM:

matrix, columns: 0.01s
matrix, rows: 0.175s
data frame, columns: 53s
data frame, rows: 56s
data frame, names: 58s

Data frame access is about 5,000 times slower than matrix column
access, and 300 times slower than matrix row access.  R's data frame
operational speed is an amazing 40 data accesses per second.  I have
not seen access numbers this low for decades.
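
The ratios above can be recomputed from the reported timings.  The
snippet below is only a back-of-the-envelope check; the variable names
are made up for illustration, and counting each loop iteration as two
accesses (one read, one write) is an assumption about how the figure of
roughly 40 accesses per second comes about.

```r
# Timings in seconds, copied from the table above.
iterations       <- 1000
t_matrix_columns <- 0.01
t_matrix_rows    <- 0.175
t_df_columns     <- 53
t_df_rows        <- 56

# Slowdown of data frame access relative to matrix access.
ratio_columns <- t_df_columns / t_matrix_columns   # about 5300
ratio_rows    <- t_df_rows / t_matrix_rows         # about 320

# Assumption: each iteration performs one read and one write.
accesses_per_second <- 2 * iterations / t_df_columns   # about 38

cat(ratio_columns, ratio_rows, accesses_per_second, "\n")
```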

How to avoid it?  Not easy.  One way is to create multiple matrices,
and group them as an object.  Of course, this loses a lot of features
of R.  Another way is to copy all data used in calculations out of the
data frame into a matrix, do the operations, and then copy them back.
Not ideal, either.
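
A minimal sketch of the second workaround (copy out, operate, copy
back) might look like the following.  The data frame, its size, and the
"+ 1" update are hypothetical and chosen only so the result is easy to
check; the approach also assumes that all copied columns are numeric,
since a matrix holds a single type.

```r
# Hypothetical data: 1000 rows, 50 numeric columns.
set.seed(1)
df <- as.data.frame(matrix(rnorm(1000 * 50), nrow = 1000))
before <- df[[20]]            # kept only to verify the result afterwards

# Copy the data out of the data frame into a matrix.
m <- as.matrix(df)

# Do the element-by-element work on the matrix, where indexing is fast.
for (r in seq_len(nrow(m))) {
  m[r, 20] <- sqrt(abs(m[r, 20])) + 1
}

# Copy the modified column back into the data frame.
df[[20]] <- m[, 20]
```

Only the column that changed is copied back here; if many columns are
modified, converting the whole matrix with as.data.frame(m) may be
simpler.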

In my opinion, this is an R design flaw.  Data frames are the
fundamental unit of much statistical analysis, and should be fast.  I
think R lacks any indexing into data frames.  Turning on indexing of
data frames should at least be an optional feature.

I hope this post helps others.


Ivo Welch (ivo.welch at gmail.com)
