[Rd] extracting rows from a data frame by looping over the row names: performance issues

Herve Pages hpages at fhcrc.org
Fri Mar 2 19:39:37 CET 2007


Hi,


I have a big data frame:

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)

and I need to do some computation on each row. Currently I'm doing this:

  > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }

which could probably considered a very natural (and R'ish) way of doing it
(but maybe I'm wrong and the real idiom for doing this is something different).

The problem with this "idiomatic form" is that it is _very_ slow. The loop
itself + the simple extraction of the rows (no computation on the rows) takes
10 hours on a powerful server (quad core Linux with 8G of RAM)!

Looping over the first 100 rows takes 12 seconds:

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
     user  system elapsed
   12.637   0.120  12.756

But if, instead of the above, I do this:

  > for (i in nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

then it's 20 times faster!!

  > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
     user  system elapsed
    0.576   0.096   0.673

I hope you will agree that this second form is much less natural.

So I was wondering why the "idiomatic form" is so slow? Shouldn't the idiomatic
form be, not only elegant and easy to read, but also efficient?


Thanks,
H.


> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
[7] "base"



More information about the R-devel mailing list