[Rd] extracting rows from a data frame by looping over the row names: performance issues

Fri Mar 2 23:13:13 CET 2007

Roger D. Peng wrote:
> Extracting rows from data frames is tricky, since each of the columns
> could be of a different class.  For your toy example, it seems a matrix
> would be a more reasonable option.

There is no doubt about this ;-)

  > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
  > dat <- as.data.frame(mat)

With the matrix:

  > system.time(for (i in 1:100) { row <- mat[i, ] })
     user  system elapsed
        0       0       0

With the data frame:

  > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
     user  system elapsed
   12.565   0.296  12.859
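
For anyone who wants to reproduce the comparison without the long wait, here is a scaled-down sketch. The row count `n`, the helper `time_rows` and the `extractors` list are my own names chosen for illustration and are not part of the timings above. Absolute numbers will depend on the R version and the machine, so no expected output is shown.

```r
# Scaled-down version of the experiment (n is an arbitrary, smaller size).
n <- 20000
mat <- matrix(rep(paste(letters, collapse = ""), 5 * n), ncol = 5)
dat <- as.data.frame(mat)
keys <- row.names(dat)

# Hypothetical helper: time the extraction of the first k rows
# using the extractor function passed in.
time_rows <- function(extract, k = 100) {
  system.time(for (i in seq_len(k)) row <- extract(i))
}

# Three ways of getting at row i.
extractors <- list(
  matrix_by_index = function(i) mat[i, ],
  df_by_index     = function(i) dat[i, ],
  df_by_name      = function(i) dat[keys[i], ]
)

# Elapsed seconds for each approach.
elapsed <- sapply(extractors, function(f) time_rows(f)[["elapsed"]])
print(elapsed)
```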

And even with a mixed-type data frame, it's very tempting to convert it
to a matrix before looping over it:

  > dat2 <- as.data.frame(mat, stringsAsFactors=FALSE)
  > dat2 <- cbind(dat2, ii=1:300000)
  > sapply(dat2, typeof)
           V1          V2          V3          V4          V5          ii
  "character" "character" "character" "character" "character"   "integer"

  > system.time(for (key in row.names(dat2)[1:100]) { row <- dat2[key, ] })
     user  system elapsed
   13.201   0.144  13.360

  > system.time({mat2 <- as.matrix(dat2); for (i in 1:100) { row <- mat2[i, ] }})
     user  system elapsed
    0.128   0.036   0.163

Big win, isn't it? (Only if you have enough memory for it, though...)
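
One caveat worth spelling out: as.matrix() on a mixed-type data frame coerces every column to a single common type (character here), so a column like `ii` has to be converted back inside the loop. A minimal sketch, using a made-up toy data frame `df` rather than the `dat2` above:

```r
# Toy mixed-type data frame (hypothetical, for illustration only).
df <- data.frame(V1 = c("abc", "def", "ghi"), ii = 1:3,
                 stringsAsFactors = FALSE)

m <- as.matrix(df)
typeof(m)                      # "character": the integer column was coerced

row <- m[2, ]                  # named character vector: V1, ii
as.integer(row[["ii"]]) + 1L   # convert back before doing arithmetic
```

If the per-row computation needs the original types, this conversion cost has to be weighed against the time saved on extraction.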

Cheers,
H.

> 
> R-devel has some improvements to row extraction, if I remember
> correctly.  You might want to try your example there.
> 
> -roger
> 
> Herve Pages wrote:
>> Hi,
>>
>>
>> I have a big data frame:
>>
>>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>>   > dat <- as.data.frame(mat)
>>
>> and I need to do some computation on each row. Currently I'm doing this:
>>
>>   > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }
>>
>> which could probably be considered a very natural (and R'ish) way of doing it
>> (but maybe I'm wrong and the real idiom for doing this is something different).
>>
>> The problem with this "idiomatic form" is that it is _very_ slow. The loop
>> itself + the simple extraction of the rows (no computation on the rows) takes
>> 10 hours on a powerful server (quad core Linux with 8G of RAM)!
>>
>> Looping over the first 100 rows takes 12 seconds:
>>
>>   > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>>      user  system elapsed
>>    12.637   0.120  12.756
>>
>> But if, instead of the above, I do this:
>>
>>   > for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }
>>
>> then it's 20 times faster!!
>>
>>   > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>>      user  system elapsed
>>     0.576   0.096   0.673
>>
>> I hope you will agree that this second form is much less natural.
>>
>> So I was wondering why the "idiomatic form" is so slow? Shouldn't the
>> idiomatic form be, not only elegant and easy to read, but also efficient?
>>
>>
>> Thanks,
>> H.
>>
>>
>>   > sessionInfo()
>> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>
>>
>> attached base packages:
>> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
>> [7] "base"
>>