[Rd] extracting rows from a data frame by looping over the row names: performance issues

Ulf Martin ulfmartin at web.de
Fri Mar 2 23:19:09 CET 2007


Here is an even faster one; the general point is to create a properly
vectorized custom function/expression:

mymean <- function(x, y, z) (x+y+z)/3

a = data.frame(matrix(1:3e4, ncol=3))
attach(a)
print(system.time({r3 = mymean(X1,X2,X3)}))
detach(a)

# Yields:
# [1] 0.000 0.010 0.005 0.000 0.000

print(identical(r2, r3))
# [1] TRUE
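
If attach()/detach() feels fragile, the same vectorized call can be written with with(), and for a plain mean the built-in rowMeans() should give the same numbers. A minimal sketch on the same toy data (the names r3w and r4 are made up here for illustration):

```r
mymean <- function(x, y, z) (x + y + z) / 3

a <- data.frame(matrix(1:3e4, ncol = 3))

# with() evaluates the expression using the columns of 'a',
# so nothing has to be attached to the search path
r3w <- with(a, mymean(X1, X2, X3))

# rowMeans() on the numeric matrix; unname() drops any row names
# so the result is a bare numeric vector like r3w
r4 <- unname(rowMeans(as.matrix(a)))

print(all.equal(r3w, r4))
```

Both forms do all the arithmetic on whole columns at once, which is where the speed comes from.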

# My values for versions 1 and 2, respectively, were
# time for r1:
[1] 29.420 23.090 60.093  0.000  0.000

# time for r2:
[1] 1.400 0.050 1.505 0.000 0.000

Best wishes
Ulf


P.S. A somewhat more meaningful comparison of version 2 and 3:

a = data.frame(matrix(1:3e5, ncol=3))
# time r2e5:
[1] 12.04  0.15 12.92  0.00  0.00

# time r3e5:
[1] 0.030 0.020 0.051 0.000 0.000
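
For anyone who wants to repeat this comparison, here is a rough sketch of how it could be timed. It is not the exact script I ran: the mapply call is written out explicitly, with() replaces attach(), and the names t2 and t3 are introduced only for this example. Timings will of course differ between machines.

```r
f      <- function(X1, X2, X3) mean(c(X1, X2, X3))  # version 2: called once per row
mymean <- function(x, y, z) (x + y + z) / 3         # version 3: one vectorized call

a <- data.frame(matrix(1:3e5, ncol = 3))

# version 2: mapply calls f() once for every row
t2 <- system.time(r2e5 <- mapply(f, a$X1, a$X2, a$X3))

# version 3: a single arithmetic expression over whole columns
t3 <- system.time(r3e5 <- with(a, mymean(X1, X2, X3)))

print(t2)
print(t3)
print(all.equal(r2e5, r3e5))
```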

> depending on your problem, using "mapply" might help, as in the code 
> example below:
> 
> a = data.frame(matrix(1:3e4, ncol=3))
> 
> print(system.time({
> r1 = numeric(nrow(a))
> for(i in seq_len(nrow(a))) {
>    g = a[i,]
>    r1[i] = mean(c(g$X1, g$X2, g$X3))
> }}))
> 
> print(system.time({
> f = function(X1,X2,X3) mean(c(X1, X2, X3))
> r2 = do.call("mapply", args=append(f, a))
> }))
> 
> print(identical(r1, r2))
> 
> #   user  system elapsed
>    6.049   0.200   6.987
>     user  system elapsed
>    0.508   0.000   0.509
> [1] TRUE
> 
>   Best wishes
>    Wolfgang
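
A note on the do.call() line above, since it is a bit terse: as far as I can tell, append(f, a) builds a list with the function first and the columns of the data frame after it, so the call amounts to mapply(f, X1, X2, X3). A small sketch that spells this out on a reduced data set (300 rows are used only to keep the loop quick; the names r2a and r2b are made up for this example):

```r
a <- data.frame(matrix(1:900, ncol = 3))
f <- function(X1, X2, X3) mean(c(X1, X2, X3))

# explicit form: one vector argument per column
r2a <- mapply(f, a$X1, a$X2, a$X3)

# do.call form: function plus columns collected in one argument list
r2b <- do.call(mapply, c(list(FUN = f), a))

# reference result from the row-by-row loop
r1 <- numeric(nrow(a))
for (i in seq_len(nrow(a))) {
  g <- a[i, ]
  r1[i] <- mean(c(g$X1, g$X2, g$X3))
}

print(identical(r2a, r2b))
print(all.equal(r1, r2a))
```

The do.call() form has the advantage that it does not need to know the number of columns in advance.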
> 
> Roger D. Peng wrote:
>> Extracting rows from data frames is tricky, since each of the columns could be 
>> of a different class.  For your toy example, it seems a matrix would be a more 
>> reasonable option.
>>
>> R-devel has some improvements to row extraction, if I remember correctly.  You 
>> might want to try your example there.
>>
>> -roger
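
To illustrate Roger's matrix suggestion, here is a hedged sketch that times row extraction from the data frame and from the underlying matrix. It uses Herve's toy data, scaled down to 1000 rows so that it finishes quickly; the names t_df and t_mat are assumptions of this example, and no particular speed ratio is claimed.

```r
# Herve's toy data, reduced from 300000 rows to 1000
mat <- matrix(rep(paste(letters, collapse = ""), 5 * 1000), ncol = 5)
dat <- as.data.frame(mat)

# row extraction from the data frame: each result is a one-row data frame
t_df <- system.time(
  for (i in seq_len(nrow(dat))) row <- dat[i, ]
)

# row extraction from the matrix: each result is a plain character vector
t_mat <- system.time(
  for (i in seq_len(nrow(mat))) row <- mat[i, ]
)

print(t_df)
print(t_mat)
```

This only helps when all columns share one type, as they do in the toy example.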
>>
>> Herve Pages wrote:
>>> Hi,
>>>
>>>
>>> I have a big data frame:
>>>
>>>   > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>>>   > dat <- as.data.frame(mat)
>>>
>>> and I need to do some computation on each row. Currently I'm doing this:
>>>
>>>   > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }
>>>
>>> which could probably be considered a very natural (and R'ish) way of doing it
>>> (but maybe I'm wrong and the real idiom for doing this is something different).
>>>
>>> The problem with this "idiomatic form" is that it is _very_ slow. The loop
>>> itself + the simple extraction of the rows (no computation on the rows) takes
>>> 10 hours on a powerful server (quad core Linux with 8G of RAM)!
>>>
>>> Looping over the first 100 rows takes 12 seconds:
>>>
>>>   > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>>>      user  system elapsed
>>>    12.637   0.120  12.756
>>>
>>> But if, instead of the above, I do this:
>>>
>>>   > for (i in seq_len(nrow(dat))) { row <- sapply(dat, function(col) col[i]) }
>>>
>>> then it's 20 times faster!!
>>>
>>>   > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>>>      user  system elapsed
>>>     0.576   0.096   0.673
>>>
>>> I hope you will agree that this second form is much less natural.
>>>
>>> So I was wondering why the "idiomatic form" is so slow? Shouldn't the idiomatic
>>> form be, not only elegant and easy to read, but also efficient?
>>>
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>>   > sessionInfo()
>>> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
>>> [7] "base"


