[Rd] extracting rows from a data frame by looping over the row names: performance issues
Herve Pages
hpages at fhcrc.org
Fri Mar 2 23:51:53 CET 2007
Ulf Martin wrote:
> Here is an even faster one; the general point is to create a properly
> vectorized custom function/expression:
>
> mymean <- function(x, y, z) (x+y+z)/3
>
> a = data.frame(matrix(1:3e4, ncol=3))
> attach(a)
> print(system.time({r3 = mymean(X1,X2,X3)}))
> detach(a)
>
> # Yields:
> # [1] 0.000 0.010 0.005 0.000 0.000
>
Very fast indeed! And you don't need the attach/detach trick to make your point
since it is (almost) as fast without it:
a = data.frame(matrix(1:3e4, ncol=3))
print(system.time({r3 = mymean(a$X1,a$X2,a$X3)}))
However, you are lucky here because in this example (the "mean" example), you can
use vectorized arithmetic which is of course very fast.
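
To make the contrast concrete, here is a small sketch that is not part of the original exchange. It checks, on a smaller data frame, that the vectorized expression and base R's rowMeans() give the same result as an explicit per-row loop. The names n, r_loop, r_vec and r_rowmeans are illustrative only.

```r
# Illustrative sketch; n, r_loop, r_vec and r_rowmeans are made-up names.
mymean <- function(x, y, z) (x + y + z) / 3

n <- 1000
a <- data.frame(matrix(1:(3 * n), ncol = 3))  # columns are named X1, X2, X3

# Per-row loop: one mean() call for each row
r_loop <- numeric(n)
for (i in seq_len(n)) {
  r_loop[i] <- mean(c(a$X1[i], a$X2[i], a$X3[i]))
}

# Vectorized arithmetic: operates on whole columns at once
r_vec <- mymean(a$X1, a$X2, a$X3)

# Built-in row means (drops the row names it attaches to the result)
r_rowmeans <- unname(rowMeans(a))

print(all.equal(r_loop, r_vec))
print(all.equal(r_loop, r_rowmeans))
```

Wrapping each variant in system.time() shows how large the gap is on a given machine; the exact numbers will depend on the R version and hardware.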
What about the general case? Unfortunately situations where you can "properly vectorize"
tend to be much more frequent in tutorials and demos than in the real world.
Maybe the "mean" example is a little bit too specific to answer the
general question of "what's the best way to _efficiently_ step through a
data frame row by row".
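
For the general case, the following is only a sketch under stated assumptions, not a definitive answer. It takes a per-row computation that is not trivially vectorizable (row_fun, a hypothetical stand-in) and applies it in three ways: data-frame row extraction inside the loop, positional indexing into the columns taken once as a plain list, and mapply over the columns as suggested earlier in this thread. The data frame and all names are made up for illustration.

```r
# Hypothetical per-row computation that is not trivially vectorizable.
row_fun <- function(row) paste(unlist(row), collapse = "-")

# Small made-up data frame with columns of different types.
dat <- data.frame(id = 1:200, label = rep(c("a", "b"), 100),
                  stringsAsFactors = FALSE)
n <- nrow(dat)

# (1) Extract a one-row data frame on every iteration.
r1 <- character(n)
for (i in seq_len(n)) r1[i] <- row_fun(dat[i, ])

# (2) Take the columns once as a plain list, then index each column by position.
cols <- as.list(dat)
r2 <- character(n)
for (i in seq_len(n)) r2[i] <- row_fun(lapply(cols, `[`, i))

# (3) Let mapply walk all columns in parallel, one element of each per call.
r3 <- unname(do.call(mapply,
                     c(list(FUN = function(...) row_fun(list(...))), cols)))

print(identical(r1, r2))
print(identical(r1, r3))
```

All three variants compute the same values; which one is fastest is best measured with system.time() on the real data, since it depends on the number of rows, the column types and the R version.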
Cheers,
H.
> print(identical(r2, r3))
> # [1] TRUE
>
> # My values for versions 1 and 2, respectively, were:
> # time for r1:
> [1] 29.420 23.090 60.093 0.000 0.000
>
> # time for r2:
> [1] 1.400 0.050 1.505 0.000 0.000
>
> Best wishes
> Ulf
>
>
> P.S. A somewhat more meaningful comparison of version 2 and 3:
>
> a = data.frame(matrix(1:3e5, ncol=3))
> # time r2e5:
> [1] 12.04 0.15 12.92 0.00 0.00
>
> # time r3e5:
> [1] 0.030 0.020 0.051 0.000 0.000
>
>> depending on your problem, using "mapply" might help, as in the code
>> example below:
>>
>> a = data.frame(matrix(1:3e4, ncol=3))
>>
>> print(system.time({
>> r1 = numeric(nrow(a))
>> for(i in seq_len(nrow(a))) {
>> g = a[i,]
>> r1[i] = mean(c(g$X1, g$X2, g$X3))
>> }}))
>>
>> print(system.time({
>> f = function(X1,X2,X3) mean(c(X1, X2, X3))
>> r2 = do.call("mapply", args=append(f, a))
>> }))
>>
>> print(identical(r1, r2))
>>
>> # user system elapsed
>> 6.049 0.200 6.987
>> user system elapsed
>> 0.508 0.000 0.509
>> [1] TRUE
>>
>> Best wishes
>> Wolfgang
>>
>> Roger D. Peng wrote:
>>> Extracting rows from data frames is tricky, since each of the columns could be
>>> of a different class. For your toy example, it seems a matrix would be a more
>>> reasonable option.
>>>
>>> R-devel has some improvements to row extraction, if I remember correctly. You
>>> might want to try your example there.
>>>
>>> -roger
>>>
>>> Herve Pages wrote:
>>>> Hi,
>>>>
>>>>
>>>> I have a big data frame:
>>>>
>>>> > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>>>> > dat <- as.data.frame(mat)
>>>>
>>>> and I need to do some computation on each row. Currently I'm doing this:
>>>>
>>>> > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }
>>>>
>>>> which could probably be considered a very natural (and R'ish) way of doing it
>>>> (but maybe I'm wrong and the real idiom for doing this is something different).
>>>>
>>>> The problem with this "idiomatic form" is that it is _very_ slow. The loop
>>>> itself + the simple extraction of the rows (no computation on the rows) takes
>>>> 10 hours on a powerful server (quad core Linux with 8G of RAM)!
>>>>
>>>> Looping over the first 100 rows takes 12 seconds:
>>>>
>>>> > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>>>> user system elapsed
>>>> 12.637 0.120 12.756
>>>>
>>>> But if, instead of the above, I do this:
>>>>
>>>> > for (i in seq_len(nrow(dat))) { row <- sapply(dat, function(col) col[i]) }
>>>>
>>>> then it's 20 times faster!!
>>>>
>>>> > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>>>> user system elapsed
>>>> 0.576 0.096 0.673
>>>>
>>>> I hope you will agree that this second form is much less natural.
>>>>
>>>> So I was wondering why the "idiomatic form" is so slow? Shouldn't the idiomatic
>>>> form be, not only elegant and easy to read, but also efficient?
>>>>
>>>>
>>>> Thanks,
>>>> H.
>>>>
>>>>
>>>>> sessionInfo()
>>>> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
>>>> [7] "base"
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>
>