[Rd] extracting rows from a data frame by looping over the row names: performance issues
Wolfgang Huber
huber at ebi.ac.uk
Fri Mar 2 21:01:17 CET 2007
Hi Hervé
depending on your problem, using "mapply" might help, as in the code
example below:
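a = data.frame(matrix(1:3e4, ncol=3))

# Version 1: extract each row of the data frame inside an explicit loop
print(system.time({
  r1 = numeric(nrow(a))
  for(i in seq_len(nrow(a))) {
    g = a[i,]
    r1[i] = mean(c(g$X1, g$X2, g$X3))
  }
}))

# Version 2: pass the columns to mapply, which walks them in parallel
print(system.time({
  f = function(X1,X2,X3) mean(c(X1, X2, X3))
  r2 = do.call("mapply", args=append(f, a))
}))

# Both versions give the same result
print(identical(r1, r2))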
   user  system elapsed
  6.049   0.200   6.987
   user  system elapsed
  0.508   0.000   0.509
[1] TRUE
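
The do.call() line works because a data frame is a list of columns, so each column is handed to mapply as one argument. The sketch below spells that out on a much smaller table (the size 1:30 is only an assumption to keep the check quick), and also shows the matrix route that Roger mentions below, which should only be considered when all columns are numeric. It is an illustration under those assumptions, not a benchmark:

```r
# Small version of the toy data (hypothetical size, for a quick check only)
a <- data.frame(matrix(1:30, ncol = 3))

f <- function(X1, X2, X3) mean(c(X1, X2, X3))

# A data frame is a list of columns, so this call ...
r2 <- do.call("mapply", args = append(f, a))

# ... passes the same arguments as writing the columns out by hand
r3 <- mapply(f, a$X1, a$X2, a$X3)

# If every column is numeric, a matrix avoids data-frame row extraction
m <- as.matrix(a)
r4 <- rowMeans(m)

# All three approaches agree on the row means
stopifnot(identical(r2, r3), isTRUE(all.equal(r2, unname(r4))))
```

The matrix route does not carry over to data frames whose columns have different classes, since as.matrix() would coerce them to a common type; there the mapply form is the safer choice.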
Best wishes
Wolfgang
Roger D. Peng wrote:
> Extracting rows from data frames is tricky, since each of the columns could be
> of a different class. For your toy example, it seems a matrix would be a more
> reasonable option.
>
> R-devel has some improvements to row extraction, if I remember correctly. You
> might want to try your example there.
>
> -roger
>
> Herve Pages wrote:
>> Hi,
>>
>>
>> I have a big data frame:
>>
>> > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
>> > dat <- as.data.frame(mat)
>>
>> and I need to do some computation on each row. Currently I'm doing this:
>>
>> > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }
>>
>> which could probably be considered a very natural (and R'ish) way of doing it
>> (but maybe I'm wrong and the real idiom for doing this is something different).
>>
>> The problem with this "idiomatic form" is that it is _very_ slow. The loop
>> itself + the simple extraction of the rows (no computation on the rows) takes
>> 10 hours on a powerful server (quad core Linux with 8G of RAM)!
>>
>> Looping over the first 100 rows takes 12 seconds:
>>
>> > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>> user system elapsed
>> 12.637 0.120 12.756
>>
>> But if, instead of the above, I do this:
>>
>> > for (i in seq_len(nrow(dat))) { row <- sapply(dat, function(col) col[i]) }
>>
>> then it's 20 times faster!!
>>
>> > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>> user system elapsed
>> 0.576 0.096 0.673
>>
>> I hope you will agree that this second form is much less natural.
>>
>> So I was wondering why the "idiomatic form" is so slow? Shouldn't the idiomatic
>> form be, not only elegant and easy to read, but also efficient?
>>
>>
>> Thanks,
>> H.
>>
>>
>> > sessionInfo()
>> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
>> [7] "base"
>>
>
--
------------------------------------------------------------------
Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber