[Rd] extracting rows from a data frame by looping over the row names: performance issues
Roger D. Peng
rdpeng at gmail.com
Fri Mar 2 20:43:07 CET 2007
Extracting rows from data frames is tricky, since each of the columns could be
of a different class. For your toy example, it seems a matrix would be a more
reasonable option.
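
A minimal sketch of the matrix approach, assuming every column really is of the same type, as in the toy example. The reduced row count `n` and the per-row function `row_nchar` are placeholders invented for illustration, not part of the original code:

```r
# Smaller than the original 300000 rows so the sketch runs quickly.
n <- 1000
mat <- matrix(rep(paste(letters, collapse = ""), 5 * n), ncol = 5)

# Hypothetical per-row computation: total number of characters in the row.
row_nchar <- function(row) sum(nchar(row))

# Loop over rows by integer index; mat[i, ] is a plain character vector,
# so no one-row data frame has to be built on each iteration.
res_loop <- numeric(n)
for (i in seq_len(nrow(mat))) {
  res_loop[i] <- row_nchar(mat[i, ])
}

# The same computation written with apply() over rows (MARGIN = 1).
res_apply <- apply(mat, 1, row_nchar)
```

Whether the explicit loop or apply() reads better is a matter of taste; both avoid going through the data frame row-extraction code.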
R-devel has some improvements to row extraction, if I remember correctly. You
might want to try your example there.
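
To compare versions, something like the following could be run unchanged under each R build. It is only a sketch: the reduced row count `n` and the variable names are assumptions made for illustration, and the timings will depend on the machine and the R version, so none are shown here:

```r
n <- 2000
mat <- matrix(rep(paste(letters, collapse = ""), 5 * n), ncol = 5)
dat <- as.data.frame(mat, stringsAsFactors = FALSE)

keys <- row.names(dat)[1:100]

# Row extraction by row name, as in the original loop.
t_name <- system.time(for (key in keys) { row <- dat[key, ] })

# Row extraction by integer index on the data frame.
t_index <- system.time(for (i in 1:100) { row <- dat[i, ] })

# Column-wise extraction, as in the faster form from the original message.
t_cols <- system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })

# Row extraction from the underlying matrix.
t_mat <- system.time(for (i in 1:100) { row <- mat[i, ] })

# Collect the elapsed times side by side.
elapsed <- c(by_name   = t_name[["elapsed"]],
             by_index  = t_index[["elapsed"]],
             by_column = t_cols[["elapsed"]],
             matrix    = t_mat[["elapsed"]])
print(elapsed)
```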
-roger
Herve Pages wrote:
> Hi,
>
>
> I have a big data frame:
>
> > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> > dat <- as.data.frame(mat)
>
> and I need to do some computation on each row. Currently I'm doing this:
>
> > for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row... }
>
> which could probably be considered a very natural (and R'ish) way of doing it
> (but maybe I'm wrong and the real idiom for doing this is something different).
>
> The problem with this "idiomatic form" is that it is _very_ slow. The loop
> itself + the simple extraction of the rows (no computation on the rows) takes
> 10 hours on a powerful server (quad core Linux with 8G of RAM)!
>
> Looping over the first 100 rows takes 12 seconds:
>
> > system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
>    user  system elapsed
>  12.637   0.120  12.756
>
> But if, instead of the above, I do this:
>
> > for (i in seq_len(nrow(dat))) { row <- sapply(dat, function(col) col[i]) }
>
> then it's 20 times faster!!
>
> > system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
>    user  system elapsed
>   0.576   0.096   0.673
>
> I hope you will agree that this second form is much less natural.
>
> So I was wondering why the "idiomatic form" is so slow? Shouldn't the idiomatic
> form be, not only elegant and easy to read, but also efficient?
>
>
> Thanks,
> H.
>
>
> > sessionInfo()
> R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
> [7] "base"
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Roger D. Peng | http://www.biostat.jhsph.edu/~rpeng/