[Rd] extracting rows from a data frame by looping over the row names: performance issues
hadley wickham
h.wickham at gmail.com
Sat Mar 3 20:13:07 CET 2007
On 3/3/07, hpages at fhcrc.org <hpages at fhcrc.org> wrote:
> Quoting hpages at fhcrc.org:
> > In [.data.frame if you replace this:
> >
> > ...
> > if (is.character(i)) {
> >     rows <- attr(xx, "row.names")
> >     i <- pmatch(i, rows, duplicates.ok = TRUE)
> > }
> > ...
> >
> > by this
> >
> > ...
> > if (is.character(i)) {
> >     rows <- attr(xx, "row.names")
> >     if (typeof(rows) == "integer")
> >         i <- as.integer(i)
> >     else
> >         i <- pmatch(i, rows, duplicates.ok = TRUE)
> > }
> > ...
> >
> > then you get a huge boost:
> >
> > - with current [.data.frame
> > > system.time(for (i in 1:100) dat["1", ])
> > user system elapsed
> > 34.994 1.084 37.915
> >
> > - with "patched" [.data.frame
> > > system.time(for (i in 1:100) dat["1", ])
> > user system elapsed
> > 0.264 0.068 0.364
> >
>
> mmmh, replacing
> i <- pmatch(i, rows, duplicates.ok = TRUE)
> by just
> i <- as.integer(i)
> was a bit naive. It will be wrong if rows is not a "seq_len" sequence.
>
> So I need to be more careful: first call 'match' to find the exact
> matches, then call 'pmatch' _only_ on those indices that don't have
> an exact match. For example, something like this:
>
> if (is.character(i)) {
>     rows <- attr(xx, "row.names")
>     if (typeof(rows) == "integer") {
>         i2 <- match(as.integer(i), rows)
>         if (any(is.na(i2)))
>             i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows, duplicates.ok = TRUE)
>         i <- i2
>     } else {
>         i <- pmatch(i, rows, duplicates.ok = TRUE)
>     }
> }
>
> Correctness:
>
> > dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4, row.names=c(11,25,1,3))
> > dat2
> aa bb
> 11 a 1
> 25 b 2
> 1 c 3
> 3 d 4
>
> > dat2["1",]
> aa bb
> 1 c 3
>
> > dat2["3",]
> aa bb
> 3 d 4
>
> > dat2["2",]
> aa bb
> 25 b 2
>
> Performance:
>
> > mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> > dat <- as.data.frame(mat)
> > system.time(for (i in 1:100) dat["1", ])
> user system elapsed
> 2.036 0.880 2.917
>
> Still about 17 times faster in user time (13 times in elapsed time) than
> with the non-patched [.data.frame.
>
> Maybe 'pmatch(x, table, ...)' itself could be improved to be
> more efficient when 'x' is a character vector and 'table' an
> integer vector so the above trick is not needed anymore.
>
> My point is that something can probably be done to improve the
> performance of 'dat[i, ]' when the row names are integer and 'i'
> a character vector. I'm assuming that, in the typical use-case,
> there is an exact match for 'i' in the row names so converting
> those row names to a character vector in order to find this match
> is (most of the time) a waste of time.
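The exact-then-partial lookup quoted above can be tried outside of [.data.frame with a small standalone helper. This is only a sketch: the name lookup_rows is hypothetical (it is not part of base R or of the proposed patch), and it assumes the row names are stored as an integer vector.

```r
# Hypothetical helper, not part of base R: resolve character indices
# against integer row names, exact match first, partial match as fallback.
lookup_rows <- function(i, rows) {
  stopifnot(is.character(i), is.integer(rows))
  # Exact matches via integer comparison; non-numeric strings become NA.
  idx <- match(suppressWarnings(as.integer(i)), rows)
  miss <- is.na(idx)
  if (any(miss)) {
    # pmatch() converts 'rows' to character, so call it only when needed.
    idx[miss] <- pmatch(i[miss], rows, duplicates.ok = TRUE)
  }
  idx
}

rows <- c(11L, 25L, 1L, 3L)
res <- lookup_rows(c("1", "3", "2", "x"), rows)
print(res)
```

With these row names, "1" and "3" are resolved by the exact integer match, "2" falls through to pmatch and partially matches "25", and "x" matches nothing and yields NA.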
But why bother? If you know the index of the row, why not index with
a numeric vector rather than a string? The behaviour in that case
seems obvious and fast.
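One caveat, as a small illustration rather than anything definitive: a number and a string only select the same row when the row names are the default 1..n. The sketch below reuses dat2 from the quoted message, with the row names written as explicit integers (an assumption made here for clarity).

```r
# dat2 as in the quoted message, with integer row names spelled out.
dat2 <- data.frame(aa = c("a", "b", "c", "d"), bb = 1:4,
                   row.names = c(11L, 25L, 1L, 3L))

by_position <- dat2[1, ]    # first row by position: the row named "11"
by_name     <- dat2["1", ]  # row whose name is "1": the third row

print(by_position)
print(by_name)
```

So indexing with a number is the natural choice when the position is known, while a string is needed when only the row name is known.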
Hadley