[Rd] extracting rows from a data frame by looping over the row names: performance issues
hpages at fhcrc.org
Sat Mar 3 18:28:22 CET 2007
Quoting hpages at fhcrc.org:
> In [.data.frame if you replace this:
>
> ...
> if (is.character(i)) {
>     rows <- attr(xx, "row.names")
>     i <- pmatch(i, rows, duplicates.ok = TRUE)
> }
> ...
>
> by this
>
> ...
> if (is.character(i)) {
>     rows <- attr(xx, "row.names")
>     if (typeof(rows) == "integer")
>         i <- as.integer(i)
>     else
>         i <- pmatch(i, rows, duplicates.ok = TRUE)
> }
> ...
>
> then you get a huge boost:
>
> - with current [.data.frame
> > system.time(for (i in 1:100) dat["1", ])
> user system elapsed
> 34.994 1.084 37.915
>
> - with "patched" [.data.frame
> > system.time(for (i in 1:100) dat["1", ])
> user system elapsed
> 0.264 0.068 0.364
>
mmmh, replacing
    i <- pmatch(i, rows, duplicates.ok = TRUE)
by just
    i <- as.integer(i)
was a bit naive: it is wrong whenever 'rows' is not a seq_len()-like
sequence (i.e. whenever the row names are not 1, 2, ..., nrow).
So I need to be more careful: first call 'match' to find the exact
matches, then call 'pmatch' _only_ on those indices that don't have
an exact match. For example, something like this:
    if (is.character(i)) {
        rows <- attr(xx, "row.names")
        if (typeof(rows) == "integer") {
            i2 <- match(as.integer(i), rows)
            if (any(is.na(i2)))
                i2[is.na(i2)] <- pmatch(i[is.na(i2)], rows,
                                        duplicates.ok = TRUE)
            i <- i2
        } else {
            i <- pmatch(i, rows, duplicates.ok = TRUE)
        }
    }
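The same lookup logic can be pulled out into a standalone helper so it is easy
to test on its own ('match_rows' is a made-up name for this sketch, not part of
base R or of the actual patch):

```r
## Sketch of the patched lookup: exact integer matches first, then
## partial matching only for the names that had no exact match.
match_rows <- function(i, rows)
{
    if (typeof(rows) == "integer") {
        i2 <- match(as.integer(i), rows)   # fast path: exact matches
        na <- is.na(i2)
        if (any(na))                       # fall back to partial matching
            i2[na] <- pmatch(i[na], rows, duplicates.ok = TRUE)
        i2
    } else {
        pmatch(i, rows, duplicates.ok = TRUE)
    }
}

rows <- c(11L, 25L, 1L, 3L)
match_rows(c("1", "3", "2"), rows)   # -> 3 4 2 ("2" partially matches "25")
```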
Correctness:
> dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4,
+                    row.names=c(11,25,1,3))
> dat2
aa bb
11 a 1
25 b 2
1 c 3
3 d 4
> dat2["1",]
aa bb
1 c 3
> dat2["3",]
aa bb
3 d 4
> dat2["2",]
aa bb
25 b 2
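For comparison, the same 'dat2' also shows why the naive as.integer()
replacement above was wrong: it silently turns a row name into a positional
index.

```r
## With row names c(11, 25, 1, 3), treating "1" as a position (what the
## naive as.integer() approach did) selects the wrong row:
dat2 <- data.frame(aa=c('a', 'b', 'c', 'd'), bb=1:4,
                   row.names=c(11,25,1,3))
dat2[as.integer("1"), ]   # positional: the row named "11" (aa="a")
dat2["1", ]               # by name:    the row named "1"  (aa="c")
```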
Performance:
> mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> dat <- as.data.frame(mat)
> system.time(for (i in 1:100) dat["1", ])
user system elapsed
2.036 0.880 2.917
Still about 17 times faster than with the non-patched [.data.frame.
Maybe 'pmatch(x, table, ...)' itself could be improved to be
more efficient when 'x' is a character vector and 'table' an
integer vector, so that the above trick is not needed anymore.
My point is that something can probably be done to improve the
performance of 'dat[i, ]' when the row names are integers and 'i'
is a character vector. I'm assuming that, in the typical use case,
there is an exact match for 'i' in the row names, so converting
those row names to a character vector in order to find this match
is (most of the time) a waste of time.
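A quick way to see that claim (a rough sketch; the size is an assumption and
absolute timings will vary by machine):

```r
## pmatch() has to coerce all 300000 integer row names to character on
## every lookup; exact integer matching with match() avoids that.
rows <- seq_len(300000L)   # integer row names, as as.data.frame() produces
system.time(for (k in 1:100) pmatch("1", rows, duplicates.ok = TRUE))
system.time(for (k in 1:100) match(as.integer("1"), rows))
## both return the same index:
pmatch("1", rows, duplicates.ok = TRUE) == match(as.integer("1"), rows)
```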
Cheers,
H.
> but maybe I'm missing something...
>
> Cheers,
> H.
>
> >
> > If you assign character row names, things will be a bit faster:
> >
> > # before
> > system.time(for (i in 1:25) dat["2", ])
> > user system elapsed
> > 9.337 0.404 10.731
> >
> > # this looks funny, but has the desired result
> > rownames(dat) <- rownames(dat)
> > typeof(attr(dat, "row.names"))
> >
> > # after
> > system.time(for (i in 1:25) dat["2", ])
> > user system elapsed
> > 0.343 0.226 0.608
> >
> > And you probably would have seen this if you had looked at the
> > profiling data:
> >
> > Rprof()
> > for (i in 1:25) dat["2", ]
> > Rprof(NULL)
> > summaryRprof()
> >
> >
> > + seth
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>