[Rd] Inefficiency in df$col
Radford Neal
r@d|ord @end|ng |rom c@@toronto@edu
Sun Feb 3 18:04:55 CET 2019
While doing some performance testing with the new version of pqR (see
pqR-project.org), I've encountered an extreme, and quite unnecessary,
inefficiency in the current R Core implementation of R, which I think
you might want to correct.
The inefficiency is in access to columns of a data frame, as in
expressions such as df$col[i], which I think are very common (the
alternatives of df[i,"col"] and df[["col"]][i] are, I think, less
common).
Here is the setup for an example showing the issue:
L <- list (abc=1:9, xyz=11:19)
Lc <- L; class(Lc) <- "glub"
df <- data.frame(L)
And here are some times for R-3.5.2 (r-devel of 2019-02-01 is much
the same):
> system.time (for (i in 1:1000000) r <- L$xyz)
   user  system elapsed
  0.086   0.004   0.089
> system.time (for (i in 1:1000000) r <- Lc$xyz)
   user  system elapsed
  0.494   0.000   0.495
> system.time (for (i in 1:1000000) r <- df$xyz)
   user  system elapsed
  3.425   0.000   3.426
So accessing a column of a data frame is 38 times slower than
accessing a list element (which is what happens in the underlying
implementation of a data frame), and 7 times slower than accessing an
element of a list with a class attribute (for which it's necessary to
check whether there is a $.glub method, which there isn't here).
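One way to see where the extra cost comes from is to ask R whether an S3 method for $ is registered for each object's class. The following is a minimal sketch, not part of the timings above. It repeats the setup so that it runs on its own, and what it reports for the data frame depends on the R version in use.

```r
# Same setup as in the example above.
L  <- list(abc = 1:9, xyz = 11:19)
Lc <- L; class(Lc) <- "glub"
df <- data.frame(L)

# getS3method(..., optional = TRUE) returns NULL when no method is found,
# in which case the primitive $ does the access itself.
m_list <- getS3method("$", "list", optional = TRUE)
m_glub <- getS3method("$", "glub", optional = TRUE)
m_df   <- getS3method("$", "data.frame", optional = TRUE)

print(is.null(m_list))  # no method for plain lists
print(is.null(m_glub))  # no $.glub method; only the lookup is paid for
print(is.null(m_df))    # FALSE where $.data.frame is defined, as in R-3.5.2

# Whatever the dispatch path, all three accesses return the same value.
r_list <- L$xyz
r_glub <- Lc$xyz
r_df   <- df$xyz
```

The point of the sketch is only that the three objects hold identical data, so any timing difference comes from how $ is dispatched rather than from the access itself.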
For comparison, here are the times for pqR-2019-01-25:
> system.time (for (i in 1:1000000) r <- L$xyz)
   user  system elapsed
  0.057   0.000   0.058
> system.time (for (i in 1:1000000) r <- Lc$xyz)
   user  system elapsed
  0.251   0.000   0.251
> system.time (for (i in 1:1000000) r <- df$xyz)
   user  system elapsed
  0.247   0.000   0.247
So when accessing df$xyz, R-3.5.2 is 14 times slower than pqR-2019-01-25.
(For a partial match, like df$xy, R-3.5.2 is 34 times slower.)
I wasn't surprised that pqR was faster, but I didn't expect this big a
difference. Then I remembered having seen a NEWS item from R-3.1.0:
* Partial matching when using the $ operator _on data frames_ now
throws a warning and may become defunct in the future. If partial
matching is intended, replace foo$bar by foo[["bar", exact =
FALSE]].
and having looked at the code then:
`$.data.frame` <- function(x,name) {
    a <- x[[name]]
    if (!is.null(a)) return(a)
    a <- x[[name, exact=FALSE]]
    if (!is.null(a)) warning("Name partially matched in data frame")
    return(a)
}
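The two-step lookup in this method relies on the exact argument of [[. A small sketch of that behaviour, using a data frame with the same columns as the earlier example (the variable names are made up for illustration):

```r
df <- data.frame(abc = 1:9, xyz = 11:19)

# Exact matching is the default for [[ on a data frame.
exact_hit   <- df[["xyz"]]                # the xyz column
exact_miss  <- df[["xy"]]                 # NULL: no column is named exactly "xy"

# With exact = FALSE, a unique prefix is enough.
partial_hit <- df[["xy", exact = FALSE]]  # the xyz column again

print(exact_hit)
print(is.null(exact_miss))
print(identical(partial_hit, exact_hit))
```

So for a full name like df$xyz the method returns after the first [[ call, while a partial name like df$xy pays for both calls, all of it executed as R code rather than inside the primitive.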
I recall thinking at the time that this involved a pretty big
performance hit, compared to letting the primitive $ operator do it,
just to produce a warning. But it wasn't until now that I noticed
this NEWS in R-3.1.1:
* The warning when using partial matching with the $ operator on
data frames is now only given when
options("warnPartialMatchDollar") is TRUE.
for which the code was changed to:
`$.data.frame` <- function(x,name) {
    a <- x[[name]]
    if (!is.null(a)) return(a)
    a <- x[[name, exact=FALSE]]
    if (!is.null(a) && getOption("warnPartialMatchDollar", default=FALSE)) {
        names <- names(x)
        warning(gettextf("Partial match of '%s' to '%s' in data frame",
                         name, names[pmatch(name, names)]))
    }
    return(a)
}
One can see the effect now when warnPartialMatchDollar is enabled:
> options(warnPartialMatchDollar=TRUE)
> Lc$xy
[1] 11 12 13 14 15 16 17 18 19
Warning message:
In Lc$xy : partial match of 'xy' to 'xyz'
> df$xy
[1] 11 12 13 14 15 16 17 18 19
Warning message:
In `$.data.frame`(df, xy) : Partial match of 'xy' to 'xyz' in data frame
So the only thing that slowing down accesses like df$xyz by a factor of
seven achieves now is to add the words "in data frame" to the warning
message (while making the earlier part of the message less intelligible).
I think you might want to just delete the definition of $.data.frame,
reverting to the situation before R-3.1.0.
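For code that has to run on a version of R where the method is still defined, the dispatch can be sidestepped by hand. The following is a rough sketch under that assumption, not a recommendation from the measurements above. The loop count is arbitrary, and the timings it prints will vary by machine and R version.

```r
df <- data.frame(abc = 1:9, xyz = 11:19)

via_dollar  <- df$xyz               # goes through $.data.frame where defined
via_subset2 <- .subset2(df, "xyz")  # [[ without any method dispatch
via_unclass <- unclass(df)$xyz      # primitive $ on the underlying list

# Same comparison as the timings earlier, with a smaller loop.
n <- 100000
t_dollar  <- system.time(for (i in seq_len(n)) r <- df$xyz)
t_subset2 <- system.time(for (i in seq_len(n)) r <- .subset2(df, "xyz"))
print(t_dollar)
print(t_subset2)
```

Note that .subset2 matches names exactly, so it is not a drop-in replacement where partial matching is relied upon.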
Radford Neal