[Rd] Inefficiency in df$col
Duncan Murdoch
murdoch@dunc@n @end|ng |rom gm@||@com
Mon Feb 4 02:41:16 CET 2019
On 03/02/2019 12:04 p.m., Radford Neal wrote:
> While doing some performance testing with the new version of pqR (see
> pqR-project.org), I've encountered an extreme, and quite unnecessary,
> inefficiency in the current R Core implementation of R, which I think
> you might want to correct.
>
> The inefficiency is in access to columns of a data frame, as in
> expressions such as df$col[i], which I think are very common (the
> alternatives of df[i,"col"] and df[["col"]][i] are, I think, less
> common).
>
> Here is the setup for an example showing the issue:
>
> L <- list (abc=1:9, xyz=11:19)
> Lc <- L; class(Lc) <- "glub"
> df <- data.frame(L)
>
> And here are some times for R-3.5.2 (r-devel of 2019-02-01 is much
> the same):
>
> > system.time (for (i in 1:1000000) r <- L$xyz)
>    user  system elapsed
>   0.086   0.004   0.089
> > system.time (for (i in 1:1000000) r <- Lc$xyz)
>    user  system elapsed
>   0.494   0.000   0.495
> > system.time (for (i in 1:1000000) r <- df$xyz)
>    user  system elapsed
>   3.425   0.000   3.426
>
> So accessing a column of a data frame is 38 times slower than
> accessing a list element (which is what happens in the underlying
> implementation of a data frame), and 7 times slower than accessing an
> element of a list with a class attribute (for which it's necessary to
> check whether there is a $.glub method, which there isn't here).
>
> For comparison, here are the times for pqR-2019-01-25:
>
> > system.time (for (i in 1:1000000) r <- L$xyz)
>    user  system elapsed
>   0.057   0.000   0.058
> > system.time (for (i in 1:1000000) r <- Lc$xyz)
>    user  system elapsed
>   0.251   0.000   0.251
> > system.time (for (i in 1:1000000) r <- df$xyz)
>    user  system elapsed
>   0.247   0.000   0.247
>
> So when accessing df$xyz, R-3.5.2 is 14 times slower than pqR-2019-01-25.
> (For a partial match, like df$xy, R-3.5.2 is 34 times slower.)
>
> I wasn't surprised that pqR was faster, but I didn't expect this big a
> difference. Then I remembered having seen a NEWS item from R-3.1.0:
>
> * Partial matching when using the $ operator _on data frames_ now
> throws a warning and may become defunct in the future. If partial
> matching is intended, replace foo$bar by foo[["bar", exact =
> FALSE]].
>
> and having looked at the code then:
>
> `$.data.frame` <- function(x,name) {
>   a <- x[[name]]
>   if (!is.null(a)) return(a)
>
>   a <- x[[name, exact=FALSE]]
>   if (!is.null(a)) warning("Name partially matched in data frame")
>   return(a)
> }
>
> I recall thinking at the time that this involved a pretty big
> performance hit, compared to letting the primitive $ operator do it,
> just to produce a warning. But it wasn't until now that I noticed
> this NEWS in R-3.1.1:
>
> * The warning when using partial matching with the $ operator on
> data frames is now only given when
> options("warnPartialMatchDollar") is TRUE.
>
> for which the code was changed to:
>
> `$.data.frame` <- function(x,name) {
>   a <- x[[name]]
>   if (!is.null(a)) return(a)
>
>   a <- x[[name, exact=FALSE]]
>   if (!is.null(a) && getOption("warnPartialMatchDollar", default=FALSE)) {
>     names <- names(x)
>     warning(gettextf("Partial match of '%s' to '%s' in data frame",
>                      name, names[pmatch(name, names)]))
>   }
>   return(a)
> }
>
> One can see the effect now when warnPartialMatchDollar is enabled:
>
> > options(warnPartialMatchDollar=TRUE)
> > Lc$xy
> [1] 11 12 13 14 15 16 17 18 19
> Warning message:
> In Lc$xy : partial match of 'xy' to 'xyz'
> > df$xy
> [1] 11 12 13 14 15 16 17 18 19
> Warning message:
> In `$.data.frame`(df, xy) : Partial match of 'xy' to 'xyz' in data frame
>
> So the only thing that slowing down accesses like df$xyz by a factor of
> seven achieves now is to add the words "in data frame" to the warning
> message (while making the earlier part of the message less intelligible).
>
> I think you might want to just delete the definition of $.data.frame,
> reverting to the situation before R-3.1.0.
I imagine the cause is that the list version is done in C code rather
than R code (i.e. there's no R function `$.list`). So an alternative
solution would be to also implement `$.data.frame` in the underlying C
code. This won't be quite as fast (it needs that test for NULL), but
should be close in the full match case.
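
For anyone who wants to check the dispatch cost on their own build, here is a minimal sketch that reuses the setup from the quoted message. It compares df$xyz with .subset2(df, "xyz"), the base R function that extracts a list element without method dispatch. The iteration count n is my own choice (smaller than the 1e6 used above), and the timings will depend on the R version and machine, so no expected numbers are given.

```r
# Setup mirrors the example in the quoted message.
L  <- list(abc = 1:9, xyz = 11:19)
df <- data.frame(L)

# On a full match, all of these return the same column.
stopifnot(identical(df$xyz, L$xyz),
          identical(df[["xyz"]], L$xyz),
          identical(.subset2(df, "xyz"), L$xyz),
          identical(unclass(df)$xyz, L$xyz))

# n is an arbitrary choice to keep the run short (the original used 1e6).
n <- 100000L

# `$` on a data frame: goes through S3 dispatch if a `$.data.frame` method exists.
t_dollar  <- system.time(for (i in seq_len(n)) r <- df$xyz)

# .subset2 skips method dispatch and works on the underlying list.
t_subset2 <- system.time(for (i in seq_len(n)) r <- .subset2(df, "xyz"))

# Elapsed seconds for each form; values vary by R version and hardware.
print(c(dollar = t_dollar[["elapsed"]], subset2 = t_subset2[["elapsed"]]))
```

Note that .subset2 uses exact matching by default, so it is only a stand-in for the full match case discussed here, not for partial matches such as df$xy.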
Duncan Murdoch