[Rd] Inefficiency in df$col

Mon Feb 4 02:41:16 CET 2019

On 03/02/2019 12:04 p.m., Radford Neal wrote:
> While doing some performance testing with the new version of pqR (see
> pqR-project.org), I've encountered an extreme, and quite unnecessary,
> inefficiency in the current R Core implementation of R, which I think
> you might want to correct.
> 
> The inefficiency is in access to columns of a data frame, as in
> expressions such as df$col[i], which I think are very common (the
> alternatives of df[i,"col"] and df[["col"]][i] are, I think, less
> common).
> 
> Here is the setup for an example showing the issue:
> 
>    L <- list (abc=1:9, xyz=11:19)
>    Lc <- L; class(Lc) <- "glub"
>    df <- data.frame(L)
> 
> And here are some times for R-3.5.2 (r-devel of 2019-02-01 is much
> the same):
> 
>    > system.time (for (i in 1:1000000) r <- L$xyz)
>       user  system elapsed
>      0.086   0.004   0.089
>    > system.time (for (i in 1:1000000) r <- Lc$xyz)
>       user  system elapsed
>      0.494   0.000   0.495
>    > system.time (for (i in 1:1000000) r <- df$xyz)
>       user  system elapsed
>      3.425   0.000   3.426
> 
> So accessing a column of a data frame is 38 times slower than
> accessing a list element (which is what happens in the underlying
> implementation of a data frame), and 7 times slower than accessing an
> element of a list with a class attribute (for which it's necessary to
> check whether there is a $.glub method, which there isn't here).
> 
> For comparison, here are the times for pqR-2019-01-25:
> 
>    > system.time (for (i in 1:1000000) r <- L$xyz)
>       user  system elapsed
>      0.057   0.000   0.058
>    > system.time (for (i in 1:1000000) r <- Lc$xyz)
>       user  system elapsed
>      0.251   0.000   0.251
>    > system.time (for (i in 1:1000000) r <- df$xyz)
>       user  system elapsed
>      0.247   0.000   0.247
> 
> So when accessing df$xyz, R-3.5.2 is 14 times slower than pqR-2019-01-25.
> (For a partial match, like df$xy, R-3.5.2 is 34 times slower.)
> 
> I wasn't surprised that pqR was faster, but I didn't expect this big a
> difference.  Then I remembered having seen a NEWS item from R-3.1.0:
> 
>    * Partial matching when using the $ operator _on data frames_ now
>      throws a warning and may become defunct in the future. If partial
>      matching is intended, replace foo$bar by foo[["bar", exact =
>      FALSE]].
> 
> and having looked at the code then:
> 
>    `$.data.frame` <- function(x,name) {
>      a <- x[[name]]
>      if (!is.null(a)) return(a)
>    
>      a <- x[[name, exact=FALSE]]
>      if (!is.null(a)) warning("Name partially matched in data frame")
>      return(a)
>    }
> 
> I recall thinking at the time that this involved a pretty big
> performance hit, compared to letting the primitive $ operator do it,
> just to produce a warning.  But it wasn't until now that I noticed
> this NEWS in R-3.1.1:
> 
>    * The warning when using partial matching with the $ operator on
>      data frames is now only given when
>      options("warnPartialMatchDollar") is TRUE.
> 
> for which the code was changed to:
> 
>    `$.data.frame` <- function(x,name) {
>      a <- x[[name]]
>      if (!is.null(a)) return(a)
>    
>      a <- x[[name, exact=FALSE]]
>      if (!is.null(a) && getOption("warnPartialMatchDollar", default=FALSE)) {
>            names <- names(x)
>            warning(gettextf("Partial match of '%s' to '%s' in data frame",
>                                       name, names[pmatch(name, names)]))
>      }
>      return(a)
>    }
> 
> One can see the effect now when warnPartialMatchDollar is enabled:
> 
>    > options(warnPartialMatchDollar=TRUE)
>    > Lc$xy
>    [1] 11 12 13 14 15 16 17 18 19
>    Warning message:
>    In Lc$xy : partial match of 'xy' to 'xyz'
>    > df$xy
>    [1] 11 12 13 14 15 16 17 18 19
>    Warning message:
>    In `$.data.frame`(df, xy) : Partial match of 'xy' to 'xyz' in data frame
> 
> So the only thing that slowing down accesses like df$xyz by a factor of
> seven achieves now is to add the words "in data frame" to the warning
> message (while making the earlier part of the message less intelligible).
> 
> I think you might want to just delete the definition of $.data.frame,
> reverting to the situation before R-3.1.0.
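
As a rough check on that suggestion, here is a minimal sketch (base R only,
reusing the L/df setup from the example above) comparing what df$name returns
with what the primitive `$` returns on the underlying list. The column name
"nope" is just a placeholder for a name that matches nothing. This only
compares return values with the warning option off; it is not an audit of
every case.

```r
L  <- list(abc = 1:9, xyz = 11:19)
df <- data.frame(L)

# Turn the partial-match warning off so only return values are compared.
old <- options(warnPartialMatchDollar = FALSE)

# unclass() leaves a plain list, so `$` on it is the primitive with no
# S3 method involved.
u <- unclass(df)

stopifnot(
  identical(df$xyz, u$xyz),            # exact match
  identical(df$xy,  u$xy),             # partial match ("xy" -> "xyz")
  is.null(df$nope), is.null(u$nope)    # no match: NULL in both cases
)

options(old)  # restore the previous setting
```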

I imagine the cause is that the list version is done in C code rather 
than R code (i.e. there's no R function `$.list`).  So an alternative 
solution would be to also implement `$.data.frame` in the underlying C 
code.  This won't be quite as fast (it needs that test for NULL), but 
should be close in the full match case.
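
One way to see how much of the time goes to dispatch rather than to the
extraction itself is to time df$xyz against .subset2(), which extracts
without dispatching. This is only a sketch: the iteration count n is my own
choice (smaller than the 1e6 used above), and timings will differ between
machines and R versions, so none are shown.

```r
# Setup copied from the example quoted above.
L  <- list(abc = 1:9, xyz = 11:19)
df <- data.frame(L)

n <- 100000L  # assumed iteration count, chosen to keep the run short

# Ordinary access: goes through S3 dispatch on the class of df.
t_dollar <- system.time(for (i in seq_len(n)) r1 <- df$xyz)

# .subset2() is the non-dispatching form of `[[` (exact matching only).
t_subset2 <- system.time(for (i in seq_len(n)) r2 <- .subset2(df, "xyz"))

# Both routes return the same column; only the cost differs.
stopifnot(identical(r1, r2))

# Elapsed seconds for each variant.
print(c(dollar  = t_dollar[["elapsed"]],
        subset2 = t_subset2[["elapsed"]]))
```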

Duncan Murdoch