[Rd] Partial matching performance in data frame rownames using [

Toby Hocking tdhock5 @end|ng |rom gm@||@com
Tue Dec 19 21:57:31 CET 2023


Hi Hilmar and Ivan,
I have used your code examples to write a blog post about this topic,
which has figures that show the asymptotic time complexity of the
various approaches,
https://tdhock.github.io/blog/2023/df-partial-match/
The asymptotic complexity of partial matching appears to be quadratic
O(N^2) whereas the other approaches are asymptotically faster: linear
O(N) or log-linear O(N log N).
I think that accepting Ivan's pmatch.rows patch would add un-necessary
complexity to base R, since base R already provides an efficient
work-around, d1[match(q1,rownames(d1)),]
I do think the CheckUserInterrupt patch is a good idea, though.
Best,
Toby

On Sat, Dec 16, 2023 at 2:49 AM Ivan Krylov <krylov.r00t using gmail.com> wrote:
>
> On Wed, 13 Dec 2023 09:04:18 +0100
> Hilmar Berger via R-devel <r-devel using r-project.org> wrote:
>
> > Still, I feel that default partial matching cripples the functionality
> > of data.frame for larger tables.
>
> Changing the default now would require a long deprecation cycle to give
> everyone who uses `[.data.frame` and relies on partial matching
> (whether they know it or not) enough time to adjust.
>
> Still, adding an argument feels like a small change: edit
> https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and
> add a condition before calling pmatch(). Adjust the warning() for named
> arguments. Don't forget to document the new argument in the man page at
> https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd
>
> Index: src/library/base/R/dataframe.R
> ===================================================================
> --- src/library/base/R/dataframe.R      (revision 85664)
> +++ src/library/base/R/dataframe.R      (working copy)
> @@ -591,14 +591,14 @@
>  ###  These are a little less general than S
>
>  `[.data.frame` <-
> -    function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1)
> +    function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE)
>  {
>      mdrop <- missing(drop)
>      Narg <- nargs() - !mdrop  # number of arg from x,i,j that were specified
>      has.j <- !missing(j)
> -    if(!all(names(sys.call()) %in% c("", "drop"))
> +    if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows"))
>         && !isS4(x)) # at least don't warn for callNextMethod!
> -        warning("named arguments other than 'drop' are discouraged")
> +        warning("named arguments other than 'drop', 'pmatch.rows' are discouraged")
>
>      if(Narg < 3L) {  # list-like indexing or matrix indexing
>          if(!mdrop) warning("'drop' argument will be ignored")
> @@ -679,7 +679,11 @@
>              ## for consistency with [, <length-1>]
>              if(is.character(i)) {
>                  rows <- attr(xx, "row.names")
> -                i <- pmatch(i, rows, duplicates.ok = TRUE)
> +                i <- if (pmatch.rows) {
> +                    pmatch(i, rows, duplicates.ok = TRUE)
> +                } else {
> +                    match(i, rows)
> +                }
>              }
>              ## need to figure which col was selected:
>              ## cannot use .subset2 directly as that may
> @@ -699,7 +703,11 @@
>                   # as this can be expensive.
>      if(is.character(i)) {
>          rows <- attr(xx, "row.names")
> -        i <- pmatch(i, rows, duplicates.ok = TRUE)
> +        i <- if (pmatch.rows) {
> +            pmatch(i, rows, duplicates.ok = TRUE)
> +        } else {
> +            match(i, rows)
> +        }
>      }
>      for(j in seq_along(x)) {
>          xj <- xx[[ sxx[j] ]]
> Index: src/library/base/man/Extract.data.frame.Rd
> ===================================================================
> --- src/library/base/man/Extract.data.frame.Rd  (revision 85664)
> +++ src/library/base/man/Extract.data.frame.Rd  (working copy)
> @@ -15,7 +15,7 @@
>    Extract or replace subsets of data frames.
>  }
>  \usage{
> -\method{[}{data.frame}(x, i, j, drop = )
> +\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE)
>  \method{[}{data.frame}(x, i, j) <- value
>  \method{[[}{data.frame}(x, ..., exact = TRUE)
>  \method{[[}{data.frame}(x, i, j) <- value
> @@ -45,6 +45,9 @@
>      column is selected.}
>
>     \item{exact}{logical: see \code{\link{[}}, and applies to column names.}
> +
> +   \item{pmatch.rows}{logical: whether to perform partial matching on
> +     row names in case \code{i} is a character vector.}
>  }
>  \details{
>    Data frames can be indexed in several modes.  When \code{[} and
>
>
> system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]})
> #    user  system elapsed
> #   0.478   0.004   0.482
>
> Unfortunately, that would be only the beginning. The prose in the whole
> ?`[.data.frame` would have to be updated; the new behaviour would have
> to be tested in tests/**.R. There may be very good reasons why named
> arguments to `[` other than drop= are discouraged for data.frames. I'm
> afraid I lack the whole-project view to consider whether such an
> addition would be safe.
>
> --
> Best regards,
> Ivan
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



More information about the R-devel mailing list