[Rd] Partial matching performance in data frame rownames using [
Ivan Krylov
kry|ov@r00t @end|ng |rom gm@||@com
Sat Dec 16 10:48:42 CET 2023
On Wed, 13 Dec 2023 09:04:18 +0100
Hilmar Berger via R-devel <r-devel using r-project.org> wrote:
> Still, I feel that default partial matching cripples the functionality
> of data.frame for larger tables.
Changing the default now would require a long deprecation cycle to give
everyone who uses `[.data.frame` and relies on partial matching
(whether they know it or not) enough time to adjust.
Still, adding an argument feels like a small change: edit
https://svn.r-project.org/R/trunk/src/library/base/R/dataframe.R and
add a condition before calling pmatch(). Adjust the warning() for named
arguments. Don't forget to document the new argument in the man page at
https://svn.r-project.org/R/trunk/src/library/base/man/Extract.data.frame.Rd
Index: src/library/base/R/dataframe.R
===================================================================
--- src/library/base/R/dataframe.R (revision 85664)
+++ src/library/base/R/dataframe.R (working copy)
@@ -591,14 +591,14 @@
### These are a little less general than S
`[.data.frame` <-
- function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1)
+ function(x, i, j, drop = if(missing(i)) TRUE else length(cols) == 1, pmatch.rows = TRUE)
{
mdrop <- missing(drop)
Narg <- nargs() - !mdrop # number of arg from x,i,j that were specified
has.j <- !missing(j)
- if(!all(names(sys.call()) %in% c("", "drop"))
+ if(!all(names(sys.call()) %in% c("", "drop", "pmatch.rows"))
&& !isS4(x)) # at least don't warn for callNextMethod!
- warning("named arguments other than 'drop' are discouraged")
+ warning("named arguments other than 'drop', 'pmatch.rows' are discouraged")
if(Narg < 3L) { # list-like indexing or matrix indexing
if(!mdrop) warning("'drop' argument will be ignored")
@@ -679,7 +679,11 @@
## for consistency with [, <length-1>]
if(is.character(i)) {
rows <- attr(xx, "row.names")
- i <- pmatch(i, rows, duplicates.ok = TRUE)
+ i <- if (pmatch.rows) {
+ pmatch(i, rows, duplicates.ok = TRUE)
+ } else {
+ match(i, rows)
+ }
}
## need to figure which col was selected:
## cannot use .subset2 directly as that may
@@ -699,7 +703,11 @@
# as this can be expensive.
if(is.character(i)) {
rows <- attr(xx, "row.names")
- i <- pmatch(i, rows, duplicates.ok = TRUE)
+ i <- if (pmatch.rows) {
+ pmatch(i, rows, duplicates.ok = TRUE)
+ } else {
+ match(i, rows)
+ }
}
for(j in seq_along(x)) {
xj <- xx[[ sxx[j] ]]
Index: src/library/base/man/Extract.data.frame.Rd
===================================================================
--- src/library/base/man/Extract.data.frame.Rd (revision 85664)
+++ src/library/base/man/Extract.data.frame.Rd (working copy)
@@ -15,7 +15,7 @@
Extract or replace subsets of data frames.
}
\usage{
-\method{[}{data.frame}(x, i, j, drop = )
+\method{[}{data.frame}(x, i, j, drop =, pmatch.rows = TRUE)
\method{[}{data.frame}(x, i, j) <- value
\method{[[}{data.frame}(x, ..., exact = TRUE)
\method{[[}{data.frame}(x, i, j) <- value
@@ -45,6 +45,9 @@
column is selected.}
\item{exact}{logical: see \code{\link{[}}, and applies to column names.}
+
+ \item{pmatch.rows}{logical: whether to perform partial matching on
+ row names in case \code{i} is a character vector.}
}
\details{
Data frames can be indexed in several modes. When \code{[} and
system.time({r <- d1[q2,, drop=FALSE, pmatch.rows = FALSE]})
# user system elapsed
# 0.478 0.004 0.482
Unfortunately, that would be only the beginning. The prose in the whole
?`[.data.frame` would have to be updated; the new behaviour would have
to be tested in tests/**.R. There may be very good reasons why named
arguments to `[` other than drop= are discouraged for data.frames. I'm
afraid I lack the whole-project view to consider whether such an
addition would be safe.
--
Best regards,
Ivan
More information about the R-devel
mailing list