[Rd] [.data.frame speedup

Sat Jul 5 21:16:09 CEST 2008

>>>>> "TH" == Tim Hesterberg <timhesterberg at gmail.com>
>>>>>     on Thu, 3 Jul 2008 17:04:24 -0700 writes:

    TH> I made a couple of changes from the previous version:
    TH> - don't use functions anyMissing or notSorted (which aren't in base R)
    TH> - don't check for dup.row.names attribute (need to modify other functions
    TH> before that is useful)
    TH> I have not tested this with a wide variety of inputs; I'm assuming that
    TH> you have some regression tests.

yes, we do (and they are part of the R source).

    TH> Here are the file differences.  Let me know if you'd like a different
    TH> format.


    TH> $ diff -c dataframe.R dataframe2.R
    TH> *** dataframe.R    Thu Jul  3 15:48:12 2008
    TH> --- dataframe2.R    Thu Jul  3 16:36:46 2008
    ...................

context diff is fine (I typically use '-u' but that's not important).

From your patch,
I've currently ended up with this "patch":

--- dataframe.R.~19~	2008-07-03 02:13:21.000163000 +0200
+++ dataframe.R	2008-07-05 13:02:33.000029000 +0200
@@ -579,14 +579,18 @@
         ## row names might have NAs.
         if(is.null(rows)) rows <- attr(xx, "row.names")
         rows <- rows[i]
-	if((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
-	    ## both will coerce integer 'rows' to character:
-	    if (!dup && is.character(rows)) dup <- "NA" %in% rows
-	    if(ina)
-		rows[is.na(rows)] <- "NA"
-	    if(dup)
-		rows <- make.unique(as.character(rows))
-	}
+
+	## Do not want to check for duplicates if don't need to
+	noDuplicateRowNames <-
+	    (is.logical(i) || (li <- length(i)) < 2 ||
+	     (is.numeric(i) && (min(0, i, na.rm=TRUE) < 0 ||
+			       (!any(is.na(i)) && all(i[-li] < i[-1L])))))
+	## TODO: is.unsorted(., strict=FALSE/TRUE)
+	if(any(is.na(rows)))
+	    rows[is.na(rows)] <- "NA" # coerces to character
+	if(!noDuplicateRowNames && any(duplicated(rows)))
+	    rows <- make.unique(as.character(rows)) # coerces to character
+
         ## new in 1.8.0  -- might have duplicate columns
         if(any(duplicated(nm <- names(x)))) names(x) <- make.unique(nm)
         if(is.null(rows)) rows <- attr(xx, "row.names")[i]
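
The condition assigned to 'noDuplicateRowNames' in the patch above can be
read more easily as a standalone function.  The following is only a
minimal sketch: the function name 'no_dup_possible' is hypothetical (it is
not in the R sources), it assumes the original row names are unique, and
it has not been checked against edge cases such as NA or out-of-range
indices.

```r
## Hypothetical helper (name chosen for illustration, not part of R):
## TRUE when subsetting rows by 'i' is assumed unable to introduce
## duplicated row names, so that duplicated(rows) could be skipped.
no_dup_possible <- function(i) {
    is.logical(i) || (li <- length(i)) < 2 ||
        (is.numeric(i) &&
         ## negative indices only drop rows
         (min(0, i, na.rm = TRUE) < 0 ||
          ## strictly increasing indices select each row at most once
          (!any(is.na(i)) && all(i[-li] < i[-1L]))))
}

no_dup_possible(c(TRUE, FALSE))  # logical index
no_dup_possible(-(1:3))          # negative index
no_dup_possible(c(1, 3, 7))      # strictly increasing index
no_dup_possible(c(2, 2, 5))      # repeated index: duplicated() still needed
no_dup_possible(c("a", "a"))     # character index: duplicated() still needed
```

The first three calls take the shortcut; the last two fall through to the
any(duplicated(rows)) check.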


    TH> Here's some code for testing, and timings

    .................

I've rationalized (wrote functions) and slightly extended your
tests; they are now public at
   ftp://ftp.stat.math.ethz.ch/U/maechler/R/data.frame-TH-ex.R

Unfortunately, they show that "the speedup" is negative in some
cases, e.g. for the 'i <- 1:n' case for n <- 1000 or 10000.
I've replicated every system.time() 12 times, to get a sense of
the precision, and that's still the conclusion.

In other words, your proposed  'noDuplicateRowNames'
computations are sometimes more expensive than the duplicated(.)
call they replace.
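
The kind of comparison meant here can be sketched in isolation.  This is
not the test file mentioned above, only an illustrative micro-comparison:
'time_reps', the inner loop count and the stand-in character row names are
all assumptions, and the timings will depend on the machine and the R
version.

```r
## Illustrative only: compare the cost of the index check with the cost
## of the duplicated() call it is meant to avoid, for i <- 1:n.
n <- 10000
i <- 1:n
rows <- as.character(i)   # assumed stand-in for unique row names

## Hypothetical helper: elapsed seconds for each of 'reps' runs,
## where one run calls f() 'inner' times to get a measurable duration.
time_reps <- function(f, reps = 12, inner = 200) {
    vapply(seq_len(reps),
           function(r) system.time(for (k in seq_len(inner)) f())[["elapsed"]],
           numeric(1))
}

## the numeric part of the 'noDuplicateRowNames' condition
t_check <- time_reps(function() {
    li <- length(i)
    is.numeric(i) && (min(0, i, na.rm = TRUE) < 0 ||
                      (!any(is.na(i)) && all(i[-li] < i[-1L])))
})
## the call that the check tries to avoid
t_dup <- time_reps(function() any(duplicated(rows)))

summary(t_check)
summary(t_dup)
```

Replicating each measurement, as done with the 12 runs here, gives a
sense of the spread before the two summaries are compared.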

To me, that means that the whole exercise was probably in vain:
We are not making the code more complicated if it's not a
uniform improvement.  Too bad.....

Martin Maechler, ETH Zurich