[Rd] [ subscripting sometimes loses names (PR#8192)

Sun Oct 9 21:04:44 CEST 2005

--rwEMma7ioTxnRzrJ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

R, like recent versions of S-Plus, sometimes - but not always - loses
names when subscripting objects with "[".  (Earlier versions of S and
S-Plus had the correct, name-preserving behavior.)  This seems bad, it
would be better to remove names only by explicit request, not as an
accidental side-effect of some (but not all) subscripting operations.

This issue was also discusses back in 2001 on the S-News list:

  http://www.biostat.wustl.edu/archives/html/s-news/2001-09/msg00020.html

The attached file, "fix-names.s", is also available here:

  http://www.piskorski.com/R/patches/fix-names.s

It includes:

1. The function dtk.test.brace.names(), which demonstrates name losing
problem, and can automatically report which test cases pass/fail, etc.

2. Wrappers for the "[" and "[.data.frame" functions which fix the
losing names problem for all the cases I've tried.

Note that dtk.test.brace.names(T) will always run all its test cases
and return their output for human inspection.  However, its checks to
see whether each test passes or fails only work correctly with the
patched all.equal() in PR#8191.

My coworkers and I have been using these wrapper functions for ALL
code we run for many months now, with no problems so far.  However,
there are probably some cases we don't use, like objects with S4
classes, which don't work right with these wrappers.

I assume the R core team would NOT want to use these wrapper
functions, but would instead prefer to change the underlying code
directly.  However, I offer them as an example of one way to achieve
what we believe to be the correct name-preserving behavior in R.

I would appreciate any suggestions on how to better implement this
name-preserving behavior for all R subscripting operations.

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/

--rwEMma7ioTxnRzrJ
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="fix-names.s"

dtk.null <- function(...) {}

#
# Fix loss of dimnames when subcripting with "[":
#
# According to Gary Sabot <gary at sabot.com>, S-Plus originally had the
# correct name-preserving behavior we want.  Then in 1996 Insightful
# broke that in Splus 3.4, which Gary fixed for his own use.  In 2001,
# Splus 6.0 broke something in Gary's fix, he posted questions to the
# s-news list, generating discussion, including some shock that anyone
# could think that arbitrarily losing dimnames made sense:
# 
#   http://www.biostat.wustl.edu/archives/html/s-news/2001-09/msg00020.html
#
# Unfortunately R seems to mimic much of the more recent buggy S-Plus
# behavior!  Thus our patches to both S-Plus and R below.  To test
# that they do the right thing, run dtk.test.brace.names().
#
# Note that currently these patches wrap the stock subscripting
# functions, they do NOT replace them.  TODO: Investigate fixing the
# stock implementation instead, especially for R.
#
# --atp at piskorski.com, 2005/09/27 17:03 EDT
#

if (!.R.) { # For S-Plus:

   # First make sure that if you run this twice, you still get the
   # real original function:
   data.frame.original.fcn <- get("[.data.frame", where="splus")

   # For S-Plus (at least version 6.2.1) we do not need to override
   # the "[" function, it already does the right thing.
   # --atp at piskorski.com, 2005/07/01 10:11 EDT

   "[.data.frame" <- function(x ,... ,drop=T) {

      # TODO: Problem: New lm() is dying with my patch, because it indexes something
      # and produces something that looks like it has 2 cols, but ncol returns 1.
      # I can detect/avoid this case since it has class = c("model.frame", "data.frame")
      # see test case below where I figured out this issue in case it needs further work.

      class.x <- class(x)
      caller <- sys.call(sys.parent())[[2]]
      if (length(class.x)==1 && class.x=="data.frame" &&
          (mode(caller) != "name" || (caller != "value"))) {
         # If caller is a name and it is "value", then it is the
         # lhs case that we just want the original fcn to handle:

         result <- data.frame.original.fcn(x, ..., drop=F)

         if (drop && length(ncol(result) > 0) && ncol(result)==1) {
            save.names <- dimnames(result)[[1]]
            #this approach works for factors too
            result <- result[[1]]
            names(result) <- save.names

            # TODO: Unfortunately still broken for objects with new
            # style classes, since it does not distinguish among
            # methods that have or do not have a getnames method.
            # library(missing) is an example: The multiple imputations
            # on an object get lost if subscripted with this function.
         } else {
            if (!missing(drop) && drop && length(nrow(result)) > 0 && nrow(result)==1) {
               #replicate documented behavior of [.data.frame: drop=T acts
               #differently then missing drop arg for this case!
               result <- as.list(result)
            } 
         }
      } else {
         result <- data.frame.original.fcn(x, ..., drop=F)     
      }
      result
   }

} else { # For R:
   # First make sure that if you run this twice, you still get the
   # real original function:

   # Also remove the obnoxious "drop argument will be ignored" warning
   # entirely from the function.  I would like to regsub out the whole
   # warning() call, but I can't seem to get that to work.  So, just
   # replace the first warning() call with a call to our dtk.null()
   # function which does nothing.  Fortunately, the warning() call we
   # want to get rid of is indeed the first (actually the only) one:
   # --atp at piskorski.com, 2005/07/01 17:53 EDT

   brace.original.fcn <- get("[",pos="package:base")
   data.frame.original.fcn.0 <- get("[.data.frame",pos="package:base")
   data.frame.original.fcn <- data.frame.original.fcn.0
   body(data.frame.original.fcn) <-
      parse(text=sub('warning(..?drop argument will be ignored..?)'  ,'dtk.null()'
              ,deparse(body(data.frame.original.fcn.0)) ,ignore.case=T))

   # For R (at least version 2.1.0) we need to override BOTH the
   # "[.data.frame" and "[" functions.
   # --atp at piskorski.com, 2005/07/01 10:11 EDT

   "[.data.frame" <- function(x ,i ,j ,... ,drop=T) {

      # The stock R default value for the drop arg is:
      #   drop=(if(missing(i)) TRUE else length(names(x)) == 1)
      # However, that DOES cause certain differences from S-Plus, so
      # we do NOT use it.  --atp at piskorski.com, 2005/07/01 13:18 EDT

      # TODO: Does above S-Plus problem with lm() also apply here?
      # --atp at piskorski.com, 2005/07/01 10:51 EDT

      class.x <- class(x)
      caller <- sys.call(sys.parent())[[2]]
      if (length(class.x)==1 && class.x=="data.frame" &&
          (mode(caller) != "name" || (caller != "value"))) {
         # If caller is a name and it is "value", then it is the
         # lhs case that we just want the original fcn to handle:

         code <- 'data.frame.original.fcn(x,'
         if (!missing(i))    code <- paste(code ,'i' ,sep="")
         if (length(dim(x)) > 1 && (missing(i) || length(dim(i)) <= 1))
            code <- paste(code ,',' ,sep="")
         if (!missing(j))    code <- paste(code ,'j' ,sep="")
         if (!missing(...))  code <- paste(code ,',...' ,sep="")
         code <- paste(code ,',drop=F' ,sep="")
         code <- paste(code ,')' ,sep="")
         #cat("Debug: code to eval:  ") ; print(code)
         result <- eval(parse(text=code))

         if (drop && length(ncol(result) > 0) && ncol(result)==1) {
            save.names <- dimnames(result)[[1]]
            #this approach works for factors too
            result <- result[[1]]
            names(result) <- save.names

            # TODO: Unfortunately still broken for objects with new
            # style classes, since it does not distinguish among
            # methods that have or do not have a getnames method.
            # library(missing) is an example: The multiple imputations
            # on an object get lost if subscripted with this function.
         } else {
            if (!missing(drop) && drop && length(nrow(result)) > 0 && nrow(result)==1) {
               #replicate documented behavior of [.data.frame: drop=T acts
               #differently then missing drop arg for this case!
               result <- as.list(result)
            } 
         }
      } else {
         if (missing(i))
            result <- data.frame.original.fcn(x ,... ,drop=F)
         else if (missing(j))
            result <- data.frame.original.fcn(x ,i ,... ,drop=F)
         else
            result <- data.frame.original.fcn(x ,i ,j ,... ,drop=F)
      }
      result
   }

   # R has this problem with NA names:
   # 
   #   # S-Plus 6.2.1:
   #   > foo <- c("a"=1,"b"=2,"c"=3)
   #   > foo[c("a","c","atp")]
   #    a c atp 
   #    1 3  NA
   # 
   #   # R 2.0.0, or 2.1.0:
   #   > foo <- c("a"=1,"b"=2,"c"=3)
   #   > foo[c("a","c","atp")]
   #      a    c <NA> 
   #      1    3   NA 
   #
   # This is very very bad, it causes soft "you just get different
   # results" bugs when running our nag.optimize() from R, and
   # probably in many other places in our code as well.
   # --atp at piskorski.com, 2005/06/30 17:40 EDT

   "[" <- function(x ,i ,j ,... ,drop=TRUE) {

      # We MUST be able to blank positional arguments, and it seems
      # that do.call() gives us no way to do that, so use eval():
      # --atp at piskorski.com, 2005/07/01 16:00 EDT

      code <- 'brace.original.fcn(x,'
      if (!missing(i))    code <- paste(code ,'i' ,sep="")
      if (length(dim(x)) > 1 && (missing(i) || length(dim(i)) <= 1))
         code <- paste(code ,',' ,sep="")
      if (!missing(j))    code <- paste(code ,'j' ,sep="")
      if (!missing(...))  code <- paste(code ,',...' ,sep="")
      if (!missing(drop)) code <- paste(code ,',drop=drop' ,sep="")
      code <- paste(code ,')' ,sep="")
      #cat("Debug: code to eval:  ") ; print(code)
      result <- eval(parse(text=code))

      # This fix is being really specific, it handles this:
      #      x[i]
      # But does not try to handle this:
      #      x[i, j, ... , drop = TRUE]
      # --atp at piskorski.com, 2005/06/30 17:40 EDT

      if (is.null(attributes(x)$class) && length(list(...)) == 0) {
         # Just a simple index operation on a vector or list:
         new.names <- names(result)
         if (length(new.names) > 0) {
            # Names are present:
            bad.names <- is.na(new.names)
            if (any(bad.names)) {
               # Some names are NA, so fix them:
               names(result)[bad.names] <- brace.original.fcn(i ,bad.names)
            }
         }
      }
      result
   }
}

dtk.test.brace.names <- function
(return.results.p=F  ,only="all") {
   # Some simple test cases for our patched "[.data.frame" and "[" functions:

   # Note that return.results.p=T will always work correctly, but
   # automatically deciding if those results are CORRECT will ONLY
   # work if using our patched all.equal() from PR#8191, which is
   # available here:
   #   http://r-bugs.biostat.ku.dk/cgi-bin/R/incoming?id=8191
   # --atp at piskorski.com, 2005/10/09 13:23 EDT

   ### Examples of correct output for the cases below:
   # > (c("a"=1,"b"=2,"c"=3)[c("a","c","no")])
   #  a c no 
   #  1 3 NA
   # > (cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6))[c(1,2),2])
   #  a b 
   #  4 5
   # > (cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6))[c(3,2,1),2])
   #  c b a 
   #  6 5 4
   # > (cbind(col1=c(a=1,b=2) ,col2=c(3:4))[,2])
   #  a b 
   #  3 4
   # > (cbind(col1=c(a=1,b=2) ,col2=c(3:4))[2,])
   #  col1 col2 
   #     2    4
   # > (cbind(col1=c(a=1,b=2) ,col2=c(3:4))[2,,drop=T])
   #  col1 col2 
   #     2    4
   # > (cbind(col1=c(a=1,b=2) ,col2=c(3:4))[2,,drop=F])
   #   col1 col2 
   # b    2    4

   mm <- matrix(1:25 ,nrow=5 ,ncol=5 ,dimnames=list(letters[1:5],letters[6:10]))
   idx <- 1:(dim(mm)[1])
   mm.df <- as.data.frame(mm)

   results <-
      list( "vec.1"=list((c("a"=1,"b"=2,"c"=3)[c("a","c","no")])  ,c(a=1 ,c=3 ,no=NA) )
           ,"diag.1"=list(   mm[cbind(idx,idx)]  ,c(1,7,13,19,25) )
           ,"diag.2"=list(mm.df[cbind(idx,idx)]  ,c(1,7,13,19,25) )
           ,"df.a.1"=list((           cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6)) [c(1,2),2])    ,c(a=4,b=5)                    ) 
           ,"df.b.1"=list((data.frame(cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6)))[c(1,2),2])    ,c(a=4,b=5)                    ) 
           ,"df.a.2"=list((           cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6)) [c(3,2,1),2])  ,c(c=6,b=5,a=4)                )
           ,"df.b.2"=list((data.frame(cbind(col1=c(a=1,b=2,c=3) ,col2=c(4:6)))[c(3,2,1),2])  ,c(c=6,b=5,a=4)                )
           ,"df.a.3"=list((           cbind(col1=c(a=1,b=2) ,col2=c(3:4)) [,2])              ,c(a=3,b=4)                    )
           ,"df.b.3"=list((data.frame(cbind(col1=c(a=1,b=2) ,col2=c(3:4)))[,2])              ,c(a=3,b=4)                    )
           ,"df.a.4"=list((           cbind(col1=c(a=1,b=2) ,col2=c(3:4)) [2,])              ,c(col1=2,col2=4)                     )
           ,"df.b.4"=list((data.frame(cbind(col1=c(a=1,b=2) ,col2=c(3:4)))[2,])              ,data.frame(col1=c(b=2) ,col2=c(b=4)) )
           ,"df.a.5"=list((           cbind(col1=c(a=1,b=2) ,col2=c(3:4)) [2,,drop=T])       ,c(   col1=2,col2=4)           )
           ,"df.b.5"=list((data.frame(cbind(col1=c(a=1,b=2) ,col2=c(3:4)))[2,,drop=T])       ,list(col1=2,col2=4)           )
           ,"df.a.6"=list((           cbind(col1=c(a=1,b=2) ,col2=c(3:4)) [2,,drop=F])       ,cbind(col1=c(b=2) ,col2=c(4)) )
           ,"df.b.6"=list((data.frame(cbind(col1=c(a=1,b=2) ,col2=c(3:4)))[2,,drop=F])       ,data.frame(cbind(col1=c(b=2) ,col2=c(4))) )
           )

   # TODO: I don't know how to check these cases:
   # --atp at piskorski.com, 2005/07/01 10:34 EDT

   # > (data.frame(a=letters[1:3] ,b=2:4)[,1])
   # [1] a b c
   # > structure(.Data = c(1, 2, 3)  ,levels = c("a", "b", "c")  ,class = "factor"  ,names = c("1", "2", "3"))
   # [1] a b c
   # > dput(data.frame(a=letters[1:3], b=2:4)[,1])
   # structure(.Data = c(1, 2, 3)
   # , levels = c("a", "b", "c")
   # , class = "factor"
   # , names = c("1", "2", "3")
   # )

   # > library("missing")
   # > data.frame.original.fcn(cholesterolImpExample,,3)
   #  [1]  NA  NA  NA  NA  NA  NA  NA  NA  NA 156 242 256 142 216 248 168 236 200 264 264
   # [21] 188 182 294 214 198 256 280 204
   # 
   # miVariable object with 5 sets of multiple imputations 
   #                  1        2        3        4        5 
   #  2.chol14 190.7459 209.3937 220.1499 213.2871 218.0670
   #  4.chol14 156.6425 101.6117 173.6432 129.3747 140.9941
   #  5.chol14 176.8996 257.0628 157.2997 227.3769 173.9429
   # 10.chol14 255.8360 275.2017 284.1347 257.6721 289.8643
   # 13.chol14 252.9045 209.9261 257.6222 228.4710 270.7223
   # 16.chol14 298.6389 252.4810 259.4355 332.3279 287.4087
   # 18.chol14 180.9732 204.0033 194.7192 199.9087 200.9603
   # 23.chol14 219.6375 216.8488 125.2673 213.3161 263.5186
   # 25.chol14 268.8816 289.2164 229.5734 273.3548 240.9536
   # > 
   # > cholesterolImpExample[,3]
   #   2  4  5 10 13 16 18 23 25   1   3   6   7   8   9  11  12  14  15  17  19  20  21 
   #  NA NA NA NA NA NA NA NA NA 156 242 256 142 216 248 168 236 200 264 264 188 182 294
   # 
   #   22  24  26  27  28 
   #  214 198 256 280 204
   # 
   # miVariable object with 5 sets of multiple imputations 
   #                  1        2        3        4        5 
   #  2.chol14 190.7459 209.3937 220.1499 213.2871 218.0670
   #  4.chol14 156.6425 101.6117 173.6432 129.3747 140.9941
   #  5.chol14 176.8996 257.0628 157.2997 227.3769 173.9429
   # 10.chol14 255.8360 275.2017 284.1347 257.6721 289.8643
   # 13.chol14 252.9045 209.9261 257.6222 228.4710 270.7223
   # 16.chol14 298.6389 252.4810 259.4355 332.3279 287.4087
   # 18.chol14 180.9732 204.0033 194.7192 199.9087 200.9603
   # 23.chol14 219.6375 216.8488 125.2673 213.3161 263.5186
   # 25.chol14 268.8816 289.2164 229.5734 273.3548 240.9536
   # > 

   tt <- rep(F ,times=length(results)) ; names(tt) <- names(results)
   for (jj in 1:length(results)) {
      tmp <- all.equal(results[[jj]][[1]] ,results[[jj]][[2]])
      tt[jj] <- (length(tmp) == 1 && tmp == T)
      results[[jj]][3] <- list("pass"=as.vector(tt[jj]))
   }

   if (return.results.p) {
      if (only == "bad")
         results[!tt]
      else if (only == "good")
         results[tt]
      else results
   } else if (all(tt)) T else F
}

--rwEMma7ioTxnRzrJ--