[R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)

Fri Aug 3 22:33:59 CEST 2007

> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: Fri 8/3/2007 1:05 PM
> To: Steven McKinney
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
>  
> I've since seen your followup; a more detailed explanation may help.
> The path through the code for your argument list does not go where you 
> quoted, and there is a reason for it.
> 

Using a copy of "[.data.frame" with a browser() call inserted, I have
traced the flow of execution. (My copy, with the browser() call, is at
the end of this email.)

  > foo[, "FileName"]
  Called from: `[.data.frame`(foo, , "FileName")
  Browse[1]> n
  debug: mdrop <- missing(drop)
  Browse[1]> n
  debug: Narg <- nargs() - (!mdrop)
  Browse[1]> n
  debug: if (Narg < 3) {
      if (!mdrop) 
          warning("drop argument will be ignored")
      if (missing(i)) 
          return(x)
      if (is.matrix(i)) 
          return(as.matrix(x)[i])
      y <- NextMethod("[")
      cols <- names(y)
      if (!is.null(cols) && any(is.na(cols))) 
          stop("undefined columns selected")
      if (any(duplicated(cols))) 
          names(y) <- make.unique(cols)
      return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 
          0L)))
  }
  Browse[1]> n
  debug: if (missing(i)) {
      if (missing(j) && drop && length(x) == 1L) 
          return(.subset2(x, 1L))
      y <- if (missing(j)) 
          x
      else .subset(x, j)
      if (drop && length(y) == 1L) 
          return(.subset2(y, 1L))
      cols <- names(y)
      if (any(is.na(cols))) 
          stop("undefined columns selected")
      if (any(duplicated(cols))) 
          names(y) <- make.unique(cols)
      nrow <- .row_names_info(x, 2L)
      if (drop && !mdrop && nrow == 1L) 
          return(structure(y, class = NULL, row.names = NULL))
      else return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 
          0L)))
  }
  Browse[1]> n
  debug: if (missing(j) && drop && length(x) == 1L) return(.subset2(x, 
      1L))
  Browse[1]> n
  debug: y <- if (missing(j)) x else .subset(x, j)
  Browse[1]> n
  debug: if (drop && length(y) == 1L) return(.subset2(y, 1L))
  Browse[1]> n
  NULL
  > 
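
The trace skips the first block because Narg is 3 here, not 2: nargs()
counts positional arguments that are left blank. A minimal sketch of that
point follows. It assumes a stand-in function, count_args, which is a
hypothetical name and not part of base R; it only repeats the Narg
arithmetic from the code above.

```r
# count_args is a hypothetical stand-in with the same formals as
# `[.data.frame`; it returns the value that Narg would take.
count_args <- function(x, i, j, drop) {
  mdrop <- missing(drop)
  nargs() - (!mdrop)   # a supplied drop argument is not counted
}

foo <- data.frame(Filename = c("a", "b"))

n_single <- count_args(foo, "FileName")                  # like foo["FileName"]
n_empty  <- count_args(foo, , "FileName")                # like foo[, "FileName"]
n_drop   <- count_args(foo, , "FileName", drop = FALSE)  # drop supplied

print(c(single = n_single, empty = n_empty, drop = n_drop))
```

So foo["FileName"] enters the Narg < 3 block (which has its own name
check), while foo[, "FileName"] falls through to the missing(i) block.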

So `[.data.frame` is exiting after executing 
+         if (drop && length(y) == 1L) 
+             return(.subset2(y, 1L)) ## This returns a result before undefined columns check is done.  Is this intended?

Couldn't the error check
+         cols <- names(y)
+         if (any(is.na(cols))) 
+             stop("undefined columns selected")
be done before the above return()?

What would break if the error check on column names were done
before returning NULL for a misspelled column name?
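
To make the question concrete, here is a rough sketch of the reordering
I have in mind. It is not a patch to base R: subset_cols_checked is a
hypothetical name, and it copies only the missing(i) branch for the case
where j is supplied, with the name check moved ahead of the early return.

```r
# subset_cols_checked is a hypothetical helper, not base R. It mirrors the
# missing(i) branch of `[.data.frame` for a supplied j, but checks the
# column names before the drop = TRUE early return.
subset_cols_checked <- function(x, j, drop = TRUE) {
  y <- .subset(x, j)            # list-style subsetting, no method dispatch
  cols <- names(y)
  if (any(is.na(cols)))         # an unmatched name comes back as NA
    stop("undefined columns selected")
  if (drop && length(y) == 1L)
    return(.subset2(y, 1L))
  if (any(duplicated(cols)))
    names(y) <- make.unique(cols)
  structure(y, class = oldClass(x), row.names = .row_names_info(x, 0L))
}

foo <- data.frame(Filename = c("a", "b"))
ok  <- subset_cols_checked(foo, "Filename")                     # the column
bad <- try(subset_cols_checked(foo, "FileName"), silent = TRUE) # an error
```

With this ordering the misspelled name stops with "undefined columns
selected" instead of returning NULL.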

Why should

> foo[, "FileName"]
NULL

differ from

> foo[seq(nrow(foo)), "FileName"]
Error in `[.data.frame`(foo, seq(nrow(foo)), "FileName") : 
	undefined columns selected
> 

Thank you for your explanations.
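
In the meantime, a user-level guard along the following lines would catch
the misspelling. This is only a sketch: get_col is a hypothetical helper,
and the option name warn.undefined.colnames is the one I suggested in my
original post, not an option that R knows about.

```r
# get_col and the option warn.undefined.colnames are hypothetical;
# neither exists in base R.
get_col <- function(df, name) {
  if (!(name %in% names(df))) {
    if (isTRUE(getOption("warn.undefined.colnames", TRUE))) {
      warning(sprintf("column '%s' not found; available columns: %s",
                      name, paste(names(df), collapse = ", ")),
              call. = FALSE)
    }
    return(NULL)      # same value that foo[["FileName"]] gives
  }
  df[[name]]          # [[ matches names exactly for character input
}

foo <- data.frame(Filename = c("a", "b"))
x <- get_col(foo, "Filename")     # the column, no warning
# get_col(foo, "FileName")        # NULL, with a warning unless the option is FALSE
```

Setting options(warn = 2) would then turn the warning into an error for
anyone who wants a hard stop.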

> Generally when you extract in R and ask for a non-existent index you get 
> NA or NULL as the result (and no warning), e.g.
> 
> > y <- list(x=1, y=2)
> > y[["z"]]
> NULL
> 
> Because data frames 'must' have (column) names, they are a partial 
> exception and when the result is a data frame you get an error if it would 
> contain undefined columns.
> 
> But in the case of foo[, "FileName"], the result is a single column and so 
> will not have a name: there seems no reason to be different from
> 
> > foo[["FileName"]]
> NULL
> > foo$FileName
> NULL
> 
> which similarly select a single column.  At one time they were different 
> in R, for no documented reason.
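
For anyone following along, the convention described above can be checked
on plain vectors and lists. The object names below are made up for this
illustration only.

```r
# Extraction with a non-existent name: NA for atomic vectors, NULL for lists.
v <- c(a = 1, b = 2)
l <- list(x = 1, y = 2)
foo <- data.frame(Filename = c("a", "b"))

v_missing   <- v["z"]              # NA, with name <NA>
l_missing   <- l[["z"]]            # NULL
l_sub_names <- names(l["z"])       # NA: one-element list with an NA name

df_dbl    <- foo[["FileName"]]     # NULL, no error or warning
df_dollar <- foo$FileName          # NULL, no error or warning
```

The NA name in the l["z"] case is what the "undefined columns selected"
check looks for when the result is to be a data frame.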
> 
> 
> On Fri, 3 Aug 2007, Prof Brian Ripley wrote:
> 
> > You are reading the wrong part of the code for your argument list:
> >
> >>  foo["FileName"]
> > Error in `[.data.frame`(foo, "FileName") : undefined columns selected
> >
> > [.data.frame is one of the most complex functions in R, and does many 
> > different things depending on which arguments are supplied.
> >
> >
> > On Fri, 3 Aug 2007, Steven McKinney wrote:
> >
> >> Hi all,
> >> 
> >> What are current methods people use in R to identify
> >> mis-spelled column names when selecting columns
> >> from a data frame?
> >> 
> >> Alice Johnson recently tackled this issue
> >> (see [BioC] posting below).
> >> 
> >> Due to a mis-spelled column name ("FileName"
> >> instead of "Filename") which produced no warning,
> >> Alice spent a fair amount of time tracking down
> >> this bug.  With my fumbling fingers I'll be tracking
> >> down such a bug soon too.
> >> 
> >> Is there any options() setting, or debug technique
> >> that will flag data frame column extractions that
> >> reference a non-existent column?  It seems to me
> >> that the "[.data.frame" extractor used to throw an
> >> error if given a mis-spelled variable name, and I
> >> still see lines of code in "[.data.frame" such as
> >> 
> >> if (any(is.na(cols)))
> >>            stop("undefined columns selected")
> >> 
> >> 
> >> 
> >> In R 2.5.1 a NULL is silently returned.
> >> 
> >>> foo <- data.frame(Filename = c("a", "b"))
> >>> foo[, "FileName"]
> >> NULL
> >> 
> >> Has something changed so that the code lines
> >> if (any(is.na(cols)))
> >>            stop("undefined columns selected")
> >> in "[.data.frame" no longer work properly (if
> >> I am understanding the intention properly)?
> >> 
> >> If not, could  "[.data.frame" check an
> >> options() variable setting (say
> >> warn.undefined.colnames) and throw a warning
> >> if a non-existent column name is referenced?
> >> 
> >> 
> >> 
> >> 
> >>> sessionInfo()
> >> R version 2.5.1 (2007-06-27)
> >> powerpc-apple-darwin8.9.1
> >> 
> >> locale:
> >> en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
> >> 
> >> attached base packages:
> >> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"
> >> 
> >> other attached packages:
> >>     plotrix         lme4       Matrix      lattice
> >>     "2.2-3"  "0.99875-4" "0.999375-0"     "0.16-2"
> >>> 
> >> 
> >> 
> >> 
> >> Steven McKinney
> >> 
> >> Statistician
> >> Molecular Oncology and Breast Cancer Program
> >> British Columbia Cancer Research Centre
> >> 
> >> email: smckinney +at+ bccrc +dot+ ca
> >> 
> >> tel: 604-675-8000 x7561
> >> 
> >> BCCRC
> >> Molecular Oncology
> >> 675 West 10th Ave, Floor 4
> >> Vancouver B.C.
> >> V5Z 1L3
> >> Canada
> >> 
> >> 
> 
> 
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
> 
> 

> "[.data.frame" <- 
+ function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 
+     1) 
+ {
+ browser()
+     mdrop <- missing(drop)
+     Narg <- nargs() - (!mdrop)
+     if (Narg < 3) {
+         if (!mdrop) 
+             warning("drop argument will be ignored")
+         if (missing(i)) 
+             return(x)
+         if (is.matrix(i)) 
+             return(as.matrix(x)[i])
+         y <- NextMethod("[")
+         cols <- names(y)
+         if (!is.null(cols) && any(is.na(cols))) 
+             stop("undefined columns selected")
+         if (any(duplicated(cols))) 
+             names(y) <- make.unique(cols)
+         return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 
+             0L)))
+     }
+     if (missing(i)) {
+         if (missing(j) && drop && length(x) == 1L) 
+             return(.subset2(x, 1L))
+         y <- if (missing(j)) 
+             x
+         else .subset(x, j)
+         if (drop && length(y) == 1L) 
+             return(.subset2(y, 1L)) ## This returns a result before undefined columns check is done.  Is this intended?
+         cols <- names(y)
+         if (any(is.na(cols))) 
+             stop("undefined columns selected")
+         if (any(duplicated(cols))) 
+             names(y) <- make.unique(cols)
+         nrow <- .row_names_info(x, 2L)
+         if (drop && !mdrop && nrow == 1L) 
+             return(structure(y, class = NULL, row.names = NULL))
+         else return(structure(y, class = oldClass(x), row.names = .row_names_info(x, 
+             0L)))
+     }
+     xx <- x
+     cols <- names(xx)
+     x <- vector("list", length(x))
+     x <- .Call("R_copyDFattr", xx, x, PACKAGE = "base")
+     oldClass(x) <- attr(x, "row.names") <- NULL
+     if (!missing(j)) {
+         x <- x[j]
+         cols <- names(x)
+         if (any(is.na(cols))) 
+             stop("undefined columns selected")
+         nxx <- structure(seq_along(xx), names = names(xx))
+         sxx <- match(nxx[j], seq_along(xx))
+     }
+     else sxx <- seq_along(x)
+     rows <- NULL
+     if (is.character(i)) {
+         rows <- attr(xx, "row.names")
+         i <- pmatch(i, rows, duplicates.ok = TRUE)
+     }
+     for (j in seq_along(x)) {
+         xj <- xx[[sxx[j]]]
+         x[[j]] <- if (length(dim(xj)) != 2L) 
+             xj[i]
+         else xj[i, , drop = FALSE]
+     }
+     if (drop) {
+         n <- length(x)
+         if (n == 1L) 
+             return(x[[1L]])
+         if (n > 1L) {
+             xj <- x[[1L]]
+             nrow <- if (length(dim(xj)) == 2L) 
+                 dim(xj)[1L]
+             else length(xj)
+             drop <- !mdrop && nrow == 1L
+         }
+         else drop <- FALSE
+     }
+     if (!drop) {
+         if (is.null(rows)) 
+             rows <- attr(xx, "row.names")
+         rows <- rows[i]
+         if ((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
+             if (ina) 
+                 rows[is.na(rows)] <- "NA"
+             if (dup) 
+                 rows <- make.unique(as.character(rows))
+         }
+         if (any(duplicated(nm <- names(x)))) 
+             names(x) <- make.unique(nm)
+         if (is.null(rows)) 
+             rows <- attr(xx, "row.names")[i]
+         attr(x, "row.names") <- rows
+         oldClass(x) <- oldClass(xx)
+     }
+     x
+ }
>