[R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
Steven McKinney
smckinney at bccrc.ca
Fri Aug 3 22:33:59 CEST 2007
> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: Fri 8/3/2007 1:05 PM
> To: Steven McKinney
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
>
> I've since seen your followup; a more detailed explanation may help.
> The path through the code for your argument list does not go where you
> quoted, and there is a reason for it.
>
Using a copy of "[.data.frame" with a browser() call inserted, I have
traced the flow of execution. (My copy with the browser command is at
the end of this email.)
> foo[, "FileName"]
Called from: `[.data.frame`(foo, , "FileName")
Browse[1]> n
debug: mdrop <- missing(drop)
Browse[1]> n
debug: Narg <- nargs() - (!mdrop)
Browse[1]> n
debug: if (Narg < 3) {
    if (!mdrop)
        warning("drop argument will be ignored")
    if (missing(i))
        return(x)
    if (is.matrix(i))
        return(as.matrix(x)[i])
    y <- NextMethod("[")
    cols <- names(y)
    if (!is.null(cols) && any(is.na(cols)))
        stop("undefined columns selected")
    if (any(duplicated(cols)))
        names(y) <- make.unique(cols)
    return(structure(y, class = oldClass(x), row.names = .row_names_info(x,
        0L)))
}
Browse[1]> n
debug: if (missing(i)) {
    if (missing(j) && drop && length(x) == 1L)
        return(.subset2(x, 1L))
    y <- if (missing(j))
        x
    else .subset(x, j)
    if (drop && length(y) == 1L)
        return(.subset2(y, 1L))
    cols <- names(y)
    if (any(is.na(cols)))
        stop("undefined columns selected")
    if (any(duplicated(cols)))
        names(y) <- make.unique(cols)
    nrow <- .row_names_info(x, 2L)
    if (drop && !mdrop && nrow == 1L)
        return(structure(y, class = NULL, row.names = NULL))
    else return(structure(y, class = oldClass(x), row.names = .row_names_info(x,
        0L)))
}
Browse[1]> n
debug: if (missing(j) && drop && length(x) == 1L) return(.subset2(x,
1L))
Browse[1]> n
debug: y <- if (missing(j)) x else .subset(x, j)
Browse[1]> n
debug: if (drop && length(y) == 1L) return(.subset2(y, 1L))
Browse[1]> n
NULL
>
So `[.data.frame` is exiting after executing
+ if (drop && length(y) == 1L)
+ return(.subset2(y, 1L)) ## This returns a result before undefined columns check is done. Is this intended?
Couldn't the error check
+ cols <- names(y)
+ if (any(is.na(cols)))
+ stop("undefined columns selected")
be done before the above return()?
What would break if the error check on column names was done
before returning a NULL result due to incorrect column name spelling?
Why should
> foo[, "FileName"]
NULL
differ from
> foo[seq(nrow(foo)), "FileName"]
Error in `[.data.frame`(foo, seq(nrow(foo)), "FileName") :
undefined columns selected
>
Thank you for your explanations.
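
In the meantime, one way to get an error regardless of which branch of
"[.data.frame" is taken is to check the requested names before
extracting. What follows is only a minimal sketch: the helper name
get_col is hypothetical (it is not part of base R), and it assumes
columns are always requested by name.

```r
# Hypothetical helper (not in base R): extract columns by name and fail
# loudly if any requested name is absent from the data frame.
get_col <- function(x, j, drop = TRUE) {
    stopifnot(is.data.frame(x), is.character(j))
    missing_cols <- setdiff(j, names(x))
    if (length(missing_cols) > 0L)
        stop("undefined columns selected: ",
             paste(missing_cols, collapse = ", "))
    x[, j, drop = drop]
}

foo <- data.frame(Filename = c("a", "b"))
ok  <- get_col(foo, "Filename")                      # the column itself
res <- tryCatch(get_col(foo, "FileName"),            # mis-spelled name
                error = function(e) e)
print(conditionMessage(res))
```

Because the check runs before any subsetting, it does not depend on
whether the row index is supplied or on the value of drop.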
> Generally when you extract in R and ask for a non-existent index you get
> NA or NULL as the result (and no warning), e.g.
>
> > y <- list(x=1, y=2)
> > y[["z"]]
> NULL
>
> Because data frames 'must' have (column) names, they are a partial
> exception and when the result is a data frame you get an error if it would
> contain undefined columns.
>
> But in the case of foo[, "FileName"], the result is a single column and so
> will not have a name: there seems no reason to be different from
>
> > foo[["FileName"]]
> NULL
> > foo$FileName
> NULL
>
> which similarly select a single column. At one time they were different
> in R, for no documented reason.
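
To make the comparison above concrete, the sketch below records what
each extraction form does with the mis-spelled name. The try_extract
helper is hypothetical and exists only for this illustration. The
outcome of foo[, "FileName"] is reported rather than assumed: it was
NULL in R 2.5.1 as shown in this thread, and may differ in other
versions of R.

```r
foo <- data.frame(Filename = c("a", "b"))

# Illustration-only helper: classify an extraction as "NULL", "error",
# or "value". The argument is evaluated lazily inside tryCatch().
try_extract <- function(expr) {
    tryCatch({
        val <- expr
        if (is.null(val)) "NULL" else "value"
    }, error = function(e) "error")
}

results <- c(
    double_bracket = try_extract(foo[["FileName"]]),
    dollar         = try_extract(foo$FileName),
    single_bracket = try_extract(foo["FileName"]),
    rows_and_col   = try_extract(foo[seq(nrow(foo)), "FileName"]),
    col_only       = try_extract(foo[, "FileName"])  # version dependent
)
print(results)
```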
>
>
> On Fri, 3 Aug 2007, Prof Brian Ripley wrote:
>
> > You are reading the wrong part of the code for your argument list:
> >
> >> foo["FileName"]
> > Error in `[.data.frame`(foo, "FileName") : undefined columns selected
> >
> > [.data.frame is one of the most complex functions in R, and does many
> > different things depending on which arguments are supplied.
> >
> >
> > On Fri, 3 Aug 2007, Steven McKinney wrote:
> >
> >> Hi all,
> >>
> >> What are current methods people use in R to identify
> >> mis-spelled column names when selecting columns
> >> from a data frame?
> >>
> >> Alice Johnson recently tackled this issue
> >> (see [BioC] posting below).
> >>
> >> Due to a mis-spelled column name ("FileName"
> >> instead of "Filename") which produced no warning,
> >> Alice spent a fair amount of time tracking down
> >> this bug. With my fumbling fingers I'll be tracking
> >> down such a bug soon too.
> >>
> >> Is there any options() setting, or debug technique
> >> that will flag data frame column extractions that
> >> reference a non-existent column? It seems to me
> >> that the "[.data.frame" extractor used to throw an
> >> error if given a mis-spelled variable name, and I
> >> still see lines of code in "[.data.frame" such as
> >>
> >> if (any(is.na(cols)))
> >> stop("undefined columns selected")
> >>
> >>
> >>
> >> In R 2.5.1 a NULL is silently returned.
> >>
> >>> foo <- data.frame(Filename = c("a", "b"))
> >>> foo[, "FileName"]
> >> NULL
> >>
> >> Has something changed so that the code lines
> >> if (any(is.na(cols)))
> >> stop("undefined columns selected")
> >> in "[.data.frame" no longer work properly (if
> >> I am understanding the intention properly)?
> >>
> >> If not, could "[.data.frame" check an
> >> options() variable setting (say
> >> warn.undefined.colnames) and throw a warning
> >> if a non-existent column name is referenced?
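
A minimal sketch of how such an opt-in check could behave, written as a
user-level wrapper rather than as a change to "[.data.frame". Both names
are assumptions: warn.undefined.colnames is only the option name
suggested above (R does not recognise it), and select_col is a
hypothetical wrapper.

```r
# Hypothetical wrapper: extract one column by name, warning about an
# undefined name only when the (made-up) option is switched on.
select_col <- function(x, name) {
    if (!(name %in% names(x)) &&
        isTRUE(getOption("warn.undefined.colnames", default = FALSE)))
        warning("column '", name, "' is not defined", call. = FALSE)
    x[[name]]   # NULL when the column does not exist
}

foo <- data.frame(Filename = c("a", "b"))

old <- options(warn.undefined.colnames = TRUE)   # opt in
caught <- tryCatch(select_col(foo, "FileName"),
                   warning = function(w) w)
options(old)                                     # restore previous state

silent <- select_col(foo, "FileName")            # option off: plain NULL
```

With the option unset the wrapper behaves like foo[["FileName"]], so
existing code would be unaffected unless a user opts in.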
> >>
> >>
> >>
> >>
> >>> sessionInfo()
> >> R version 2.5.1 (2007-06-27)
> >> powerpc-apple-darwin8.9.1
> >>
> >> locale:
> >> en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
> >>
> >> attached base packages:
> >> [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
> >> "base"
> >>
> >> other attached packages:
> >> plotrix lme4 Matrix lattice
> >> "2.2-3" "0.99875-4" "0.999375-0" "0.16-2"
> >>>
> >>
> >>
> >>
> >> Steven McKinney
> >>
> >> Statistician
> >> Molecular Oncology and Breast Cancer Program
> >> British Columbia Cancer Research Centre
> >>
> >> email: smckinney +at+ bccrc +dot+ ca
> >>
> >> tel: 604-675-8000 x7561
> >>
> >> BCCRC
> >> Molecular Oncology
> >> 675 West 10th Ave, Floor 4
> >> Vancouver B.C.
> >> V5Z 1L3
> >> Canada
> >>
> >>
>
>
> --
> Brian D. Ripley, ripley at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>
>
>
> "[.data.frame" <-
+ function (x, i, j, drop = if (missing(i)) TRUE else length(cols) ==
+ 1)
+ {
+ browser()
+ mdrop <- missing(drop)
+ Narg <- nargs() - (!mdrop)
+ if (Narg < 3) {
+ if (!mdrop)
+ warning("drop argument will be ignored")
+ if (missing(i))
+ return(x)
+ if (is.matrix(i))
+ return(as.matrix(x)[i])
+ y <- NextMethod("[")
+ cols <- names(y)
+ if (!is.null(cols) && any(is.na(cols)))
+ stop("undefined columns selected")
+ if (any(duplicated(cols)))
+ names(y) <- make.unique(cols)
+ return(structure(y, class = oldClass(x), row.names = .row_names_info(x,
+ 0L)))
+ }
+ if (missing(i)) {
+ if (missing(j) && drop && length(x) == 1L)
+ return(.subset2(x, 1L))
+ y <- if (missing(j))
+ x
+ else .subset(x, j)
+ if (drop && length(y) == 1L)
+ return(.subset2(y, 1L)) ## This returns a result before undefined columns check is done. Is this intended?
+ cols <- names(y)
+ if (any(is.na(cols)))
+ stop("undefined columns selected")
+ if (any(duplicated(cols)))
+ names(y) <- make.unique(cols)
+ nrow <- .row_names_info(x, 2L)
+ if (drop && !mdrop && nrow == 1L)
+ return(structure(y, class = NULL, row.names = NULL))
+ else return(structure(y, class = oldClass(x), row.names = .row_names_info(x,
+ 0L)))
+ }
+ xx <- x
+ cols <- names(xx)
+ x <- vector("list", length(x))
+ x <- .Call("R_copyDFattr", xx, x, PACKAGE = "base")
+ oldClass(x) <- attr(x, "row.names") <- NULL
+ if (!missing(j)) {
+ x <- x[j]
+ cols <- names(x)
+ if (any(is.na(cols)))
+ stop("undefined columns selected")
+ nxx <- structure(seq_along(xx), names = names(xx))
+ sxx <- match(nxx[j], seq_along(xx))
+ }
+ else sxx <- seq_along(x)
+ rows <- NULL
+ if (is.character(i)) {
+ rows <- attr(xx, "row.names")
+ i <- pmatch(i, rows, duplicates.ok = TRUE)
+ }
+ for (j in seq_along(x)) {
+ xj <- xx[[sxx[j]]]
+ x[[j]] <- if (length(dim(xj)) != 2L)
+ xj[i]
+ else xj[i, , drop = FALSE]
+ }
+ if (drop) {
+ n <- length(x)
+ if (n == 1L)
+ return(x[[1L]])
+ if (n > 1L) {
+ xj <- x[[1L]]
+ nrow <- if (length(dim(xj)) == 2L)
+ dim(xj)[1L]
+ else length(xj)
+ drop <- !mdrop && nrow == 1L
+ }
+ else drop <- FALSE
+ }
+ if (!drop) {
+ if (is.null(rows))
+ rows <- attr(xx, "row.names")
+ rows <- rows[i]
+ if ((ina <- any(is.na(rows))) | (dup <- any(duplicated(rows)))) {
+ if (ina)
+ rows[is.na(rows)] <- "NA"
+ if (dup)
+ rows <- make.unique(as.character(rows))
+ }
+ if (any(duplicated(nm <- names(x))))
+ names(x) <- make.unique(nm)
+ if (is.null(rows))
+ rows <- attr(xx, "row.names")[i]
+ attr(x, "row.names") <- rows
+ oldClass(x) <- oldClass(xx)
+ }
+ x
+ }
>