[R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)

Steven McKinney smckinney at bccrc.ca
Fri Aug 3 23:50:05 CEST 2007


> What would break is that three methods for doing the same thing would
> give different answers.
> 
> Please do have the courtesy to actually read the detailed explanation you
> are given.

Sorry Prof. Ripley, I am attempting to read carefully, as this
issue has deeper coding/debugging implications, and as you
point out, 
  "[.data.frame is one of the most complex functions in R"
so please bear with me.  This change in behaviour has 
taken away a side-effect debugging tool, discussed below.


> 
> 
> On Fri, 3 Aug 2007, Steven McKinney wrote:
> 
> >
> >> -----Original Message-----
> >> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> >> Sent: Fri 8/3/2007 1:05 PM
> >> To: Steven McKinney
> >> Cc: r-help at stat.math.ethz.ch
> >> Subject: Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)
> >>
> >> I've since seen your followup, so a more detailed explanation may help.
> >> The path through the code for your argument list does not go where you
> >> quoted, and there is a reason for it.
> >
> >
> >> Generally when you extract in R and ask for a non-existent index you get
> >> NA or NULL as the result (and no warning), e.g.
> >>
> >>> y <- list(x=1, y=2)
> >>> y[["z"]]
> >> NULL
> >>
> >> Because data frames 'must' have (column) names, they are a partial
> >> exception and when the result is a data frame you get an error if it would
> >> contain undefined columns.
> >>
> >> But in the case of foo[, "FileName"], the result is a single column and so
> >> will not have a name: there seems no reason to be different from
> >>
> >>> foo[["FileName"]]
> >> NULL
> >>> foo$FileName
> >> NULL
> >>
> >> which similarly select a single column.  At one time they were different
> >> in R, for no documented reason.


This difference provided a side-effect debugging tool, in that where

  > bar <- foo[, "FileName"]

used to throw an error, alerting me to a typo, it now does not.

Having been burned by NULL results due to typos in code lines using
the $ extractor such as
 
  > bar <- foo$FileName

I learned to use
  > bar <- foo[, "FileName"]
to help cut down on typo bugs.  With the ubiquity of
camelCase object names, such typos are a constant hazard.


I am wondering what to do now to double check spelling
when accessing columns of a data frame.

If "[.data.frame" stays as is, can a debug mechanism
be implemented in R that forces strict adherence
to existing list names in debug mode?  This would also help debug
typos in camelCase names when using the $ and [[
extractors and accessors.

Are there other debugging tools already in R that
can help point out such camelCase list element
name typos?
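
One possible workaround, sketched here under stated assumptions: wrap
the extraction in a small helper that checks the requested name against
names() before extracting.  The helper name strictCol is hypothetical
and not part of base R; it relies only on names(), %in%, tolower(),
stop() and the [[ extractor.

```r
# Hypothetical helper (not part of base R): extract a single column by
# exact name, stopping with an error if no such column exists.
strictCol <- function(df, name) {
  stopifnot(is.data.frame(df), is.character(name),
            length(name) == 1L, !is.na(name))
  if (!(name %in% names(df))) {
    # Look for existing names that differ only in letter case,
    # e.g. "Filename" when "FileName" was requested.
    near <- names(df)[tolower(names(df)) == tolower(name)]
    hint <- if (length(near) > 0L) {
      paste0("; did you mean '", paste(near, collapse = "', '"), "'?")
    } else {
      ""
    }
    stop("undefined column selected: '", name, "'", hint, call. = FALSE)
  }
  df[[name]]
}

foo <- data.frame(Filename = c("a", "b"))

# Correct spelling: the column is returned.
bar <- strictCol(foo, "Filename")

# camelCase typo: an error is raised instead of a silent NULL.
# try() is used here only so that the example runs to completion.
res <- try(strictCol(foo, "FileName"), silent = TRUE)
```

Because the check is made explicitly, the error does not depend on how
"[.data.frame", [[ or $ treat an undefined name in a given version of R.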



> >>
> >>
> >> On Fri, 3 Aug 2007, Prof Brian Ripley wrote:
> >>
> >>> You are reading the wrong part of the code for your argument list:
> >>>
> >>>>  foo["FileName"]
> >>> Error in `[.data.frame`(foo, "FileName") : undefined columns selected
> >>>
> >>> [.data.frame is one of the most complex functions in R, and does many
> >>> different things depending on which arguments are supplied.
> >>>
> >>>
> >>> On Fri, 3 Aug 2007, Steven McKinney wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> What are current methods people use in R to identify
> >>>> mis-spelled column names when selecting columns
> >>>> from a data frame?
> >>>>
> >>>> Alice Johnson recently tackled this issue
> >>>> (see [BioC] posting below).
> >>>>
> >>>> Due to a mis-spelled column name ("FileName"
> >>>> instead of "Filename") which produced no warning,
> >>>> Alice spent a fair amount of time tracking down
> >>>> this bug.  With my fumbling fingers I'll be tracking
> >>>> down such a bug soon too.
> >>>>
> >>>> Is there any options() setting, or debug technique
> >>>> that will flag data frame column extractions that
> >>>> reference a non-existent column?  It seems to me
> >>>> that the "[.data.frame" extractor used to throw an
> >>>> error if given a mis-spelled variable name, and I
> >>>> still see lines of code in "[.data.frame" such as
> >>>>
> >>>> if (any(is.na(cols)))
> >>>>            stop("undefined columns selected")
> >>>>
> >>>>
> >>>>
> >>>> In R 2.5.1 a NULL is silently returned.
> >>>>
> >>>>> foo <- data.frame(Filename = c("a", "b"))
> >>>>> foo[, "FileName"]
> >>>> NULL
> >>>>
> >>>> Has something changed so that the code lines
> >>>> if (any(is.na(cols)))
> >>>>            stop("undefined columns selected")
> >>>> in "[.data.frame" no longer work properly (if
> >>>> I am understanding the intention properly)?
> >>>>
> >>>> If not, could  "[.data.frame" check an
> >>>> options() variable setting (say
> >>>> warn.undefined.colnames) and throw a warning
> >>>> if a non-existent column name is referenced?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> sessionInfo()
> >>>> R version 2.5.1 (2007-06-27)
> >>>> powerpc-apple-darwin8.9.1
> >>>>
> >>>> locale:
> >>>> en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
> >>>>
> >>>> attached base packages:
> >>>> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"
> >>>>
> >>>> other attached packages:
> >>>>     plotrix         lme4       Matrix      lattice
> >>>>     "2.2-3"  "0.99875-4" "0.999375-0"     "0.16-2"
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> Steven McKinney
> >>>>
> >>>> Statistician
> >>>> Molecular Oncology and Breast Cancer Program
> >>>> British Columbia Cancer Research Centre
> >>>>
> >>>> email: smckinney +at+ bccrc +dot+ ca
> >>>>
> >>>> tel: 604-675-8000 x7561
> >>>>
> >>>> BCCRC
> >>>> Molecular Oncology
> >>>> 675 West 10th Ave, Floor 4
> >>>> Vancouver B.C.
> >>>> V5Z 1L3
> >>>> Canada
> >>>>
> >>>>
> >>
> >>
> >> --
> >> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> >> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >> University of Oxford,             Tel:  +44 1865 272861 (self)
> >> 1 South Parks Road,                     +44 1865 272866 (PA)
> >> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >>
> >>
> >>
> >
> >


