[R] FW: Selecting undefined column of a data frame (was [BioC]read.phenoData vs read.AnnotatedDataFrame)

Sat Aug 4 02:02:28 CEST 2007

Hi Bert,

> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com]
> Sent: Fri 8/3/2007 3:19 PM
> To: Steven McKinney; r-help at stat.math.ethz.ch
> Subject: RE: [R] FW: Selecting undefined column of a data frame (was [BioC]read.phenoData vs read.AnnotatedDataFrame)
>  
> I suspect you'll get some creative answers, but if all you're worried about
> is whether a column exists before you do something with it, what's wrong
> with:
> 
> nm <- ... ## a character vector of names
> if(!all(nm %in% names(yourdata))) ## complain
> else ## do something
> 
> 
> I think this is called defensive programming.

This is a good example of defensive programming.
I do indeed check variable/object names whenever
obtaining them from an external source (user input,
file input, a list in code).
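
A minimal sketch of that kind of check, wrapped in a function so it can
be reused. The name require_cols is hypothetical (it is not part of base
R); it only packages the %in% test from Bert's outline.

```r
# Hypothetical helper (not in base R): stop with an informative message
# if any of the requested column names is absent from the data frame.
require_cols <- function(df, nm) {
  missing_nm <- setdiff(nm, names(df))
  if (length(missing_nm) > 0) {
    stop("undefined columns: ", paste(missing_nm, collapse = ", "))
  }
  invisible(TRUE)
}

foo <- data.frame(Filename = c("a", "b"))
require_cols(foo, "Filename")                       # passes silently
res <- try(require_cols(foo, "FileName"), silent = TRUE)
inherits(res, "try-error")                          # TRUE: the typo is caught
```

Calling require_cols() once, right after the names arrive from the
external source, keeps the later extraction code unchanged.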

I was able to practice a defensive programming style in the past
by using
 > bar <- foo[, "FileName"]
instead of
 > bar <- foo$FileName

because, as I recall, the former threw an error for an undefined
column, but this has changed recently, so I need to figure out
some other mechanisms.

R is such a productive language, but this change will
lead many of us to chase elusive typos that used to
get revealed.

I'm hoping that some kind of explicit data frame variable
checking mechanism might be introduced since we've
lost this one.  

It would also be great to have such a
mechanism to help catch list access and extraction
errors.  Why should
foo$FileName
always quietly return NULL?
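
As a rough sketch of what I mean, a strict accessor could look like the
following. The function strict_get is hypothetical (nothing like it is
built in, as far as I know); it only illustrates the behaviour I am
asking for, for both lists and data frames.

```r
# Hypothetical strict accessor: like x[[name]], but an absent name is an
# error rather than a silent NULL.
strict_get <- function(x, name) {
  if (!is.character(name) || length(name) != 1L) {
    stop("'name' must be a single character string")
  }
  if (!(name %in% names(x))) {
    stop("no element named '", name, "'")
  }
  x[[name]]
}

foo <- list(Filename = c("a.cel", "b.cel"))
strict_get(foo, "Filename")                         # the two file names
res <- try(strict_get(foo, "FileName"), silent = TRUE)
inherits(res, "try-error")                          # TRUE: error, not NULL
```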

I'm not sure why the following incongruity is okay.

> foo <- matrix(1:4, nrow = 2)
> dimnames(foo) <- list(NULL, c("a", "b"))
> bar <- foo[, "A"]
Error: subscript out of bounds

> foo.df <- as.data.frame(foo)
> foo.df
  a b
1 1 3
2 2 4
> bar <- foo.df[, "A"]
> bar
NULL
> 

It is a lot of extra typing to wrap every command in
extra code, but more of that will need to happen
going forward.

Steve McKinney

> 
> Bert Gunter
> Genentech
> 
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Steven McKinney
> Sent: Friday, August 03, 2007 10:38 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] FW: Selecting undefined column of a data frame (was
> [BioC]read.phenoData vs read.AnnotatedDataFrame)
> 
> Hi all,
> 
> What are current methods people use in R to identify
> mis-spelled column names when selecting columns
> from a data frame?
> 
> Alice Johnstone recently tackled this issue
> (see [BioC] posting below).
> 
> Due to a mis-spelled column name ("FileName"
> instead of "Filename") which produced no warning,
> Alice spent a fair amount of time tracking down
> this bug.  With my fumbling fingers I'll be tracking
> down such a bug soon too.
> 
> Is there any options() setting, or debug technique
> that will flag data frame column extractions that
> reference a non-existent column?  It seems to me
> that the "[.data.frame" extractor used to throw an
> error if given a mis-spelled variable name, and I
> still see lines of code in "[.data.frame" such as
> 
> if (any(is.na(cols))) 
>             stop("undefined columns selected")
> 
> 
> 
> In R 2.5.1 a NULL is silently returned.
> 
> > foo <- data.frame(Filename = c("a", "b"))
> > foo[, "FileName"]
> NULL
> 
> Has something changed so that the code lines
> if (any(is.na(cols))) 
>             stop("undefined columns selected")
> in "[.data.frame" no longer work properly (if
> I am understanding the intention properly)?
> 
> If not, could  "[.data.frame" check an
> options() variable setting (say
> warn.undefined.colnames) and throw a warning
> if a non-existent column name is referenced?
> 
> 
> 
> 
> > sessionInfo()
> R version 2.5.1 (2007-06-27) 
> powerpc-apple-darwin8.9.1 
> 
> locale:
> en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
> 
> attached base packages:
> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
> "base"     
> 
> other attached packages:
>      plotrix         lme4       Matrix      lattice 
>      "2.2-3"  "0.99875-4" "0.999375-0"     "0.16-2" 
> > 
> 
> 
> 
> Steven McKinney
> 
> Statistician
> Molecular Oncology and Breast Cancer Program
> British Columbia Cancer Research Centre
> 
> email: smckinney +at+ bccrc +dot+ ca
> 
> tel: 604-675-8000 x7561
> 
> BCCRC
> Molecular Oncology
> 675 West 10th Ave, Floor 4
> Vancouver B.C. 
> V5Z 1L3
> Canada
> 
> 
> 
> 
> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch on behalf of Johnstone, Alice
> Sent: Wed 8/1/2007 7:20 PM
> To: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame
>  
> For interest's sake, I have found out why I wasn't getting my expected
> results when using read.AnnotatedDataFrame.
> It turns out the error was made in the ReadAffy command, where I specified
> the filenames to be read from my AnnotatedDataFrame object.  There was a
> typo with a capital N ($FileName) rather than lowercase n
> ($Filename) as in my target file. Whoops.  However, this meant the
> filename argument was ignored without an error message(!) and instead
> of using the information in the AnnotatedDataFrame object (which
> included filenames, but not alphabetically) it read the .cel files in
> alphabetical order from the working directory - hence the wrong file was
> given the wrong label (given by the order of the Annotated object) and my
> comparisons were confused without it being obvious why or where.
> Our solution: specify that the filename is as.character so assignment of
> file to target is correct (after correcting $Filename) now that we are
> using read.AnnotatedDataFrame rather than read.phenoData.
> 
> Data<-ReadAffy(filenames=as.character(pData(pd)$Filename),phenoData=pd)
> 
> Hurrah!
> 
> It may be beneficial to others if, when the filename argument isn't
> specified, filenames were read from the phenoData object if included
> here.
> 
> Thanks!
> 
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org] 
> Sent: Thursday, 26 July 2007 11:49 a.m.
> To: Johnstone, Alice
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame
> 
> Hi Alice --
> 
> "Johnstone, Alice" <Alice.Johnstone at esr.cri.nz> writes:
> 
> > Using R2.5.0 and Bioconductor I have been following code to analyse 
> > Affymetrix expression data: 2 treatments vs control.  The original 
> > code was run last year and used the read.phenoData command, however 
> > with the newer version I get the message:
> > Warning messages:
> > read.phenoData is deprecated, use read.AnnotatedDataFrame instead
> > The phenoData class is deprecated, use AnnotatedDataFrame (with
> > ExpressionSet) instead
> >  
> > I use the read.AnnotatedDataFrame command, but when it comes to the 
> > end of the analysis the comparison of the treatment to the controls 
> > gets mixed up compared to what you get using the original 
> > read.phenoData, i.e. it looks like the 3 groups get labelled wrong and so
> > the comparisons are different (but they can still be matched up).
> > My questions are,
> > 1) do you need to set up your target file differently when using 
> > read.AnnotatedDataFrame - what is the standard format?
> 
> I can't quite tell where things are going wrong for you, so it would
> help if you can narrow down where the problem occurs.  I think
> read.AnnotatedDataFrame should be comparable to read.phenoData. Does
> 
> > pData(pd)
> 
> look right? What about
> 
> > pData(Data)
> 
> and
> 
> > pData(eset.rma)
> 
> ? It's not important but pData(pd)$Target is the same as pd$Target.
> Since the analysis is on eset.rma, it probably makes sense to use the
> pData from there to construct your design matrix
> 
> > targs<-factor(eset.rma$Target)
> > design<-model.matrix(~0+targs)
> > colnames(design)<-levels(targs)
> 
> Does design look right?
> 
> > I have three columns sample, filename and target.
> > 2) do you need to use a different model matrix to what I have?  
> > 3) do you use a different command for making the contrasts?
> 
> Depends on the question! If you're performing the same analysis as last
> year, then the model matrix and contrasts have to be the same!
> 
> > I have included my code below if that is of any assistance.
> > Many Thanks!
> > Alice
> >  
> >  
> >  
> > ##Read data
> > pd<-read.AnnotatedDataFrame("targets.txt",header=T,row.name="sample")
> > Data<-ReadAffy(filenames=pData(pd)$FileName,phenoData=pd)
> > ##normalisation
> > eset.rma<-rma(Data)
> > ##analysis
> > targs<-factor(pData(pd)$Target)
> > design<-model.matrix(~0+targs)
> > colnames(design)<-levels(targs)
> > fit<-lmFit(eset.rma,design)
> > cont.wt<-makeContrasts("treatment1-control","treatment2-control",levels=design)
> > fit2<-contrasts.fit(fit,cont.wt)
> > fit2.eb<-eBayes(fit2)
> > testconts<-classifyTestsF(fit2.eb,p.value=0.01)
> > topTable(fit2.eb,coef=2,n=300)
> > topTable(fit2.eb,coef=1,n=300)
> >  
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: 
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> --
> Martin Morgan
> Bioconductor / Computational Biology
> http://bioconductor.org
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
>