[Bioc-devel] Changes in AnnotationDbi

Mon Jun 8 21:12:42 CEST 2015

OK Jim,

I will put very simple messages in (one liners) that will simply state 
whether the relationship between keys and the requested columns was 1:1, 
1:many, many:1, or many:many.   Hopefully this will represent an 
acceptable compromise.
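
For illustration only, here is a minimal base R sketch of how such a
one-line note could be derived from a lookup result. The helper
classify_mapping() and the toy table are hypothetical; this is not
AnnotationDbi code, and the real select() may decide this differently.

```r
# Hypothetical helper (not part of AnnotationDbi): classify the relationship
# between a key column and a value column of a lookup result.
classify_mapping <- function(keys, values) {
  # Work on distinct key/value pairs so repeated rows do not count twice
  pairs <- unique(data.frame(key = keys, value = values,
                             stringsAsFactors = FALSE))
  # TRUE if any key maps to more than one distinct value
  key_fans_out <- any(table(pairs$key) > 1)
  # TRUE if any value is shared by more than one distinct key
  value_shared <- any(table(pairs$value) > 1)
  if (key_fans_out && value_shared) {
    "many:many"
  } else if (key_fans_out) {
    "1:many"
  } else if (value_shared) {
    "many:1"
  } else {
    "1:1"
  }
}

# Toy stand-in for a select() result: probe p2 maps to two Entrez IDs
res <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("10", "20", "21", "30"),
  stringsAsFactors = FALSE
)

# One-line note, emitted with message() rather than warning()
message("mapping between keys and columns is ",
        classify_mapping(res$PROBEID, res$ENTREZID))
```

A message() is used here rather than a warning() because the lookup
itself has not failed; the note only describes the shape of the result.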

  Marc

On 06/05/2015 08:37 AM, James W. MacDonald wrote:
> I agree that a warning is probably not the way to go, as it does imply 
> that there might have been something wrong with either the input or 
> output. Plus, not everybody understands the distinction between error 
> and warning.
>
> And having additional documentation can't possibly hurt. But that 
> assumes that most/some/all of the end users both peruse and understand 
> the documentation, which we all know is not the case. The main issue, 
> for me at least, is that a significant proportion of people seem to 
> think there is some sort of uniqueness imposed on things like Entrez 
> Gene IDs and Hugo symbols, etc. While that is the ultimate goal, we do 
> not have and maybe never will achieve unique IDs for each annotatable 
> object.
>
> I used to work for a PI who was a very smart, well informed 
> statistical geneticist who was absolutely shocked when I informed her 
> that a) there are SNPs in dbSNP that have more than one RS ID, and 
> that b) there are RS IDs in dbSNP that have been assigned to multiple 
> SNPs. She just assumed that there was a one-to-one RS ID -> SNP mapping.
>
> So this is to me the crux of the problem. It is perfectly valid to 
> return one-to-many mappings, and that is what should be expected by 
> those of us who already understand such things. But for those of us 
> who are ignorant of the details, and those who assume uniqueness of 
> IDs, it would be really nice if they got a message telling them 
> something like
>
> Please note that there are one-to-many mappings between the input and 
> output IDs, so the output is longer than your input vector. Please see 
> ?select for more detail.
>
> And if the message is objectionable to some, you could give the option 
> for people to set a global flag to shut it off. Something like
>
> if(!pleaseMakeItStop)
>   message(<message goes here>)
>
> and they could set
>
> pleaseMakeItStop = TRUE in their .Rprofile
>
> Is that a reasonable compromise?
>
> Jim
>
>
>
> On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarlson at fredhutch.org> wrote:
>
>     Hi Jim,
>
>     I do agree that the warning was protective for that (this is why I
>     put it there).
>
>     But it was also annoying for many and a source of some confusion
>     because when people see a warning() they think that something has
>     gone wrong with the code that was just run.  And in this case the
>     select method was actually doing exactly what it was supposed to
>     be doing.  What it was actually warning you about was what you did
>     separately in that assignment to fit2...  Which is the step right
>     after the select method already did its work.  And I can
>     understand why that seems a little bit confusing since you are
>     basically telling someone to be careful with the data you just
>     gave them.
>
>     Now I could replace it with a message() I guess, but in cases like
>     this where the warning is about something that happens outside of
>     the function you are calling, shouldn't that probably be handled
>     by documentation?  Or at least, that is the argument that finally
>     persuaded me to remove it.  That and the fact that almost every
>     call to select() ended up accompanied by the warning you
>     mentioned, because it turns out that perfect 1:1 relationships are
>     pretty rare for annotation data.  Very often, you are going to get
>     back multiple results.
>
>     But I didn't just remove the warning, I also supplied an
>     alternative for people who have a real need for consistent 1:1
>     mapping.
>
>     The mapIds() method takes most of the same arguments as select,
>     except that unlike select(), it only looks up one column and it
>     always returns a vector that is the same size as the vector that
>     came in.
>
>     So for your example, you could do something like this pseudocode here:
>
>     mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
>     keytype="PROBEID")
>
>     And mapIds will follow a rule specified by the default value for
>     the multiVals argument so that you can get back your results in a
>     1:1 manner.  And if you don't like any of the options available
>     for the multiVals argument, you can make your own function and
>     pass it in.
>
>
>     Anyhow, please continue to let us know what you think.
>
>
>      Marc
>
>
>
>
>
>
>
>     On 06/04/2015 10:50 AM, James W. MacDonald wrote:
>
>         In the last release, the warning message from select() telling
>         people that their results include one-to-many mappings was
>         removed. While some may find this warning annoying, I think
>         silently returning something unexpected to our users is
>         dangerous.
>
>         In other words, for me it is a common practice to do something
>         like this:
>
>         fit <- lmFit(eset, design)
>         fit2 <- eBayes(fit)
>         gns <- select(<chippackage>, featureNames(eset),
>                       c("ENTREZID","SYMBOL"))
>         gns <- gns[!duplicated(gns[,1]),]
>         fit2$genes <- gns
>
>         I add in the step where dups are removed because I already
>         know they are there. But a naive user might instead do
>
>         fit2$genes <- select(<chippackage>, featureNames(eset),
>                              c("ENTREZID","SYMBOL"))
>
>         Which will work just fine, but then all the annotation (except
>         for the first few lines) will now be completely incorrect, and
>         there wasn't a warning to let the end user know that they may
>         have made a mistake.
>
>         lmFit() will parse the featureData slot of an ExpressionSet
>         and use those data for annotation, so that gives some
>         hypothetical protections, for those who first put their
>         annotation data into their ExpressionSet. However, ?eSet says:
>
>           ‘featureData’: Contains variables describing features (i.e.,
>                    rows in ‘assayData’) unique to this experiment. Use the
>                    ‘annotation’ slot to efficiently reference feature data
>                    common to the annotation package used in the experiment.
>                    Class: ‘AnnotatedDataFrame-class’
>
>         Which to me indicates that the featureData slot isn't really
>         intended to contain annotation data, but instead some unique
>         information that pertains to a given experiment. But maybe I
>         misunderstand.
>
>         Is the featureData slot actually intended for annotation data?
>         If not, what is the intended pipeline for annotating data in an
>         ExpressionSet? Am I alone in being concerned about this?
>
>         Best,
>
>         Jim
>
>
>
>     _______________________________________________
>     Bioc-devel at r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
