[Bioc-devel] Changes in AnnotationDbi
James W. MacDonald
jmacdon at uw.edu
Mon Jun 8 21:17:48 CEST 2015
On Mon, Jun 8, 2015 at 3:12 PM, Marc Carlson <mcarlson at fredhutch.org> wrote:
> OK Jim,
> I will put very simple messages in (one liners) that will simply state
> whether the relationship between keys and the requested columns was 1:1,
> 1:many, many:1, or many:many. Hopefully this will represent an acceptable
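For what it's worth, here is a rough base-R sketch of how such a one-liner could be derived from the keys and the values that came back. The function name classifyMapping and the toy IDs are made up for illustration; this is not the code in AnnotationDbi, just one way to think about the four cases.

```r
# Hypothetical helper: classify the key -> value relationship of a lookup.
# 'keys' and 'values' are parallel vectors (one element per returned row).
classifyMapping <- function(keys, values) {
  pairs <- unique(data.frame(k = keys, v = values, stringsAsFactors = FALSE))
  pairs <- pairs[!is.na(pairs$v), ]          # unmatched keys carry no information
  key_multi <- any(duplicated(pairs$k))      # some key maps to several values
  val_multi <- any(duplicated(pairs$v))      # some value is shared by several keys
  if (key_multi && val_multi) {
    "many:many"
  } else if (key_multi) {
    "1:many"
  } else if (val_multi) {
    "many:1"
  } else {
    "1:1"
  }
}

# Toy usage: probe "a" maps to two IDs, so the relationship is 1:many.
message("mapping between keys and columns was ",
        classifyMapping(c("a", "a", "b"), c("1", "2", "3")))
```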
> On 06/05/2015 08:37 AM, James W. MacDonald wrote:
> I agree that a warning is probably not the way to go, as it does imply
> that there might have been something wrong with either the input or output.
> Plus, not everybody understands the distinction between error and warning.
> And having additional documentation can't possibly hurt. But that
> assumes that most/some/all of the end users both peruse and understand the
> documentation, which we all know is not the case. The main issue, for me at
> least, is that a significant proportion of people seem to think there is
> some sort of uniqueness imposed on things like Entrez Gene IDs and Hugo
> symbols, etc. While that is the ultimate goal, we do not have and maybe
> never will achieve unique IDs for each annotatable object.
> I used to work for a PI who was a very smart, well informed statistical
> geneticist who was absolutely shocked when I informed her that a) there are
> SNPs in dbSNP that have more than one RS ID, and that b) there are RS IDs
> in dbSNP that have been assigned to multiple SNPs. She just assumed that
> there was a one-to-one RS ID -> SNP mapping.
> So this is to me the crux of the problem. It is perfectly valid to
> return one-to-many mappings, and that is what should be expected by
> those of us who already understand such things. But for those of us who
> are ignorant of the details, and those who assume uniqueness of IDs, it
> would be really nice if they got a message telling them something like
> *Please note that there are one-to-many mappings between the input and
> output IDs, so the output is longer than your input vector. Please see
> ?select for more detail.*
> And if the message is objectionable to some, you could give the option
> for people to set a global flag to shut it off. Something like
> message(<message goes here>)
> and they could set
> pleaseMakeItStop = TRUE in their .Rprofile
> Is that a reasonable compromise?
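To make the suggestion concrete, a minimal sketch in base R of a message that can be silenced with a global option. The option name pleaseMakeItStop is the tongue-in-cheek one from above, and noteOneToMany is a made-up helper; neither is an actual AnnotationDbi feature.

```r
# Hypothetical helper: emit a one-to-many note unless the user opted out.
# "pleaseMakeItStop" is an illustrative option name, not a real package option.
noteOneToMany <- function(keys, result_keys) {
  # A key maps to several rows when it occurs more than once in the result.
  has_dups <- any(duplicated(result_keys[result_keys %in% keys]))
  quiet <- isTRUE(getOption("pleaseMakeItStop", default = FALSE))
  if (has_dups && !quiet) {
    message("Please note that there are one-to-many mappings between the ",
            "input and output IDs, so the output is longer than your input ",
            "vector. Please see ?select for more detail.")
  }
  invisible(has_dups)
}
```

A user who has seen enough would put options(pleaseMakeItStop = TRUE) in their .Rprofile. Because it is a message() rather than a warning(), suppressMessages() would also quiet it for a single call.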
> On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarlson at fredhutch.org> wrote:
>> Hi Jim,
>> I do agree that the warning was protective for that (this is why I put it
>> But it was also annoying for many and a source of some confusion because
>> when people see a warning() they think that something has gone wrong with
>> the code that was just run. And in this case the select method was
>> actually doing exactly what it was supposed to be doing. What it was
>> actually warning you about was what you did separately in that assignment
>> to fit2... Which is the step right after the select method already did
>> its work. And I can understand why that seems a little bit confusing
>> since you are basically telling someone to be careful with the data you
>> just gave them.
>> Now I could replace it with a message() I guess, but in cases like this
>> where the warning is about something that happens outside of the function
>> you are calling, shouldn't that probably be handled by documentation? Or
>> at least, that is the argument that finally persuaded me to remove it.
>> That and the fact that almost every call to select() ended up accompanied
>> by the warning you mentioned, because it turns out that perfect 1:1
>> relationships are pretty rare for annotation data. Very often, you are
>> going to get back multiple results.
>> But I didn't just remove the warning, I also supplied an alternative for
>> people who have a real need for consistent 1:1 mapping.
>> The mapIds() method takes most of the same arguments as select, except
>> that unlike select(), it only looks up one column and it always returns a
>> vector that is the same size as the vector that came in.
>> So for your example, you could do something like this pseudocode here:
>> mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
>> And mapIds will follow a rule specified by the default value for the
>> multiVals argument so that you can get back your results in a 1:1 manner.
>> And if you don't like any of the options available for the multiVals
>> argument, you can make your own function and pass it in.
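As an illustration of the 1:1 behaviour described here, a small base-R sketch that keeps the first match per key, which is roughly what a "first"-style rule does. It does not call mapIds() itself; the table, the IDs, and the helper name firstMatch are all invented for the example.

```r
# Toy one-to-many lookup table standing in for select() output (made-up IDs).
tab <- data.frame(PROBEID  = c("p1", "p2", "p2", "p3"),
                  ENTREZID = c("10", "20", "21", "30"),
                  stringsAsFactors = FALSE)
keys <- c("p1", "p2", "p3", "p4")

# Hypothetical helper: one value per key, same length and order as 'keys'.
# match() returns the position of the first hit, and NA for keys with no hit.
firstMatch <- function(keys, tab, keycol, valcol) {
  out <- tab[[valcol]][match(keys, tab[[keycol]])]
  names(out) <- keys
  out
}

ids <- firstMatch(keys, tab, "PROBEID", "ENTREZID")
length(ids) == length(keys)   # the result lines up with the input vector
```

The point of the design is that the result can be assigned next to the input without any further bookkeeping; the price is that the second ID for "p2" is silently dropped, which is why the rule for handling multiple values is something the caller should choose deliberately.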
>> Anyhow, please continue to let us know what you think.
>> On 06/04/2015 10:50 AM, James W. MacDonald wrote:
>>> In the last release, the warning message from select() telling people
>>> their results include one-to-many mappings was removed. While some may find
>>> this warning annoying, I think silently returning something unexpected to
>>> our users is dangerous.
>>> In other words, for me it is a common practice to do something like this:
>>> fit <- lmFit(eset, design)
>>> fit2 <- eBayes(fit)
>>> gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
>>> gns <- gns[!duplicated(gns[,1]),]
>>> fit2$genes <- gns
>>> I add in the step where dups are removed because I already know they are
>>> there. But a naive user might instead do
>>> fit2$genes <- select(<chippackage>, featureNames(eset),
>>> Which will work just fine, but then all the annotation (except for the
>>> first few lines) will now be completely incorrect, and there wasn't a
>>> warning to let the end user know that they may have made a mistake.
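A toy base-R illustration of the failure mode described here, assuming the annotation rows are later paired with the features by position. The probe IDs and symbols are invented, and the data frame merely stands in for what select() might return when one probe maps to two genes.

```r
# Three features, but the lookup returns four rows because "p1" maps twice.
probes <- c("p1", "p2", "p3")
sel <- data.frame(PROBEID = c("p1", "p1", "p2", "p3"),
                  SYMBOL  = c("A", "A2", "B", "C"),
                  stringsAsFactors = FALSE)

# Naive positional pairing: row i of the annotation is taken for feature i.
naive <- sel[seq_along(probes), ]
misaligned <- naive$PROBEID != probes   # every row after the first duplicate is off

# Dropping duplicated keys first restores one row per feature, in input order
# (this relies on the rows coming back in the same order as the keys).
gns <- sel[!duplicated(sel$PROBEID), ]
aligned <- identical(gns$PROBEID, probes)
```

Nothing errors in the naive version, which is exactly the problem: the mismatch is only visible if you compare the key column against the feature names.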
>>> lmFit() will parse the featureData slot of an ExpressionSet and use those
>>> data for annotation, so that gives some hypothetical protections, for those
>>> who first put their annotation data into their ExpressionSet. However,
>>> ?eSet says:
>>> ‘featureData’: Contains variables describing features (i.e., rows
>>> in ‘assayData’) unique to this experiment. Use the
>>> ‘annotation’ slot to efficiently reference feature data
>>> common to the annotation package used in the experiment.
>>> Class: ‘AnnotatedDataFrame-class’
>>> Which to me indicates that the featureData slot isn't really intended to
>>> contain annotation data, but instead some unique information that pertains
>>> to a given experiment. But maybe I misunderstand.
>>> Is the featureData slot actually intended for annotation data? If not, what
>>> is the intended pipeline for annotating data in an ExpressionSet? Am I
>>> alone in being concerned about this?
> James W. MacDonald, M.S.
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099