[Bioc-devel] Changes in AnnotationDbi

Fri Jun 5 17:37:21 CEST 2015

I agree that a warning is probably not the way to go, as it does imply that
there might have been something wrong with either the input or output.
Plus, not everybody understands the distinction between error and warning.

And having additional documentation can't possibly hurt. But that assumes
that most/some/all of the end users both peruse and understand the
documentation, which we all know is not the case. The main issue, for me at
least, is that a significant proportion of people seem to think there is
some sort of uniqueness imposed on things like Entrez Gene IDs and Hugo
symbols, etc. While that is the ultimate goal, we do not have and maybe
never will achieve unique IDs for each annotatable object.

I used to work for a PI who was a very smart, well informed statistical
geneticist who was absolutely shocked when I informed her that a) there are
SNPs in dbSNP that have more than one RS ID, and that b) there are RS IDs
in dbSNP that have been assigned to multiple SNPs. She just assumed that
there was a one-to-one RS ID -> SNP mapping.

So this is to me the crux of the problem. It is perfectly valid to return
one-to-many mappings, and that is what should be expected by those of us
who already understand such things. But for those of us who are ignorant
of the details, and those who assume uniqueness of IDs, it would be really
nice if they got a message telling them something like

*Please note that there are one-to-many mappings between the input and
output IDs, so the output is longer than your input vector. Please see
?select for more detail.*

And if the message is objectionable to some, you could give the option for
people to set a global flag to shut it off. Something like

if(!pleaseMakeItStop)
  message(<message goes here>)

and they could set

pleaseMakeItStop = TRUE in their .Rprofile
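
A minimal base-R sketch of that idea follows. The helper name
noteOneToMany and the option name annotation.quietOneToMany are made up
for illustration and are not part of AnnotationDbi. The sketch also uses
options()/getOption() in place of a bare global variable, which is a
substitution for the flag suggested above.

```r
# Hypothetical helper, not part of AnnotationDbi.
# 'keys' is the input ID vector, 'res' the data.frame a lookup returned.
noteOneToMany <- function(keys, res) {
  # TRUE when the result has more rows than there were input IDs
  longer <- nrow(res) > length(keys)
  # Option name is an assumption; default FALSE means the note is shown
  quiet <- isTRUE(getOption("annotation.quietOneToMany", FALSE))
  if (longer && !quiet) {
    message("Please note that there are one-to-many mappings between the ",
            "input and output IDs, so the output is longer than your ",
            "input vector. Please see ?select for more detail.")
  }
  invisible(longer)
}

# Toy lookup result: probe "p2" maps to two Entrez Gene IDs
keys <- c("p1", "p2", "p3")
res <- data.frame(PROBEID = c("p1", "p2", "p2", "p3"),
                  ENTREZID = c("10", "20", "21", "30"),
                  stringsAsFactors = FALSE)

noteOneToMany(keys, res)                   # emits the note
options(annotation.quietOneToMany = TRUE)  # e.g. set once in .Rprofile
noteOneToMany(keys, res)                   # silent
```

Because it is a message() and not a warning(), it can also be silenced
per call with suppressMessages().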

Is that a reasonable compromise?

Jim

On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarlson at fredhutch.org> wrote:

> Hi Jim,
>
> I do agree that the warning was protective for that (this is why I put it
> there).
>
> But it was also annoying for many and a source of some confusion because
> when people see a warning() they think that something has gone wrong with
> the code that was just run.  And in this case the select method was
> actually doing exactly what it was supposed to be doing.  What it was
> actually warning you about was what you did separately in that assignment
> to fit2...  Which is the step right after the select method already did
> its work.  And I can understand why that seems a little bit confusing
> since you are basically telling someone to be careful with the data you
> just gave them.
>
> Now I could replace it with a message() I guess, but in cases like this
> where the warning is about something that happens outside of the function
> you are calling, shouldn't that probably be handled by documentation?  Or
> at least, that is the argument that finally persuaded me to remove it.
> That and the fact that almost every call to select() ended up accompanied
> by the warning you mentioned, because it turns out that perfect 1:1
> relationships are pretty rare for annotation data.  Very often, you are
> going to get back multiple results.
>
> But I didn't just remove the warning, I also supplied an alternative for
> people who have a real need for consistent 1:1 mapping.
>
> The mapIds() method takes most of the same arguments as select, except
> that unlike select(), it only looks up one column and it always returns a
> vector that is the same size as the vector that came in.
>
> So for your example, you could do something like this pseudocode here:
>
> mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
> keytype="PROBEID")
>
> And mapIds will follow a rule specified by the default value for the
> multiVals argument so that you can get back your results in a 1:1 manner.
> And if you don't like any of the options available for the multiVals
> argument, you can make your own function and pass it in.
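
To make the 1:1 behaviour concrete, here is a toy base-R sketch of the
idea. It does not call AnnotationDbi: the table tab and the helper
toyMapIds are invented for illustration, and the real mapIds() may differ
in details such as how unmatched keys are reported.

```r
# Toy annotation table: probe "p2" maps to two Entrez Gene IDs
tab <- data.frame(PROBEID = c("p1", "p2", "p2", "p3"),
                  ENTREZID = c("10", "20", "21", "30"),
                  stringsAsFactors = FALSE)

# Hypothetical stand-in for a 1:1 lookup: one result per input key,
# with 'pick' deciding what to return when a key has several matches.
toyMapIds <- function(tab, keys, column, keytype,
                      pick = function(x) x[1]) {
  # Group the requested column by key
  hits <- split(tab[[column]], tab[[keytype]])
  vapply(keys, function(k) {
    vals <- hits[[k]]
    # Keys with no match become NA so the output length never changes
    if (is.null(vals)) NA_character_ else pick(vals)
  }, character(1))
}

keys <- c("p3", "p2", "p1", "p9")

# Default rule: keep the first match for each key
first <- toyMapIds(tab, keys, column = "ENTREZID", keytype = "PROBEID")

# Custom rule: collapse all matches for a key into one string
collapsed <- toyMapIds(tab, keys, column = "ENTREZID", keytype = "PROBEID",
                       pick = function(x) paste(x, collapse = ";"))
```

Either way the result has one element per input key, in the input order,
which is the property that makes it safe to attach to the rows of an
existing object.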
>
>
> Anyhow, please continue to let us know what you think.
>
>
>  Marc
>
> On 06/04/2015 10:50 AM, James W. MacDonald wrote:
>
>> In the last release, the warning message from select() telling people that
>> their results include one-to-many mappings was removed. While some may find
>> this warning annoying, I think silently returning something unexpected to
>> our users is dangerous.
>>
>> In other words, for me it is a common practice to do something like this:
>>
>> fit <- lmFit(eset, design)
>> fit2 <- eBayes(fit)
>> gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
>> gns <- gns[!duplicated(gns[,1]),]
>> fit2$genes <- gns
>>
>> I add in the step where dups are removed because I already know they are
>> there. But a naive user might instead do
>>
>> fit2$genes <- select(<chippackage>, featureNames(eset),
>> c("ENTREZID","SYMBOL"))
>>
>> Which will work just fine, but then all the annotation (except for the
>> first few lines) will now be completely incorrect, and there wasn't a
>> warning to let the end user know that they may have made a mistake.
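
A toy base-R illustration of that misalignment (no Bioconductor packages
involved; the probe IDs and the table lookup are made up). Once one probe
returns two rows, every annotation row after it no longer lines up with
the corresponding feature.

```r
# Features in the order they appear in the expression data
features <- c("p1", "p2", "p3", "p4")

# Toy stand-in for what a select()-style lookup returns:
# one row per match, so "p2" contributes two rows
lookup <- data.frame(PROBEID = c("p1", "p2", "p2", "p3", "p4"),
                     SYMBOL  = c("A", "B", "B2", "C", "D"),
                     stringsAsFactors = FALSE)

# Naive use: read the annotation rows positionally, one per feature
naive <- lookup[seq_along(features), ]
misaligned <- naive$PROBEID != features   # FALSE FALSE TRUE TRUE

# Dropping repeated keys first restores one row per feature
dedup <- lookup[!duplicated(lookup$PROBEID), ]
aligned <- identical(dedup$PROBEID, features)
```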
>>
>> lmFit() will parse the featureData slot of an ExpressionSet and use those
>> data for annotation, so that gives some hypothetical protections, for those
>> who first put their annotation data into their ExpressionSet. However,
>> ?eSet says:
>>
>>   ‘featureData’: Contains variables describing features (i.e., rows
>>            in ‘assayData’) unique to this experiment. Use the
>>            ‘annotation’ slot to efficiently reference feature data
>>            common to the annotation package used in the experiment.
>>            Class: ‘AnnotatedDataFrame-class’
>>
>> Which to me indicates that the featureData slot isn't really intended to
>> contain annotation data, but instead some unique information that pertains
>> to a given experiment. But maybe I misunderstand.
>>
>> Is the featureData slot actually intended for annotation data? If not, what
>> is the intended pipeline for annotating data in an ExpressionSet? Am I
>> alone in being concerned about this?
>>
>> Best,
>>
>> Jim
>>
>>
>>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
