[Bioc-devel] Changes in AnnotationDbi

Mon Jun 8 21:17:48 CEST 2015

Thanks Marc!

On Mon, Jun 8, 2015 at 3:12 PM, Marc Carlson <mcarlson at fredhutch.org> wrote:

>  OK Jim,
>
> I will put very simple messages in (one liners) that will simply state
> whether the relationship between keys and the requested columns was 1:1,
> 1:many, many:1, or many:many.   Hopefully this will represent an acceptable
> compromise.
>
>  Marc
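
For what it's worth, here is a minimal sketch of how such a one-liner could be derived from a select()-style result. It uses base R only. The helper name describe_mapping() and the toy data frame are made up for illustration and are not part of AnnotationDbi.

```r
# Hypothetical helper (not part of AnnotationDbi): classify the relationship
# between a key column and a value column of a select()-style data.frame.
describe_mapping <- function(res, keycol, valcol) {
  # Keep only distinct key/value pairs with a non-missing value.
  pairs <- unique(res[!is.na(res[[valcol]]), c(keycol, valcol)])
  # Does any key map to several values, or any value to several keys?
  key_multi <- any(duplicated(pairs[[keycol]]))
  val_multi <- any(duplicated(pairs[[valcol]]))
  left  <- if (val_multi) "many" else "1"
  right <- if (key_multi) "many" else "1"
  paste0(left, ":", right)
}

# Toy result: probe p2 maps to two Entrez IDs; no ID is shared between probes.
res <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("10", "20", "21", "30"),
  stringsAsFactors = FALSE
)
rel <- describe_mapping(res, "PROBEID", "ENTREZID")
message("select() returned ", rel, " mapping between keys and columns")
```

The classification only looks at duplicated keys and duplicated values, so it stays cheap even for large results.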
>
>
>
> On 06/05/2015 08:37 AM, James W. MacDonald wrote:
>
> I agree that a warning is probably not the way to go, as it does imply
> that there might have been something wrong with either the input or output.
> Plus, not everybody understands the distinction between error and warning.
>
>  And having additional documentation can't possibly hurt. But that
> assumes that most/some/all of the end users both peruse and understand the
> documentation, which we all know is not the case. The main issue, for me at
> least, is that a significant proportion of people seem to think there is
> some sort of uniqueness imposed on things like Entrez Gene IDs and Hugo
> symbols, etc. While that is the ultimate goal, we do not have and maybe
> never will achieve unique IDs for each annotatable object.
>
>  I used to work for a PI who was a very smart, well informed statistical
> geneticist who was absolutely shocked when I informed her that a) there are
> SNPs in dbSNP that have more than one RS ID, and that b) there are RS IDs
> in dbSNP that have been assigned to multiple SNPs. She just assumed that
> there was a one-to-one RS ID -> SNP mapping.
>
>  So this is to me the crux of the problem. It is perfectly valid to
> return one-to-many mappings, and that is what should be expected by
> those of us who already understand such things. But for those of us who
> are ignorant of the details, and those who assume uniqueness of IDs, it
> would be really nice if they got a message telling them something like
>
>  *Please note that there are one-to-many mappings between the input and
> output IDs, so the output is longer than your input vector. Please see
> ?select for more detail.*
>
>  And if the message is objectionable to some, you could give the option
> for people to set a global flag to shut it off. Something like
>
>  if(!pleaseMakeItStop)
>   message(<message goes here>)
>
>  and they could set
>
>  pleaseMakeItStop = TRUE in their .Rprofile
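
One way to sketch that flag without a free-floating global variable is R's own options() mechanism. The option name AnnotationDbi.quiet.mappings and the helper note_mapping() below are hypothetical, invented for this illustration; neither is an existing AnnotationDbi setting.

```r
# Hypothetical helper: emit a note unless the user has opted out.
# The option name is invented for this sketch.
note_mapping <- function(msg) {
  if (!isTRUE(getOption("AnnotationDbi.quiet.mappings", FALSE))) {
    message(msg)
  }
  invisible(NULL)
}

note_mapping("'select()' returned 1:many mapping between keys and columns")

# A user who does not want the note could put this in their .Rprofile:
options(AnnotationDbi.quiet.mappings = TRUE)
note_mapping("this note is suppressed")
```

Because the note goes through message() rather than warning(), a single call can also be silenced with suppressMessages().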
>
>  Is that a reasonable compromise?
>
>  Jim
>
>
>
> On Thu, Jun 4, 2015 at 6:06 PM, Marc Carlson <mcarlson at fredhutch.org>
> wrote:
>
>> Hi Jim,
>>
>> I do agree that the warning was protective for that (this is why I put it
>> there).
>>
>> But it was also annoying for many and a source of some confusion because
>> when people see a warning() they think that something has gone wrong with
>> the code that was just run.  And in this case the select method was
>> actually doing exactly what it was supposed to be doing.  What it was
>> actually warning you about was what you did separately in that assignment
>> to fit2...  Which is the step right after the select method already did
>> its work.  And I can understand why that seems a little bit confusing
>> since you are basically telling someone to be careful with the data you
>> just gave them.
>>
>> Now I could replace it with a message() I guess, but in cases like this
>> where the warning is about something that happens outside of the function
>> you are calling, shouldn't that probably be handled by documentation?  Or
>> at least, that is the argument that finally persuaded me to remove it.
>> That and the fact that almost every call to select() ended up accompanied
>> by the warning you mentioned, because it turns out that perfect 1:1
>> relationships are pretty rare for annotation data.  Very often, you are
>> going to get back multiple results.
>>
>> But I didn't just remove the warning, I also supplied an alternative for
>> people who have a real need for consistent 1:1 mapping.
>>
>> The mapIds() method takes most of the same arguments as select, except
>> that unlike select(), it only looks up one column and it always returns a
>> vector that is the same size as the vector that came in.
>>
>> So for your example, you could do something like this pseudocode here:
>>
>> mapIds(<chippackage>, featureNames(eset), column="ENTREZID",
>> keytype="PROBEID")
>>
>> And mapIds will follow a rule specified by the default value for the
>> multiVals argument so that you can get back your results in a 1:1 manner.
>> And if you don't like any of the options available for the multiVals
>> argument, you can make your own function and pass it in.
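
To make the idea of a collapsing rule concrete, here is a minimal base R sketch of two such rules: keep the first hit, or paste all hits together. It does not call mapIds() itself; the toy table stands in for the annotation lookup, and the ";" separator is my own choice for the example.

```r
# Toy select()-style result: probe p2 maps to two Entrez IDs.
res <- data.frame(
  PROBEID  = c("p1", "p2", "p2", "p3"),
  ENTREZID = c("10", "20", "21", "30"),
  stringsAsFactors = FALSE
)
keys <- c("p1", "p2", "p3", "p4")   # p4 has no annotation at all

# "First match" rule: match() returns the first hit per key, NA for misses.
first <- setNames(res$ENTREZID[match(keys, res$PROBEID)], keys)

# Custom rule: collapse every hit for a key into one string.
hits <- split(res$ENTREZID, factor(res$PROBEID, levels = keys))
collapsed <- vapply(hits, function(x) {
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}, character(1))

# Either way, the result has exactly one element per input key.
stopifnot(length(first) == length(keys), length(collapsed) == length(keys))
print(first)
print(collapsed)
```

In both cases the output is a named vector in the same order as the input keys, which is what makes a direct assignment back onto the original object safe.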
>>
>>
>> Anyhow, please continue to let us know what you think.
>>
>>
>>  Marc
>>
>>
>>
>>
>>
>>
>>
>> On 06/04/2015 10:50 AM, James W. MacDonald wrote:
>>
>>> In the last release, the warning message from select() telling people that
>>> their results include one-to-many mappings was removed. While some may find
>>> this warning annoying, I think silently returning something unexpected to
>>> our users is dangerous.
>>>
>>> In other words, for me it is a common practice to do something like this:
>>>
>>> fit <- lmFit(eset, design)
>>> fit2 <- eBayes(fit)
>>> gns <- select(<chippackage>, featureNames(eset), c("ENTREZID","SYMBOL"))
>>> gns <- gns[!duplicated(gns[,1]),]
>>> fit2$genes <- gns
>>>
>>> I add in the step where dups are removed because I already know they are
>>> there. But a naive user might instead do
>>>
>>> fit2$genes <- select(<chippackage>, featureNames(eset),
>>> c("ENTREZID","SYMBOL"))
>>>
>>> Which will work just fine, but then all the annotation (except for the
>>> first few lines) will now be completely incorrect, and there wasn't a
>>> warning to let the end user know that they may have made a mistake.
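
The misalignment can be reproduced without any annotation package. A minimal sketch in base R, with a made-up annotation table standing in for the select() output:

```r
# Toy stand-ins: four probes on the array, one of which (p2) maps to two genes.
probes <- c("p1", "p2", "p3", "p4")
ann <- data.frame(
  PROBEID = c("p1", "p2", "p2", "p3", "p4"),
  SYMBOL  = c("A", "B1", "B2", "C", "D"),
  stringsAsFactors = FALSE
)

# One extra row, so rows after the duplicate no longer line up with the probes.
same_length <- nrow(ann) == length(probes)
aligned_naive <- ann$PROBEID[seq_along(probes)] == probes

# Dropping duplicated keys restores one row per probe, in probe order.
dedup <- ann[!duplicated(ann$PROBEID), ]
aligned_dedup <- dedup$PROBEID == probes

print(aligned_naive)
print(aligned_dedup)
```

Everything up to the first duplicated probe still lines up, which is why the problem is easy to miss when only the top of the table is inspected.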
>>>
>>> lmFit() will parse the featureData slot of an ExpressionSet and use those
>>> data for annotation, so that gives some hypothetical protections, for those
>>> who first put their annotation data into their ExpressionSet. However,
>>> ?eSet says:
>>>
>>>   ‘featureData’: Contains variables describing features (i.e., rows
>>>            in ‘assayData’) unique to this experiment. Use the
>>>            ‘annotation’ slot to efficiently reference feature data
>>>            common to the annotation package used in the experiment.
>>>            Class: ‘AnnotatedDataFrame-class’
>>>
>>> Which to me indicates that the featureData slot isn't really intended to
>>> contain annotation data, but instead some unique information that pertains
>>> to a given experiment. But maybe I misunderstand.
>>>
>>> Is the featureData slot actually intended for annotation data? If not, what
>>> is the intended pipeline for annotating data in an ExpressionSet? Am I
>>> alone in being concerned about this?
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
