[Bioc-devel] Behavior of select function in AnnotationDbi

Sat Nov 21 17:28:15 CET 2015

Hi Hervé,

I think you make a valid point, but I am not sure it conforms to either the
historical expectations for the annotation packages, nor the expected use
case. In other words, the annotation packages have, since the beginning,
returned something that was as close to parallel to the input as possible.
This started with the env-based packages, where e.g., mget() returned
things parallel to the input, rather than randomly, as the SQL model would
have done. This was actually true for biomaRt back in the day when you
could use either the MySQL or CURL interfaces. The SQL interface returned
something parallel to input, and it was a bit of a change when Steffen went
to just using the CURL interface, and people had to learn to add the query
key into the return in order to re-sort.

So from a historical perspective it would be a real change to go from
returning something as close to parallel to the input to a SQL type model.
I agree that makes sense for those who are versed in SQL, but if we were
really trying to help people familiar with SQL, we would just be giving
them the RSQLite DB and telling them to have at it, no? The whole rationale
behind wrapping the DB in a package is to shield the end user from having
to understand SQL, which includes knowing that SQL doesn't enforce the
order of the returned data.

>From a use case perspective, the only use case that I am familiar with is
the one where people have a rectangular array of data, where the rows are
genes or whatnot, and they want to append additional columns to the array,
where the added columns conform to the existing array. This is not always
possible to do in the case of 1:many mappings, obviously, but that is what
mapIds() is for - there is an implicit contract there that you can input a
set of keys, and get out an ordered result that you can then tack onto the
side of your array of data and go forward. I think that is a really useful
thing to be able to do. It is obviously quite easy to take two matrices
that are ordered differently, reorder one, and then bind together. But if
you don't have to do that, then it's an improvement, and quite helpful for
naive users who might not have considered that the returned result might
not actually be parallel to their input.

And being the one who convinced Marc to put the warnings back in, I will
have to respectively disagree about whether or not it is useful to emit the
warnings. Well, I agree it isn't useful at all for those of us who already
know that the results might not be parallel. But that isn't the reason for
a warning is it? To warn those who already understand what is going on? My
argument is (and was) that a warning is intended to tell people that
something that they *might not have expected* just happened. I don't think
it is unreasonable to think that a significant proportion of naive end
users would get tripped up by that if the warning weren't there. They still
might be, but at least we gave them a chance.

Anyway, it took like three lines of code to make it work consistently for
many:many mappings, so that's what I did.

Best,

Jim

On Fri, Nov 20, 2015 at 6:32 PM, Hervé Pagès <hpages at fredhutch.org> wrote:

> On 11/20/2015 03:21 PM, Hervé Pagès wrote:
>
>> Hi Jim,
>>
>> I think we should choose the biomaRt model, that is, duplicated are
>> allowed but silently ignored.
>>
>> Note that this is also the SQL model. When you do
>>
>>    SELECT * FROM ... WHERE key IN c('key1', 'key2', ...)
>>
>                                     ^
> I meant:
>
>      SELECT * FROM ... WHERE key IN ('key1', 'key2', ...)
>
> No c() (too much R lately...)
>
> H.
>
>
>
>> duplicated keys don't generate duplicates in the output.
>>
>> Also note that, like SELECT, even if the keys supplied to
>> biomaRt::getBM() (via the 'values' arg) don't contain duplicates
>> and if all the mappings are 1-to-1, biomaRt::getBM() is not
>> guarantee to preserve order.
>>
>> Generally speaking having duplicates in the input produce duplicates
>> in the output is useful in vectorized operations when the output
>> is expected to be parallel to the input. Vectorized operations also
>> need to propagate NAs and to preserve order. However, like SELECT
>> and biomaRt::getBM(), select() cannot produce an output that is
>> parallel to the input *in general*.
>>
>> It seems that the current philosophy for select() is to emit a note
>> or a warning every time the output is not parallel to the input.
>> Personally I find this too noisy and not that useful.
>>
>> Thanks,
>> H.
>>
>>
>> On 11/20/2015 02:30 PM, James W. MacDonald wrote:
>>
>>> There is an inconsistency in how select() works in AnnotationDbi when a
>>> user passes in duplicated keys to be mapped, depending on if the
>>> mapping is
>>> 1:1 or 1:many. It's easiest to show using an example.
>>>
>>> select(org.Hs.eg.db, rep("1", 3), "SYMBOL")
>>>>
>>> 'select()' returned many:1 mapping between keys and columns
>>>    ENTREZID SYMBOL
>>> 1        1   A1BG
>>> 2        1   A1BG
>>> 3        1   A1BG
>>>
>>> select(org.Hs.eg.db, rep("1", 3), "GO")
>>>>
>>> 'select()' returned many:many mapping between keys and columns
>>>    ENTREZID         GO EVIDENCE ONTOLOGY
>>> 1        1 GO:0003674       ND       MF
>>> 2        1 GO:0003674       ND       MF
>>> 3        1 GO:0003674       ND       MF
>>>
>>> This is obviously a bug. A single query for that ID results in this:
>>>
>>> select(org.Hs.eg.db, "1", "GO")
>>>>
>>> 'select()' returned 1:many mapping between keys and columns
>>>    ENTREZID         GO EVIDENCE ONTOLOGY
>>> 1        1 GO:0003674       ND       MF
>>> 2        1 GO:0005576      IDA       CC
>>> 3        1 GO:0005615      IDA       CC
>>> 4        1 GO:0008150       ND       BP
>>> 5        1 GO:0070062      IDA       CC
>>> 6        1 GO:0072562      IDA       CC
>>>
>>> So the returned results are completely borked.
>>>
>>> However, the question I have is what should be returned? To be consistent
>>> with the first example, it should be the output expected for a single
>>> key,
>>> repeated three times, which I have patched AnnotationDbi to do:
>>>
>>> select(org.Hs.eg.db, rep("1", 3), "GO")
>>>>
>>> 'select()' returned many:many mapping between keys and columns
>>>     ENTREZID         GO EVIDENCE ONTOLOGY
>>> 1         1 GO:0003674       ND       MF
>>> 2         1 GO:0005576      IDA       CC
>>> 3         1 GO:0005615      IDA       CC
>>> 4         1 GO:0008150       ND       BP
>>> 5         1 GO:0070062      IDA       CC
>>> 6         1 GO:0072562      IDA       CC
>>> 7         1 GO:0003674       ND       MF
>>> 8         1 GO:0005576      IDA       CC
>>> 9         1 GO:0005615      IDA       CC
>>> 10        1 GO:0008150       ND       BP
>>> 11        1 GO:0070062      IDA       CC
>>> 12        1 GO:0072562      IDA       CC
>>> 13        1 GO:0003674       ND       MF
>>> 14        1 GO:0005576      IDA       CC
>>> 15        1 GO:0005615      IDA       CC
>>> 16        1 GO:0008150       ND       BP
>>> 17        1 GO:0070062      IDA       CC
>>> 18        1 GO:0072562      IDA       CC
>>>
>>> So, two questions.
>>>
>>>
>>>     1. Should duplicate keys be allowed, or should duplicates be removed
>>>     before querying the database, preferably with a message saying
>>> that dups
>>>     were removed?
>>>     2. If the answer to #1 is yes, then to be consistent, I will just
>>> commit
>>>     the patch I have made to both devel and release.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

	[[alternative HTML version deleted]]