[Bioc-devel] Behavior of select function in AnnotationDbi

Sat Nov 21 00:32:40 CET 2015

On 11/20/2015 03:21 PM, Hervé Pagès wrote:
> Hi Jim,
>
> I think we should choose the biomaRt model, that is, duplicates are
> allowed but silently ignored.
>
> Note that this is also the SQL model. When you do
>
>    SELECT * FROM ... WHERE key IN c('key1', 'key2', ...)
                                     ^
I meant:

      SELECT * FROM ... WHERE key IN ('key1', 'key2', ...)

No c() (too much R lately...)

H.

>
> duplicated keys don't generate duplicates in the output.
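To make the IN semantics concrete, here is a minimal base R sketch. The
table 'tbl' and its keys are made up for illustration, and '%in%' only
stands in for what a database does with WHERE key IN (...); it is not
how select() or getBM() is implemented.

```r
# Toy lookup table standing in for a database table (made-up data).
tbl <- data.frame(key   = c("key1", "key2", "key3"),
                  value = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# User-supplied keys containing a duplicate, in non-table order.
keys <- c("key2", "key1", "key2")

# Membership filter, the R analogue of WHERE key IN (...).
# Each table row is tested once, so repeating a key changes nothing,
# and rows come back in table order rather than in the order of 'keys'.
res <- tbl[tbl$key %in% keys, ]
print(res)
```

The result has two rows, not three, and they follow the order of the
table rather than the order of the keys.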
>
> Also note that, like SELECT, even if the keys supplied to
> biomaRt::getBM() (via the 'values' arg) don't contain duplicates
> and if all the mappings are 1-to-1, biomaRt::getBM() is not
> guaranteed to preserve order.
>
> Generally speaking having duplicates in the input produce duplicates
> in the output is useful in vectorized operations when the output
> is expected to be parallel to the input. Vectorized operations also
> need to propagate NAs and to preserve order. However, like SELECT
> and biomaRt::getBM(), select() cannot produce an output that is
> parallel to the input *in general*.
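As a rough illustration of what "parallel to the input" means, here is a
base R sketch using match(). The tables and keys are invented for the
example; this is not what select() does internally.

```r
# Toy 1:1 lookup table (made-up data).
tbl <- data.frame(key   = c("key1", "key2", "key3"),
                  value = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Input with a duplicate, an NA, and a key absent from the table.
keys <- c("key2", NA, "key1", "key2", "nokey")

# match() returns one position per input element, so the result is
# parallel to 'keys': same length, same order, NA where nothing matches.
vals <- tbl$value[match(keys, tbl$key)]
print(vals)

# With a 1:many table, one value per key cannot hold all the matches:
# match() keeps only the first hit for "key1" and drops the second.
tbl2 <- data.frame(key   = c("key1", "key1", "key2"),
                   value = c("a", "d", "b"),
                   stringsAsFactors = FALSE)
first_only <- tbl2$value[match("key1", tbl2$key)]
print(first_only)
```

The first lookup stays parallel to the input. The second shows why that
cannot hold in general: once a key maps to several rows, a parallel
result has to drop matches or change shape.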
>
> It seems that the current philosophy for select() is to emit a note
> or a warning every time the output is not parallel to the input.
> Personally I find this too noisy and not that useful.
>
> Thanks,
> H.
>
>
> On 11/20/2015 02:30 PM, James W. MacDonald wrote:
>> There is an inconsistency in how select() works in AnnotationDbi when a
>> user passes in duplicated keys to be mapped, depending on if the
>> mapping is 1:1 or 1:many. It's easiest to show using an example.
>>
>>> select(org.Hs.eg.db, rep("1", 3), "SYMBOL")
>> 'select()' returned many:1 mapping between keys and columns
>>    ENTREZID SYMBOL
>> 1        1   A1BG
>> 2        1   A1BG
>> 3        1   A1BG
>>
>>> select(org.Hs.eg.db, rep("1", 3), "GO")
>> 'select()' returned many:many mapping between keys and columns
>>    ENTREZID         GO EVIDENCE ONTOLOGY
>> 1        1 GO:0003674       ND       MF
>> 2        1 GO:0003674       ND       MF
>> 3        1 GO:0003674       ND       MF
>>
>> This is obviously a bug. A single query for that ID results in this:
>>
>>> select(org.Hs.eg.db, "1", "GO")
>> 'select()' returned 1:many mapping between keys and columns
>>    ENTREZID         GO EVIDENCE ONTOLOGY
>> 1        1 GO:0003674       ND       MF
>> 2        1 GO:0005576      IDA       CC
>> 3        1 GO:0005615      IDA       CC
>> 4        1 GO:0008150       ND       BP
>> 5        1 GO:0070062      IDA       CC
>> 6        1 GO:0072562      IDA       CC
>>
>> So the returned results are completely borked.
>>
>> However, the question I have is what should be returned? To be consistent
>> with the first example, it should be the output expected for a single
>> key, repeated three times, which I have patched AnnotationDbi to do:
>>
>>> select(org.Hs.eg.db, rep("1", 3), "GO")
>> 'select()' returned many:many mapping between keys and columns
>>     ENTREZID         GO EVIDENCE ONTOLOGY
>> 1         1 GO:0003674       ND       MF
>> 2         1 GO:0005576      IDA       CC
>> 3         1 GO:0005615      IDA       CC
>> 4         1 GO:0008150       ND       BP
>> 5         1 GO:0070062      IDA       CC
>> 6         1 GO:0072562      IDA       CC
>> 7         1 GO:0003674       ND       MF
>> 8         1 GO:0005576      IDA       CC
>> 9         1 GO:0005615      IDA       CC
>> 10        1 GO:0008150       ND       BP
>> 11        1 GO:0070062      IDA       CC
>> 12        1 GO:0072562      IDA       CC
>> 13        1 GO:0003674       ND       MF
>> 14        1 GO:0005576      IDA       CC
>> 15        1 GO:0005615      IDA       CC
>> 16        1 GO:0008150       ND       BP
>> 17        1 GO:0070062      IDA       CC
>> 18        1 GO:0072562      IDA       CC
>>
>> So, two questions.
>>
>>
>>     1. Should duplicate keys be allowed, or should duplicates be removed
>>     before querying the database, preferably with a message saying that
>>     dups were removed?
>>     2. If the answer to #1 is yes, then to be consistent, I will just
>>     commit the patch I have made to both devel and release.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319