[Bioc-devel] Behavior of select function in AnnotationDbi

Hervé Pagès hpages at fredhutch.org
Sat Nov 21 00:21:04 CET 2015


Hi Jim,

I think we should choose the biomaRt model, that is, duplicates are
allowed but silently ignored.

Note that this is also the SQL model. When you do

   SELECT * FROM ... WHERE key IN ('key1', 'key2', ...)

duplicated keys don't generate duplicates in the output.

Also note that, like SELECT, biomaRt::getBM() is not guaranteed to
preserve order, even if the keys supplied to it (via the 'values'
arg) don't contain duplicates and all the mappings are 1-to-1.
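
To make the two points above concrete, here is a minimal base R
sketch. It only mimics the SELECT ... IN semantics with %in% on a toy
data frame; the table 'tbl' and its contents are made up for
illustration, and no database or biomaRt call is involved.

```r
# Toy lookup table standing in for a database table (hypothetical data).
tbl <- data.frame(key   = c("a", "b", "c"),
                  value = c("A", "B", "C"),
                  stringsAsFactors = FALSE)

# Keys containing a duplicate, in a different order than the table.
keys <- c("c", "a", "c")

# Rough analogue of: SELECT * FROM tbl WHERE key IN ('c', 'a', 'c')
res <- tbl[tbl$key %in% keys, ]

nrow(res)   # 2: the duplicated "c" does not produce a duplicated row
res$key     # "a" "c": rows come back in table order, not in 'keys' order
```

The filter only asks "is this row's key in the set?", so neither the
multiplicity nor the order of the supplied keys can influence the
result.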

Generally speaking, having duplicates in the input produce duplicates
in the output is useful in vectorized operations when the output
is expected to be parallel to the input. Vectorized operations also
need to propagate NAs and to preserve order. However, like SELECT
and biomaRt::getBM(), select() cannot produce an output that is
parallel to the input *in general*.
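
A small base R sketch of that distinction, using made-up toy tables
(the names 'one2one' and 'one2many' are hypothetical and have nothing
to do with AnnotationDbi internals): with a 1-to-1 mapping, match()
gives a result that is parallel to the input, but with a 1-to-many
mapping a flat table cannot be parallel and only a list-like result
can.

```r
keys <- c("c", "zz", "a", "c")   # duplicated "c", unknown key "zz"

# Hypothetical 1-to-1 table: a match()-based lookup is parallel to 'keys'.
one2one <- data.frame(key   = c("a", "b", "c"),
                      value = c("A", "B", "C"),
                      stringsAsFactors = FALSE)
parallel <- one2one$value[match(keys, one2one$key)]
parallel   # "C" NA "A" "C": duplicates kept, NA propagated, order preserved

# Hypothetical 1-to-many table: the flat result has one row per mapping,
# so its number of rows no longer matches length(keys).
one2many <- data.frame(key   = c("a", "a", "c"),
                       value = c("A1", "A2", "C1"),
                       stringsAsFactors = FALSE)
hits <- one2many[one2many$key %in% keys, ]
nrow(hits)   # 3, whereas length(keys) is 4

# A result parallel to 'keys' has to be list-like: one element per key.
lst <- lapply(keys, function(k) one2many$value[one2many$key == k])
lengths(lst)   # 1 0 2 1
```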

It seems that the current philosophy for select() is to emit a note
or a warning every time the output is not parallel to the input.
Personally I find this too noisy and not that useful.

Thanks,
H.


On 11/20/2015 02:30 PM, James W. MacDonald wrote:
> There is an inconsistency in how select() works in AnnotationDbi when a
> user passes in duplicated keys to be mapped, depending on if the mapping is
> 1:1 or 1:many. It's easiest to show using an example.
>
>> select(org.Hs.eg.db, rep("1", 3), "SYMBOL")
> 'select()' returned many:1 mapping between keys and columns
>    ENTREZID SYMBOL
> 1        1   A1BG
> 2        1   A1BG
> 3        1   A1BG
>
>> select(org.Hs.eg.db, rep("1", 3), "GO")
> 'select()' returned many:many mapping between keys and columns
>    ENTREZID         GO EVIDENCE ONTOLOGY
> 1        1 GO:0003674       ND       MF
> 2        1 GO:0003674       ND       MF
> 3        1 GO:0003674       ND       MF
>
> This is obviously a bug. A single query for that ID results in this:
>
>> select(org.Hs.eg.db, "1", "GO")
> 'select()' returned 1:many mapping between keys and columns
>    ENTREZID         GO EVIDENCE ONTOLOGY
> 1        1 GO:0003674       ND       MF
> 2        1 GO:0005576      IDA       CC
> 3        1 GO:0005615      IDA       CC
> 4        1 GO:0008150       ND       BP
> 5        1 GO:0070062      IDA       CC
> 6        1 GO:0072562      IDA       CC
>
> So the returned results are completely borked.
>
> However, the question I have is what should be returned? To be consistent
> with the first example, it should be the output expected for a single key,
> repeated three times, which I have patched AnnotationDbi to do:
>
>> select(org.Hs.eg.db, rep("1", 3), "GO")
> 'select()' returned many:many mapping between keys and columns
>     ENTREZID         GO EVIDENCE ONTOLOGY
> 1         1 GO:0003674       ND       MF
> 2         1 GO:0005576      IDA       CC
> 3         1 GO:0005615      IDA       CC
> 4         1 GO:0008150       ND       BP
> 5         1 GO:0070062      IDA       CC
> 6         1 GO:0072562      IDA       CC
> 7         1 GO:0003674       ND       MF
> 8         1 GO:0005576      IDA       CC
> 9         1 GO:0005615      IDA       CC
> 10        1 GO:0008150       ND       BP
> 11        1 GO:0070062      IDA       CC
> 12        1 GO:0072562      IDA       CC
> 13        1 GO:0003674       ND       MF
> 14        1 GO:0005576      IDA       CC
> 15        1 GO:0005615      IDA       CC
> 16        1 GO:0008150       ND       BP
> 17        1 GO:0070062      IDA       CC
> 18        1 GO:0072562      IDA       CC
>
> So, two questions.
>
>
>     1. Should duplicate keys be allowed, or should duplicates be removed
>     before querying the database, preferably with a message saying that dups
>     were removed?
>     2. If the answer to #1 is yes, then to be consistent, I will just commit
>     the patch I have made to both devel and release.
>
> Best,
>
> Jim
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


