[Bioc-devel] Behavior of select function in AnnotationDbi

James W. MacDonald jmacdon at uw.edu
Fri Nov 20 23:30:44 CET 2015


There is an inconsistency in how select() works in AnnotationDbi when a
user passes in duplicated keys to be mapped, depending on whether the
mapping is 1:1 or 1:many. It's easiest to show with an example.

> select(org.Hs.eg.db, rep("1", 3), "SYMBOL")
'select()' returned many:1 mapping between keys and columns
  ENTREZID SYMBOL
1        1   A1BG
2        1   A1BG
3        1   A1BG

> select(org.Hs.eg.db, rep("1", 3), "GO")
'select()' returned many:many mapping between keys and columns
  ENTREZID         GO EVIDENCE ONTOLOGY
1        1 GO:0003674       ND       MF
2        1 GO:0003674       ND       MF
3        1 GO:0003674       ND       MF

The second result is clearly a bug: the first GO row is repeated three
times and the other five GO terms are dropped. A single query for that ID
returns this:

> select(org.Hs.eg.db, "1", "GO")
'select()' returned 1:many mapping between keys and columns
  ENTREZID         GO EVIDENCE ONTOLOGY
1        1 GO:0003674       ND       MF
2        1 GO:0005576      IDA       CC
3        1 GO:0005615      IDA       CC
4        1 GO:0008150       ND       BP
5        1 GO:0070062      IDA       CC
6        1 GO:0072562      IDA       CC

So the returned results are completely borked.

However, the question I have is what should be returned? To be consistent
with the first example, it should be the output expected for a single key,
repeated three times, which I have patched AnnotationDbi to do:

> select(org.Hs.eg.db, rep("1", 3), "GO")
'select()' returned many:many mapping between keys and columns
   ENTREZID         GO EVIDENCE ONTOLOGY
1         1 GO:0003674       ND       MF
2         1 GO:0005576      IDA       CC
3         1 GO:0005615      IDA       CC
4         1 GO:0008150       ND       BP
5         1 GO:0070062      IDA       CC
6         1 GO:0072562      IDA       CC
7         1 GO:0003674       ND       MF
8         1 GO:0005576      IDA       CC
9         1 GO:0005615      IDA       CC
10        1 GO:0008150       ND       BP
11        1 GO:0070062      IDA       CC
12        1 GO:0072562      IDA       CC
13        1 GO:0003674       ND       MF
14        1 GO:0005576      IDA       CC
15        1 GO:0005615      IDA       CC
16        1 GO:0008150       ND       BP
17        1 GO:0070062      IDA       CC
18        1 GO:0072562      IDA       CC
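
For illustration, the expected output above can be built by querying the
unique keys once and then repeating each key's block of rows once per
occurrence in the input. This is only a minimal base-R sketch and not the
actual patch: expand_to_keys() is a hypothetical helper, and the data frame
is a toy stand-in for what select() returns for the single key "1". Unlike
select(), this sketch silently drops keys that have no match.

```r
# Toy stand-in for the result of select() on the unique keys only
# (values copied from the single-key query shown earlier).
res <- data.frame(
  ENTREZID = "1",
  GO = c("GO:0003674", "GO:0005576", "GO:0005615",
         "GO:0008150", "GO:0070062", "GO:0072562"),
  EVIDENCE = c("ND", "IDA", "IDA", "ND", "IDA", "IDA"),
  ONTOLOGY = c("MF", "CC", "CC", "BP", "CC", "CC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: repeat each key's rows once per occurrence of that
# key in the user-supplied vector, preserving the input order of the keys.
expand_to_keys <- function(res, keys, keycol = "ENTREZID") {
  # Row indices of 'res' grouped by key value
  rows_by_key <- split(seq_len(nrow(res)), res[[keycol]])
  # Look up each requested key (duplicates included) and flatten
  idx <- unlist(rows_by_key[keys], use.names = FALSE)
  out <- res[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}

keys <- rep("1", 3)
expanded <- expand_to_keys(res, keys)
nrow(expanded)  # 18: six GO rows repeated for each of the three keys
```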

So, two questions.


   1. Should duplicate keys be allowed, or should duplicates be removed
   before querying the database, preferably with a message saying that
   duplicates were removed?
   2. If duplicate keys should be allowed, then to be consistent, I will
   just commit the patch I have made to both devel and release.
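
The other option in question 1, dropping duplicates with a message before
the query, might look roughly like the following. Again this is a sketch
under assumptions: dedupe_keys() is a hypothetical name, not an existing
AnnotationDbi function, and the wording of the message is made up.

```r
# Hypothetical helper: drop duplicated keys and tell the user about it.
dedupe_keys <- function(keys) {
  dups <- duplicated(keys)
  if (any(dups)) {
    message(sum(dups), " duplicated key(s) removed before querying the database")
  }
  keys[!dups]
}

unique_keys <- dedupe_keys(rep("1", 3))
unique_keys  # "1"
```

The deduplicated vector would then be passed to select() in place of the
original keys, so the result has one block of rows per distinct key.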

Best,

Jim



-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

