[Bioc-devel] Behavior of select function in AnnotationDbi

Fri Nov 20 23:37:42 CET 2015

I think the answer to 1 should be yes, duplicate keys are allowed. For instance, a user might have a vector of ids and a factor that groups the ids somehow (e.g., by experiment), with ids unique within each group but repeated across groups.

So I'm for step #2.
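
To make the grouping use case above concrete, here is a minimal sketch in base R. It does not call AnnotationDbi; the lookup vector `toy_symbols` and the experiment labels are made-up stand-ins for illustration. It only shows why one result row per input key keeps a 1:1 result aligned with a grouping factor.

```r
# Ids measured in two experiments; id "1" appears in both.
ids <- c("1", "2", "1")
experiment <- factor(c("exp1", "exp1", "exp2"))

# Toy 1:1 lookup standing in for an ENTREZID -> SYMBOL query (assumption).
toy_symbols <- c("1" = "A1BG", "2" = "A2M")

# Because duplicates are kept, the result has one row per input id and
# can be bound directly to the grouping factor.
annotated <- data.frame(
  ENTREZID = ids,
  SYMBOL = unname(toy_symbols[ids]),
  experiment = experiment,
  stringsAsFactors = FALSE
)

# Symbols per experiment; this would not line up if "1" had been dropped.
by_experiment <- split(annotated$SYMBOL, annotated$experiment)
```

If duplicates were removed before the lookup, the result would have two rows while `experiment` has three elements, and the user would have to re-expand it by hand.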

Martin
________________________________________
From: Bioc-devel [bioc-devel-bounces at r-project.org] on behalf of James W. MacDonald [jmacdon at uw.edu]
Sent: Friday, November 20, 2015 5:30 PM
To: bioc-devel at r-project.org
Subject: [Bioc-devel] Behavior of select function in AnnotationDbi

There is an inconsistency in how select() works in AnnotationDbi when a
user passes in duplicated keys to be mapped, depending on whether the
mapping is 1:1 or 1:many. It's easiest to show using an example.

> select(org.Hs.eg.db, rep("1", 3), "SYMBOL")
'select()' returned many:1 mapping between keys and columns
  ENTREZID SYMBOL
1        1   A1BG
2        1   A1BG
3        1   A1BG

> select(org.Hs.eg.db, rep("1", 3), "GO")
'select()' returned many:many mapping between keys and columns
  ENTREZID         GO EVIDENCE ONTOLOGY
1        1 GO:0003674       ND       MF
2        1 GO:0003674       ND       MF
3        1 GO:0003674       ND       MF

This is obviously a bug: the first GO term is simply repeated three times. A single query for that ID returns six GO terms:

> select(org.Hs.eg.db, "1", "GO")
'select()' returned 1:many mapping between keys and columns
  ENTREZID         GO EVIDENCE ONTOLOGY
1        1 GO:0003674       ND       MF
2        1 GO:0005576      IDA       CC
3        1 GO:0005615      IDA       CC
4        1 GO:0008150       ND       BP
5        1 GO:0070062      IDA       CC
6        1 GO:0072562      IDA       CC

So the results returned for duplicated keys are completely borked: five of the six GO terms are missing.

However, the question I have is what should be returned? To be consistent
with the first example, it should be the output expected for a single key,
repeated three times, which I have patched AnnotationDbi to do:

> select(org.Hs.eg.db, rep("1", 3), "GO")
'select()' returned many:many mapping between keys and columns
   ENTREZID         GO EVIDENCE ONTOLOGY
1         1 GO:0003674       ND       MF
2         1 GO:0005576      IDA       CC
3         1 GO:0005615      IDA       CC
4         1 GO:0008150       ND       BP
5         1 GO:0070062      IDA       CC
6         1 GO:0072562      IDA       CC
7         1 GO:0003674       ND       MF
8         1 GO:0005576      IDA       CC
9         1 GO:0005615      IDA       CC
10        1 GO:0008150       ND       BP
11        1 GO:0070062      IDA       CC
12        1 GO:0072562      IDA       CC
13        1 GO:0003674       ND       MF
14        1 GO:0005576      IDA       CC
15        1 GO:0005615      IDA       CC
16        1 GO:0008150       ND       BP
17        1 GO:0070062      IDA       CC
18        1 GO:0072562      IDA       CC
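
The semantics of the patched output above can be sketched in base R. This is not the AnnotationDbi patch itself (which is not shown in this thread); `select_with_dups` is a hypothetical helper and `go_map` is a toy table copied from the output above. The idea is simply to return the full 1:many block once per occurrence of a key, in input order.

```r
# Toy stand-in for the ENTREZID -> GO mapping (values from the output above).
go_map <- data.frame(
  ENTREZID = rep("1", 6),
  GO = c("GO:0003674", "GO:0005576", "GO:0005615",
         "GO:0008150", "GO:0070062", "GO:0072562"),
  EVIDENCE = c("ND", "IDA", "IDA", "ND", "IDA", "IDA"),
  ONTOLOGY = c("MF", "CC", "CC", "BP", "CC", "CC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each key, in input order, collect the row
# indices of all its matches, so duplicated keys repeat the whole block.
# Unlike select(), this sketch drops keys with no match instead of
# returning a row of NAs for them.
select_with_dups <- function(map, keys, keycol = "ENTREZID") {
  idx <- unlist(lapply(keys, function(k) which(map[[keycol]] == k)))
  out <- map[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}

res <- select_with_dups(go_map, rep("1", 3))
```

With three copies of the key and six GO terms per key, `res` has 18 rows, matching the patched output.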

So, two questions.

   1. Should duplicate keys be allowed, or should duplicates be removed
   before querying the database, preferably with a message saying that
   duplicates were removed?
   2. If the answer to #1 is yes (duplicate keys are allowed), then to be
   consistent, I will just commit the patch I have made to both devel and
   release.
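
For comparison, the alternative raised in question 1 (dropping duplicates with a message) might look roughly like the following. `dedup_keys` is a hypothetical helper, not an AnnotationDbi function, and the message wording is only an illustration.

```r
# Hypothetical helper: drop repeated keys, keeping first occurrences,
# and tell the user how many were removed.
dedup_keys <- function(keys) {
  dups <- duplicated(keys)
  if (any(dups)) {
    message(sum(dups), " duplicated key(s) removed before querying")
  }
  keys[!dups]
}

unique_keys <- dedup_keys(rep("1", 3))
```

The deduplicated keys would then be passed to the query, at the cost of the result no longer having one block per input key.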

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
