[Bioc-devel] Behavior of select function in AnnotationDbi
Hervé Pagès
hpages at fredhutch.org
Sat Nov 21 00:32:40 CET 2015
On 11/20/2015 03:21 PM, Hervé Pagès wrote:
> Hi Jim,
>
> I think we should choose the biomaRt model, that is, duplicates are
> allowed but silently ignored.
>
> Note that this is also the SQL model. When you do
>
> SELECT * FROM ... WHERE key IN c('key1', 'key2', ...)
                                 ^
I meant:
SELECT * FROM ... WHERE key IN ('key1', 'key2', ...)
No c() (too much R lately...)
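
For R readers, a rough base-R analogue of that IN filter. This is only a sketch: the toy table `tbl` and its values are made up for illustration, and `%in%` stands in for what a SQL engine would do.

```r
# Toy lookup table (made-up values), standing in for a database table.
tbl <- data.frame(key   = c("key1", "key2", "key3"),
                  value = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# Keys supplied by the user: "key1" is duplicated and the keys are
# not in table order.
keys <- c("key2", "key1", "key1")

# Membership filter, analogous to WHERE key IN ('key2', 'key1', 'key1').
# Each table row is tested once, so duplicated keys add no rows.
res <- tbl[tbl$key %in% keys, ]
print(res)
```

In this sketch the two matching rows come back once each, in table order (key1, then key2), not in the order the keys were given.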
H.
>
> duplicated keys don't generate duplicates in the output.
>
> Also note that, like SELECT, biomaRt::getBM() is not guaranteed to
> preserve order, even if the keys supplied (via the 'values' arg)
> don't contain duplicates and all the mappings are 1-to-1.
>
> Generally speaking, having duplicates in the input produce duplicates
> in the output is useful in vectorized operations when the output
> is expected to be parallel to the input. Vectorized operations also
> need to propagate NAs and to preserve order. However, like SELECT
> and biomaRt::getBM(), select() cannot produce an output that is
> parallel to the input *in general*.
>
> It seems that the current philosophy for select() is to emit a note
> or a warning every time the output is not parallel to the input.
> Personally I find this too noisy and not that useful.
>
> Thanks,
> H.
>
>
> On 11/20/2015 02:30 PM, James W. MacDonald wrote:
>> There is an inconsistency in how select() works in AnnotationDbi when a
>> user passes in duplicated keys to be mapped, depending on if the mapping
>> is 1:1 or 1:many. It's easiest to show using an example.
>>
>>> select(org.Hs.eg.db, rep("1", 3), "SYMBOL")
>> 'select()' returned many:1 mapping between keys and columns
>>   ENTREZID SYMBOL
>> 1        1   A1BG
>> 2        1   A1BG
>> 3        1   A1BG
>>
>>> select(org.Hs.eg.db, rep("1", 3), "GO")
>> 'select()' returned many:many mapping between keys and columns
>>   ENTREZID         GO EVIDENCE ONTOLOGY
>> 1        1 GO:0003674       ND       MF
>> 2        1 GO:0003674       ND       MF
>> 3        1 GO:0003674       ND       MF
>>
>> This is obviously a bug. A single query for that ID results in this:
>>
>>> select(org.Hs.eg.db, "1", "GO")
>> 'select()' returned 1:many mapping between keys and columns
>>   ENTREZID         GO EVIDENCE ONTOLOGY
>> 1        1 GO:0003674       ND       MF
>> 2        1 GO:0005576      IDA       CC
>> 3        1 GO:0005615      IDA       CC
>> 4        1 GO:0008150       ND       BP
>> 5        1 GO:0070062      IDA       CC
>> 6        1 GO:0072562      IDA       CC
>>
>> So the returned results are completely borked.
>>
>> However, the question I have is what should be returned? To be consistent
>> with the first example, it should be the output expected for a single key,
>> repeated three times, which I have patched AnnotationDbi to do:
>>
>>> select(org.Hs.eg.db, rep("1", 3), "GO")
>> 'select()' returned many:many mapping between keys and columns
>>    ENTREZID         GO EVIDENCE ONTOLOGY
>> 1         1 GO:0003674       ND       MF
>> 2         1 GO:0005576      IDA       CC
>> 3         1 GO:0005615      IDA       CC
>> 4         1 GO:0008150       ND       BP
>> 5         1 GO:0070062      IDA       CC
>> 6         1 GO:0072562      IDA       CC
>> 7         1 GO:0003674       ND       MF
>> 8         1 GO:0005576      IDA       CC
>> 9         1 GO:0005615      IDA       CC
>> 10        1 GO:0008150       ND       BP
>> 11        1 GO:0070062      IDA       CC
>> 12        1 GO:0072562      IDA       CC
>> 13        1 GO:0003674       ND       MF
>> 14        1 GO:0005576      IDA       CC
>> 15        1 GO:0005615      IDA       CC
>> 16        1 GO:0008150       ND       BP
>> 17        1 GO:0070062      IDA       CC
>> 18        1 GO:0072562      IDA       CC
>>
>> So, two questions.
>>
>>
>>    1. Should duplicate keys be allowed, or should duplicates be removed
>>       before querying the database, preferably with a message saying
>>       that dups were removed?
>>    2. If the answer to #1 is yes, then to be consistent, I will just
>>       commit the patch I have made to both devel and release.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>
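
To make the two options concrete, here is a small base-R sketch on a made-up 1:many table. `lookup_unique()` and `lookup_expanded()` are hypothetical helpers written for this illustration only; they are not AnnotationDbi functions and say nothing about how select() is implemented internally.

```r
# Made-up 1:many mapping: key "1" maps to three terms, key "2" to one.
map <- data.frame(key  = c("1", "1", "1", "2"),
                  term = c("t1", "t2", "t3", "t4"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: duplicates in 'keys' are silently ignored
# (the biomaRt/SQL-like model).
lookup_unique <- function(map, keys) {
  out <- map[map$key %in% unique(keys), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Hypothetical helper: every occurrence of a key contributes its full
# set of rows, in the order the keys were supplied.
lookup_expanded <- function(map, keys) {
  idx <- unlist(lapply(keys, function(k) which(map$key == k)))
  out <- map[idx, , drop = FALSE]
  rownames(out) <- NULL
  out
}

keys <- rep("1", 3)
nrow(lookup_unique(map, keys))    # 3 rows: one copy of the mapping
nrow(lookup_expanded(map, keys))  # 9 rows: three copies of the mapping
```

With a 1:many mapping, neither result is parallel to the three input keys, which is the point made above about select() not being a vectorized lookup in general.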
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319