[Bioc-devel] Behavior of select function in AnnotationDbi

Mon Nov 23 19:26:22 CET 2015

On Mon, Nov 23, 2015 at 1:11 PM, Morgan, Martin <
Martin.Morgan at roswellpark.org> wrote:

> I would rather mapIds() continue to operate on a single column, return a
> named character vector, and by default provide a 1:1 relationship between
> input and output. multiVals=CharacterList does actually return a 1:many
> mapping in a way that retains parallel structure (I guess, maybe module
> limitations noted below and others).
>
> Personally, I do not favor the message() associated with select();
> select() is behaving as documented. A warning seems unnecessary --
> "warning! I'm doing what I'm supposed to do!". If there is a message of
> some sort, I'd rather it was consistently presenting information. I also
> like message() _because_ it's presented when the issue arises, rather than
> out of context, maybe for the same reasons I find top-posting in email
> responses [sic] so irritating. If the documented behavior of select() is
> fundamentally unsatisfactory, then yes we should change the documented
> behavior rather than emitting warnings.
>
> select() could be updated to accept the equivalent of the multiVals
> argument. select() could also be updated to always return a DataFrame or to
> return a DataFrame when multiVals is specified, though one does like a
> function to return a consistent data type. The original choice to use
> data.frame was from informal observation that the classes enabled by
> DataFrame (e.g., CharacterList) pose problems for less-experienced [by this
> I mean a broad swath of Bioc] users, and the annotation resources should be
> as accessible as possible.
>

Understood.  However, I would be in favor of options or new query methods
that would return the
DataFrame and List structures.  We need more mileage on them and they
provide IMHO rational
solutions to the mapping problems.

>
> I see no benefit in NOT ordering the return values in the same order as
> the input keys. Likewise, I see no value in dropping support for duplicate
> or NA keys.
>
> I view both of the following treatments of NA in mapIds() as bugs; they
> should return named character vectors mapping <NA> to NA.
>
> > mapIds(org.Hs.eg.db, c("BRCA1", NA), "ENTREZID", "SYMBOL")
> 'select()' returned 1:1 mapping between keys and columns
> $BRCA1
> [1] "672"
>
> $<NA>
> NULL
> > mapIds(org.Hs.eg.db, NA_character_, "ENTREZID", "SYMBOL")
> Error in .testForValidKeys(x, keys, keytype, fks) :
>   None of the keys entered are valid keys for 'SYMBOL'. Please use the
> keys method to see a listing of valid arguments.
>
> Reporting mapping when mapIds() is invoked, with or without the multiVals=
> argument, also seems unnecessary.
>
> > mapIds(org.Hs.eg.db, "BRCA1", "GO", "SYMBOL", multiVals="first")
> 'select()' returned 1:many mapping between keys and columns
>        BRCA1
> "GO:0000151"
>
> Martin
>
> ________________________________________
> From: Bioc-devel [bioc-devel-bounces at r-project.org] on behalf of James W.
> MacDonald [jmacdon at uw.edu]
> Sent: Monday, November 23, 2015 11:16 AM
> To: Hervé Pagès
> Cc: bioc-devel at r-project.org
> Subject: Re: [Bioc-devel] Behavior of select function in AnnotationDbi
>
> Hi Hervé,
>
> Hmmm. Persuasive arguments. I like it!
>
> I'm pretty close to thinking you are right about this. However there is one
> small stumbling block. While select() allows for the look up of multiple
> columns simultaneously, mapIds() does not. But I do like your idea of
> having a function that guarantees parallelism, so let's consider this.
>
> It wouldn't take much to convert mapIds() to accept multiple columns, and
> it would be reasonable to keep the default of using the first of duplicate
> returned values. This makes more sense for some annotations (RefSeq), than
> others (GO), but whatever. In that case it would return a data.frame, just
> like select() does now, but with a guarantee of parallelism with the input.
> That's easy enough to do. It would often not be a good idea for an end user
> to do that, but again we have to let people fail if they refuse to even
> consider what they are doing.
>
> But this brings up a pretty interesting option, for those who don't want to
> get rid of the ambiguity inherent in getting multiple mappings returned, or
> who really should be getting those multiple things back (e.g., GO). It
> might be a smarter play to change the default returned object from a
> vector, where we have naively taken the first item returned, and instead
> return a CharacterList. For multiple input columns, we would then return a
> DataFrame. We still have 1:1 mapping of inputs to rows of the output, but
> now we allow multiple values per column. There are lots of things people
> can do with that while within R, so that by itself is pretty useful, plus
> we haven't done anything that might be unwise.
>
> Does that seem reasonable? It's sort of a big change to the output for both
> select() and mapIds(), so it might be a good idea (as you suggested) to
> have some sort of warning for the next release, telling end users that
> things have changed, and they should take note.
>
> Jim
>
>
>
> On Sun, Nov 22, 2015 at 7:07 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>
> > Hi Jim,
> >
> > I understand the convenience of having an output that is parallel to
> > the input, especially in the context of the primary use case which is
> > mapping/cbinding ids to a rectangular object. However for that use
> > case, "as close to parallel as possible" is not good enough. IMO the
> > user needs a tool that guarantees parallelism.
> >
> > mapIds() seems to be such tool. It seems to produce an output that is
> > always parallel to the input, at least by default (i.e. when the
> > 'multiVals' argument is not specified). BTW this is such an important
> > property that it would deserve to be explicitly stated somewhere in
> > the man page.
> >
> > So the user already has mapIds() for mapping/cbinding ids to a
> > rectangular object.
> >
> > mapIds() builds on top of select() and achieves parallelism by adding a
> > mechanism to resolve ambiguous mappings (via the 'multiVals' argument).
> >
> > In contrast, select() currently doesn't provide such mechanism. Instead
> > it systematically prints a message about the cardinality of the mapping
> > (1:1, 1:many, many:1, or many:many). This is not very useful. What's
> > really important is whether the output is parallel to the input or not.
> > If it is (i.e. if the mapping is 1:1 or many:1), then everything is fine
> > and no message or warning is needed. If it's not (i.e. if the mapping is
> > 1:many or many:many), then it makes sense to warn the user. A warning
> > seems more appropriate than a message for this because it's about a
> > potential gotcha with the output. Also, unlike a message, a warning is
> > delayed and printed at the bottom of the output which makes it less
> > likely to be missed, especially if the output fits more than 1 screen.
> >
> > Anyway, the current select() is in this gray area where it sometimes
> > returns something that I can use for mapping/cbinding ids to my
> > rectangular object, and sometimes returns something that I can't use
> > for that (unless I transform it first). There is no way to predict that.
> > It's only a question of luck. If I'm out of luck, then I use mapIds()
> > (and the warning should tell me that), which is what I should have used
> > in the 1st place. In other words, if I care about parallelism, then
> > select() is the wrong tool.
> >
> > This is why I was suggesting that a simple and straightforward approach
> > would be to just give up on parallelism for select() i.e. make it behave
> > like SQL SELECT and biomaRt::getBM(), stop worrying about the output not
> > being parallel to the input, and make it clear in the man page that the
> > user should not have such expectation (if parallel output is important
> > to him/her, then s/he should use mapIds()). With this approach select()
> > is a little bit lower level than it is right now and mapIds() is the
> > higher level tool that builds on top of it to achieve parallelism. This
> > low-level select() is still extremely useful and adds a lot of value
> > over letting the users write their own SQL queries. It is the same
> > level of usefulness that biomaRt::getBM() adds for querying Ensembl
> > (compared to typing SQL queries in an MySQL client to query the Ensembl
> > MySQL db). With biomaRt::getBM() and with this low-level select(), you
> > don't need to know any SQL. All you need to know is that the output
> > is not guaranteed to be parallel to the input.
> >
> > Another approach is the exact opposite one: make select() reliable for
> > the mapping/cbinding use case i.e. make it produce an output that is
> > always parallel to the input by default. That means it would also need
> > a mechanism to resolve ambiguous mappings like mapIds() does (e.g.
> > via a 'multiVals' argument).
> >
> > Right now select() sits in the middle of these 2 approaches.
> >
> > If select()'s behavior cannot be changed, maybe we can try to make it
> > a little less noisy i.e. have warnings (and not messages) only when
> > the output is not parallel to the input. Also how about having a
> > warning that suggests the use of mapIds()? Something like "Hey, your
> > output is not parallel to your input! Use mapIds() if that matters
> > to you."
> >
> > Thanks,
> > H.
> >
> >
> >
> > On 11/21/2015 08:28 AM, James W. MacDonald wrote:
> >
> >> Hi Hervé,
> >>
> >> I think you make a valid point, but I am not sure it conforms to either
> >> the historical expectations for the annotation packages, nor the
> >> expected use case. In other words, the annotation packages have, since
> >> the beginning, returned something that was as close to parallel to the
> >> input as possible. This started with the env-based packages, where e.g.,
> >> mget() returned things parallel to the input, rather than randomly, as
> >> the SQL model would have done. This was actually true for biomaRt back
> >> in the day when you could use either the MySQL or CURL interfaces. The
> >> SQL interface returned something parallel to input, and it was a bit of
> >> a change when Steffen went to just using the CURL interface, and people
> >> had to learn to add the query key into the return in order to re-sort.
> >>
> >> So from a historical perspective it would be a real change to go from
> >> returning something as close to parallel to the input to a SQL type
> >> model. I agree that makes sense for those who are versed in SQL, but if
> >> we were really trying to help people familiar with SQL, we would just be
> >> giving them the RSQLite DB and telling them to have at it, no? The whole
> >> rationale behind wrapping the DB in a package is to shield the end user
> >> from having to understand SQL, which includes knowing that SQL doesn't
> >> enforce the order of the returned data.
> >>
> >>  From a use case perspective, the only use case that I am familiar with
> >> is the one where people have a rectangular array of data, where the rows
> >> are genes or whatnot, and they want to append additional columns to the
> >> array, where the added columns conform to the existing array. This is
> >> not always possible to do in the case of 1:many mappings, obviously, but
> >> that is what mapIds() is for - there is an implicit contract there that
> >> you can input a set of keys, and get out an ordered result that you can
> >> then tack onto the side of your array of data and go forward. I think
> >> that is a really useful thing to be able to do. It is obviously quite
> >> easy to take two matrices that are ordered differently, reorder one, and
> >> then bind together. But if you don't have to do that, then it's an
> >> improvement, and quite helpful for naive users who might not have
> >> considered that the returned result might not actually be parallel to
> >> their input.
> >>
> >> And being the one who convinced Marc to put the warnings back in, I will
> >> have to respectively disagree about whether or not it is useful to emit
> >> the warnings. Well, I agree it isn't useful at all for those of us who
> >> already know that the results might not be parallel. But that isn't the
> >> reason for a warning is it? To warn those who already understand what is
> >> going on? My argument is (and was) that a warning is intended to tell
> >> people that something that they /might not have expected/ just happened.
> >> I don't think it is unreasonable to think that a significant proportion
> >> of naive end users would get tripped up by that if the warning weren't
> >> there. They still might be, but at least we gave them a chance.
> >>
> >> Anyway, it took like three lines of code to make it work consistently
> >> for many:many mappings, so that's what I did.
> >>
> >> Best,
> >>
> >> Jim
> >>
> >>
> >>
> >> On Fri, Nov 20, 2015 at 6:32 PM, Hervé Pagès <hpages at fredhutch.org
> >> <mailto:hpages at fredhutch.org>> wrote:
> >>
> >>     On 11/20/2015 03:21 PM, Hervé Pagès wrote:
> >>
> >>         Hi Jim,
> >>
> >>         I think we should choose the biomaRt model, that is, duplicated
> >> are
> >>         allowed but silently ignored.
> >>
> >>         Note that this is also the SQL model. When you do
> >>
> >>             SELECT * FROM ... WHERE key IN c('key1', 'key2', ...)
> >>
> >>                                          ^
> >>     I meant:
> >>
> >>           SELECT * FROM ... WHERE key IN ('key1', 'key2', ...)
> >>
> >>     No c() (too much R lately...)
> >>
> >>     H.
> >>
> >>
> >>
> >>         duplicated keys don't generate duplicates in the output.
> >>
> >>         Also note that, like SELECT, even if the keys supplied to
> >>         biomaRt::getBM() (via the 'values' arg) don't contain duplicates
> >>         and if all the mappings are 1-to-1, biomaRt::getBM() is not
> >>         guarantee to preserve order.
> >>
> >>         Generally speaking having duplicates in the input produce
> >> duplicates
> >>         in the output is useful in vectorized operations when the output
> >>         is expected to be parallel to the input. Vectorized operations
> >> also
> >>         need to propagate NAs and to preserve order. However, like
> SELECT
> >>         and biomaRt::getBM(), select() cannot produce an output that is
> >>         parallel to the input *in general*.
> >>
> >>         It seems that the current philosophy for select() is to emit a
> >> note
> >>         or a warning every time the output is not parallel to the input.
> >>         Personally I find this too noisy and not that useful.
> >>
> >>         Thanks,
> >>         H.
> >>
> >>
> >>         On 11/20/2015 02:30 PM, James W. MacDonald wrote:
> >>
> >>             There is an inconsistency in how select() works in
> >>             AnnotationDbi when a
> >>             user passes in duplicated keys to be mapped, depending on if
> >> the
> >>             mapping is
> >>             1:1 or 1:many. It's easiest to show using an example.
> >>
> >>                 select(org.Hs.eg.db, rep("1", 3), "SYMBOL")
> >>
> >>             'select()' returned many:1 mapping between keys and columns
> >>                 ENTREZID SYMBOL
> >>             1        1   A1BG
> >>             2        1   A1BG
> >>             3        1   A1BG
> >>
> >>                 select(org.Hs.eg.db, rep("1", 3), "GO")
> >>
> >>             'select()' returned many:many mapping between keys and
> columns
> >>                 ENTREZID         GO EVIDENCE ONTOLOGY
> >>             1        1 GO:0003674       ND       MF
> >>             2        1 GO:0003674       ND       MF
> >>             3        1 GO:0003674       ND       MF
> >>
> >>             This is obviously a bug. A single query for that ID results
> >>             in this:
> >>
> >>                 select(org.Hs.eg.db, "1", "GO")
> >>
> >>             'select()' returned 1:many mapping between keys and columns
> >>                 ENTREZID         GO EVIDENCE ONTOLOGY
> >>             1        1 GO:0003674       ND       MF
> >>             2        1 GO:0005576      IDA       CC
> >>             3        1 GO:0005615      IDA       CC
> >>             4        1 GO:0008150       ND       BP
> >>             5        1 GO:0070062      IDA       CC
> >>             6        1 GO:0072562      IDA       CC
> >>
> >>             So the returned results are completely borked.
> >>
> >>             However, the question I have is what should be returned? To
> >>             be consistent
> >>             with the first example, it should be the output expected for
> >>             a single
> >>             key,
> >>             repeated three times, which I have patched AnnotationDbi to
> >> do:
> >>
> >>                 select(org.Hs.eg.db, rep("1", 3), "GO")
> >>
> >>             'select()' returned many:many mapping between keys and
> columns
> >>                  ENTREZID         GO EVIDENCE ONTOLOGY
> >>             1         1 GO:0003674       ND       MF
> >>             2         1 GO:0005576      IDA       CC
> >>             3         1 GO:0005615      IDA       CC
> >>             4         1 GO:0008150       ND       BP
> >>             5         1 GO:0070062      IDA       CC
> >>             6         1 GO:0072562      IDA       CC
> >>             7         1 GO:0003674       ND       MF
> >>             8         1 GO:0005576      IDA       CC
> >>             9         1 GO:0005615      IDA       CC
> >>             10        1 GO:0008150       ND       BP
> >>             11        1 GO:0070062      IDA       CC
> >>             12        1 GO:0072562      IDA       CC
> >>             13        1 GO:0003674       ND       MF
> >>             14        1 GO:0005576      IDA       CC
> >>             15        1 GO:0005615      IDA       CC
> >>             16        1 GO:0008150       ND       BP
> >>             17        1 GO:0070062      IDA       CC
> >>             18        1 GO:0072562      IDA       CC
> >>
> >>             So, two questions.
> >>
> >>
> >>                  1. Should duplicate keys be allowed, or should
> >>             duplicates be removed
> >>                  before querying the database, preferably with a message
> >>             saying
> >>             that dups
> >>                  were removed?
> >>                  2. If the answer to #1 is yes, then to be consistent, I
> >>             will just
> >>             commit
> >>                  the patch I have made to both devel and release.
> >>
> >>             Best,
> >>
> >>             Jim
> >>
> >>
> >>
> >>
> >>
> >>     --
> >>     Hervé Pagès
> >>
> >>     Program in Computational Biology
> >>     Division of Public Health Sciences
> >>     Fred Hutchinson Cancer Research Center
> >>     1100 Fairview Ave. N, M1-B514
> >>     P.O. Box 19024
> >>     Seattle, WA 98109-1024
> >>
> >>     E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
> >>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> >>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
> >>
> >>
> >>
> >>
> >> --
> >> James W. MacDonald, M.S.
> >> Biostatistician
> >> University of Washington
> >> Environmental and Occupational Health Sciences
> >> 4225 Roosevelt Way NE, # 100
> >> Seattle WA 98105-6099
> >>
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fredhutch.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
> >
>
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> This email message may contain legally privileged and/or confidential
> information.  If you are not the intended recipient(s), or the employee or
> agent responsible for the delivery of this message to the intended
> recipient(s), you are hereby notified that any disclosure, copying,
> distribution, or use of this email message is prohibited.  If you have
> received this message in error, please notify the sender immediately by
> e-mail and delete this email message from your computer. Thank you.
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]