[BioC] countMatches() (was: table for GenomicRanges)

Mon Jan 7 22:46:21 CET 2013

On 01/07/2013 11:33 AM, Michael Lawrence wrote:
>
>
>
> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Michael,
>
>     I don't think "match" (the word) always has to mean "equality" either.
>     However having match() (the function) do "whole exact matching" (aka
>     "equality") for any kind of vector-like object has the advantage of:
>
>        (a) making it consistent with base::match() (?base::match is pretty
>            explicit about what the contract of match() is)
>
>
> (a) alone is obviously not enough.  We have many methods, like the set
> operations, that treat ranges specially.  Are we going to start moving
> everything toward the base behavior? And have rangeIntersect,
> rangeSetdiff, etc?
>
>        (b) preserving its relationship with ==, duplicated(), unique(),
>            etc...
>
>
> So it becomes consistent with duplicated/unique, but we lose consistency
> with the set operations.

Nope, we don't loose anything. Because match()/%in% were NOT consistent
with the set operations anyway, that is, 'intersect(x, y)' on
IRanges/GRanges objects was not doing 'x[x %in% y]' (%in% here being
the old %in%).

>
>        (c) not frustrating the user who needs something to do exact
>            matching on ranges (as I mentioned previously, if you take
>            match() away from him/her, s/he'll be left with nothing).
>
>
> No one has ever asked for match() to behave this way.

Here is my use case: internally findMatches()/countMatches() are
implemented on top of match(), the fixed match(). They work on any
object for which match() works. They would also work on objects for
which match() does the wrong thing but they would return something
wrong. They could be made ordinary functions, not generic (and they
will, but they temporarily need to be made generics with methods,
just to smooth the transition), because dispatch happens inside the
function when match() is called. In the man page for those functions
I can just say:

   findMatches(x, table): An enhanced version of ‘match’ that returns
                          all the matches in a Hits object.

and I'm done. It's clear and concise.

The implementation/documentation of findMatches()/countMatches() is
the typical illustration of why having methods that respect the
contract of the generic is a must.

The idea is to build on top of some basic building-blocks for which
the behavior is well-defined, consistent, predictable. It's sooo much
easier, and it's very healthy.

> There was a
> request for a way to tabulate identical ranges. It was a nice idea to
> extract the general "outer equal" findMatches function.

It's also a nice idea to have findMatches() and countMatches() aligned
with match().

> But the changes seem to be snow-balling.

No snow-balling. You cannot snow-ball too far anyway when you restore
consistency. But you can easily snow-ball very far when you go on the
other direction (there is no limits). Do I need to say that aiming for
consistency/predictability is a good goal in software design? It can
only make it *better* in all the meanings of the term: less bugs,
easier to maintain, easier to document, and easier to use in the long
run. Everybody wins. Even if you don't realize it now. Convenience is
also important, but less important than consistency/predictability.
As a matter of fact, an interesting and not immediately obvious side
effect of going consistent is that, in the long run (i.e. when the
software becomes bigger and more complex), it also gives you a form of
convenience for the end-user: documentation is simpler and easier to
read, and there are less special cases to remember.

> These types of changes mean a lot of
> maintenance work for the users. A deprecation cycle does not circumvent
> that.

I don't see why this change would be more work for the users than any
other change. Making RangedData fade away will certainly be a much
bigger one, will take much more time (maybe 2-3 years), and will
require a lot more maintenance work from us (mostly me) and from
the users.

FWIW, the change to match()/%in%  probably means more work for me than
for the users. There is a *lot* of stuff I had to put in place in
IRanges/GenomicRanges to make this transition smooth. But I truly
believe it was worth it. I also fixed all the BioC packages I found
that were affected by those changes (surprisingly, there were very
few: only 5). I could have missed some. Please let me know if that
is the case and I'll fix them too.

Thanks,
H.

>
>
>     IMO those advantages counterbalance *by far* the very little
>     convenience you get from having 'match(query, subject)' do
>     'findOverlaps(query, subject, select="first")' on
>     IRanges/GRanges objects. If you need to do that, just use the
>     latter, or, if you think that's still too much typing, define
>     a wrapper e.g. 'ovmatch(query, subject)'.
>
>     There are plenty of specialized tools around for doing
>     inexact/fuzzy/partial/overlap matching for many particular types
>     of vector-like objects: grep() and family, pmatch(), charmatch(),
>     agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
>     family, findIntervals(), etc... For the reasons I mentioned
>     above, none of them should hijack match() to make it do some
>     particular type of inexact matching on some particular type of
>     objects. Even if, for that particular type of objects, doing that
>     particular type of inexact matching is more common than doing
>     exact matching.
>
>     H.
>
>
>
>     On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>
>         I think having overlapsAny is a nice addition and helps make the API
>         more complete and explicit. Are you sure we need to change the
>         behavior
>         of the match method for this relatively uncommon use case?
>
>
>     Yes because otherwise users with a use case of doing match()
>
>     even if it's uncommon,
>
>
>         I don't think
>         "match" always has to mean "equality". It is a more general
>         concept in
>         my mind. The most common use case for matching ranges is overlap.
>
>
>     Of course "match" doesn't always have to mean equality. But of base
>
>
>         Michael
>
>
>         On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
>              Yes 'peaks %in% genes' is cute and was probably doing the
>         right thing
>              for most users (although not all). But 'exons %in% genes'
>         is cute too
>               and was probably doing the wrong thing for all users.
>         Advanced users
>              like you guys would have no problem switching to
>
>                 !is.na <http://is.na>
>         <http://is.na>(findOverlaps(__peaks, genes, type="within",
>              select="any"))
>
>              or
>
>                 !is.na <http://is.na>
>         <http://is.na>(findOverlaps(__peaks, genes, type="equal",
>
>              select="any"))
>
>              in case 'peaks %in% genes' was not doing exactly what you
>         wanted,
>              but most users would not find this particularly friendly. Even
>              worse, some users probably didn't realize that 'peaks %in%
>         genes'
>              was not doing exactly what they thought it did because
>         "peaks in
>              genes" in English suggests that the peaks are within the genes,
>              but it's not what 'peaks %in% genes' does.
>
>              Having overlapsAny(), with exactly the same extra arguments as
>              countOverlaps() and subsetByOverlaps() (i.e. 'maxgap',
>         'minoverlap',
>              'type', 'ignore.strand'), all of them documented (and with most
>              users more or less familiar with them already) has the
>         virtue to
>              expose the user to all the options from the very start, and to
>              help him/her make the right choice. Of course there will be
>         users
>              that don't want or don't have the time to read/think about
>         all the
>              options. Not a big deal: they'll just do
>         'overlapsAny(query, subject)',
>              which is not a lot more typing than 'query %in% subject',
>         especially
>              if they use tab completion.
>
>              It's true that it's more common to ask questions about
>         overlap than
>              about equality but there are some use cases for the latter
>         (as the
>              original thread shows). Until now, when you had such a use
>         case, you
>              could not use match() or %in%, which would have been the
>         natural things
>              to use, because they got hijacked to do something else, and
>         you were
>              left with nothing. Not a satisfying situation. So at a
>         minimum, we
>              needed to restore the true/real/original semantic of
>         match() to do
>              "equality" instead of "overlap". But it's hard to do this
>         for match()
>              and not do it for %in% too. For more than 99% of R users,
>         %in% is
>              just a simple wrapper for 'match(x, table, nomatch = 0) >
>         0' (this
>              is how it has been documented and implemented in base R for
>         many
>              years). Not maintaining this relationship between %in% and
>         match()
>              would only cause grief and frustration to newcomers to
>         Bioconductor.
>
>              H.
>
>
>
>              On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>
>                  Hiya again,
>
>                  I am definitely a late comer to BioC, so I definitely
>         easily
>                  defer to
>                  the tide of history.
>
>                  But I do think you miss my point Michael about the
>         proposed change
>                  making the relationship between %in% and match for
>         {G,I}Ranges{List}
>                  mimic that between other vectors, and I do think that
>         changing
>                  the API
>                  would make other late-comers take to BioC easier/faster.
>
>                  That said, I NEVER use %in% so I really have no stake
>         in the
>                  matter, and
>                  I DEFINITELY appreciate the argument to not changing
>         the API
>                  just for
>                  sematic sweetness.
>
>                  That that said, Herve is _/so good/_ about deprecations
>         and warnings
>
>                  that make such changes fairly easily digestible.
>
>                  That that that.... enough.... I bow out of this one....!!!!
>
>                  Always learning and Happy New Year to all lurkers,
>
>                  ~Malcolm
>
>                  *From:*Michael Lawrence [mailto:lawrence.michael at gene.
>         <mailto:lawrence.michael at gene.>____com
>
>                  <mailto:lawrence.michael at gene.__com
>         <mailto:lawrence.michael at gene.com>>]
>                  *Sent:* Friday, January 04, 2013 5:11 PM
>                  *To:* Cook, Malcolm
>                  *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès
>                  (hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>); Tim
>
>
>                  Triche, Jr.; Vedran Franke; bioconductor at r-project.org
>         <mailto:bioconductor at r-project.org>
>                  <mailto:bioconductor at r-__project.org
>         <mailto:bioconductor at r-project.org>>
>                  *Subject:* Re: [BioC] countMatches() (was: table for
>         GenomicRanges)
>
>
>                  On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm
>         <MEC at stowers.org <mailto:MEC at stowers.org>
>                  <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>
>                  <mailto:MEC at stowers.org <mailto:MEC at stowers.org>
>         <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>>> wrote:
>
>                  Hiya,
>
>                  For what it is worth...
>
>                  I think the change to %in% is warranted.
>
>                  If I understand correctly, this change restores the
>         relationship
>                  between
>                  the semantics of `%in` and the semantics of `match`.
>
>                    From the docs:
>
>                      '"%in%" <- function(x, table) match(x, table,
>         nomatch = 0) > 0'
>
>                  Herve's change restores this relationship.
>
>
>                  match and %in% were initially consistent (both
>         considering any
>                  overlap);
>                  Herve has changed both of them together. The whole idea
>         behind
>                  IRanges
>                  is that ranges are special data types with special
>         semantics. We
>                  have
>                  reimplemented much of the existing R vector API using those
>                  semantics;
>                  this extends beyond match/%in%. I am hesitant about
>         making such
>                  sweeping
>                  changes to the API so late in the life-cycle of the
>         package.
>                  There was a
>                  feature request for a way to count identical ranges in
>         a set of
>                  ranges.
>                  Let's please not get carried away and start redesigning
>         the API
>                  for this
>                  one, albeit useful, request. There are all sorts of
>                  inconsistencies in
>                  the API, and many of them were conscious decisions that
>         considered
>                  practical use cases.
>
>                  Michael
>
>
>                       Herve, I suspect you were you as a result able to
>                  completely drop
>                       all the `%in%,BiocClass1,BiocClass2` definitions
>         and depend
>                  upon
>                       base::%in%
>
>                       Am I right?
>
>                       If so, may I suggest that Herve stay the course,
>         with the
>                  addition of
>                          '"%ol%" <- function(a, b) findOverlaps(a, b,
>         maxgap=0L,
>                       minoverlap=1L, type='any', select='all') > 0'
>
>                       This would provide a perspicacious idiom, thereby
>                  optimizing the API
>                       for Michaels observed common use case.
>
>                       Just sayin'
>
>                       ~Malcolm
>
>
>                         .-----Original Message-----
>                         .From: bioconductor-bounces at r-____project.org
>         <mailto:bioconductor-bounces at r-__project.org>
>                  <mailto:bioconductor-bounces at __r-project.org
>         <mailto:bioconductor-bounces at r-project.org>>
>                       <mailto:bioconductor-bounces@
>         <mailto:bioconductor-bounces@>____r-project.org
>         <http://r-project.org>
>                  <mailto:bioconductor-bounces at __r-project.org
>         <mailto:bioconductor-bounces at r-project.org>>>
>                       [mailto:bioconductor-bounces@
>         <mailto:bioconductor-bounces@>____r-project.org
>         <http://r-project.org>
>                  <mailto:bioconductor-bounces at __r-project.org
>         <mailto:bioconductor-bounces at r-project.org>>
>
>                       <mailto:bioconductor-bounces@
>         <mailto:bioconductor-bounces@>____r-project.org
>         <http://r-project.org>
>
>                  <mailto:bioconductor-bounces at __r-project.org
>         <mailto:bioconductor-bounces at r-project.org>>>] On Behalf Of Sean
>                  Davis
>                         .Sent: Friday, January 04, 2013 3:37 PM
>                         .To: Michael Lawrence
>                         .Cc: Tim Triche, Jr.; Vedran Franke;
>         bioconductor at r-project.org <mailto:bioconductor at r-project.org>
>         <mailto:bioconductor at r-__project.org
>         <mailto:bioconductor at r-project.org>>
>                       <mailto:bioconductor at r-____project.org
>         <mailto:bioconductor at r-__project.org>
>
>                  <mailto:bioconductor at r-__project.org
>         <mailto:bioconductor at r-project.org>>>
>
>                         .Subject: Re: [BioC] countMatches() (was: table for
>                  GenomicRanges)
>                         .
>                         .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
>                         .<lawrence.michael at gene.com
>         <mailto:lawrence.michael at gene.com>
>                  <mailto:lawrence.michael at gene.__com
>         <mailto:lawrence.michael at gene.com>>
>                  <mailto:lawrence.michael at gene.
>         <mailto:lawrence.michael at gene.>____com
>
>                  <mailto:lawrence.michael at gene.__com
>         <mailto:lawrence.michael at gene.com>>>> wrote:
>                         .> The change to the behavior of %in% is a
>         pretty big
>                  one. Are you
>                       thinking
>                         .> that all set-based operations should behave
>         this way? For
>                       example, setdiff
>                         .> and intersect? I really liked the syntax of
>         "peaks
>                  %in% genes".
>                       In my
>                         .> experience, it's way more common to ask questions
>                  about overlap
>                       than about
>                         .> equality, so I'd rather optimize the API for
>         that use
>                  case. But
>                       again,
>                         .> that's just my personal bias.
>                         .
>                         .For what it is worth, I share Michael's
>         personal bias here.
>                         .
>                         .Sean
>                         .
>                         .
>                         .> Michael
>                         .>
>                         .>
>                         .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès
>                  <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>                       <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>>> wrote:
>                         .>
>                         .>> Hi,
>                         .>>
>                         .>> I added findMatches() and countMatches() to the
>                  latest IRanges /
>                         .>> GenomicRanges packages (in BioC devel only).
>                         .>>
>                         .>>   findMatches(x, table): An enhanced version of
>                  ‘match’ that
>                         .>>           returns all the matches in a Hits
>         object.
>                         .>>
>                         .>>   countMatches(x, table): Returns an integer
>         vector
>                  of the length
>                         .>>           of ‘x’, containing the number of
>         matches in
>                  ‘table’ for
>                         .>>           each element in ‘x’.
>                         .>>
>
>                         .>> countMatches() is what you can use to
>                  tally/count/tabulate
>                       (choose your
>
>                         .>> preferred term) the unique elements in a
>         GRanges object:
>                         .>>
>                         .>>   library(GenomicRanges)
>                         .>>   set.seed(33)
>                         .>>   gr <- GRanges("chr1",
>                  IRanges(sample(15,20,replace=*____*TRUE),
>
>                       width=5))
>                         .>>
>                         .>> Then:
>                         .>>
>                         .>>   > gr_levels <- sort(unique(gr))
>                         .>>   > countMatches(gr_levels, gr)
>                         .>>    [1] 1 1 1 2 4 2 2 1 2 2 2
>                         .>>
>                         .>> Note that findMatches() and countMatches()
>         also work on
>                       IRanges and
>                         .>> DNAStringSet objects, as well as on ordinary
>         atomic
>                  vectors:
>                         .>>
>                         .>>   library(hgu95av2probe)
>                         .>>   library(Biostrings)
>                         .>>   probes <- DNAStringSet(hgu95av2probe)
>                         .>>   unique_probes <- unique(probes)
>                         .>>   count <- countMatches(unique_probes, probes)
>                         .>>   max(count)  # 7
>                         .>>
>                         .>> I made other changes in
>         IRanges/GenomicRanges so that
>                  the notion
>                         .>> of "match" between elements of a vector-like
>         object now
>                       consistently
>                         .>> means "equality" instead of "overlap", even for
>                  range-based
>                       objects
>                         .>> like IRanges or GRanges objects. This notion of
>                  "equality" is the
>                         .>> same that is used by ==. The most visible
>         consequence
>                  of those
>                         .>> changes is that using %in% between 2 IRanges or
>                  GRanges objects
>                         .>> 'query' and 'subject' in order to do
>         overlaps was
>                  replaced by
>                         .>> overlapsAny(query, subject).
>                         .>>
>                         .>>   overlapsAny(query, subject): Finds the
>         ranges in
>                  ‘query’ that
>                         .>>      overlap any of the ranges in ‘subject’.
>                         .>>
>
>                         .>> There are warnings and deprecation messages
>         in place
>                  to help
>                       smooth
>
>                         .>> the transition.
>                         .>>
>                         .>> Cheers,
>                         .>> H.
>                         .>>
>                         .>> --
>                         .>> Hervé Pagès
>                         .>>
>                         .>> Program in Computational Biology
>                         .>> Division of Public Health Sciences
>                         .>> Fred Hutchinson Cancer Research Center
>                         .>> 1100 Fairview Ave. N, M1-B514
>                         .>> P.O. Box 19024
>                         .>> Seattle, WA 98109-1024
>                         .>>
>                         .>> E-mail: hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org>>
>                  <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>>
>
>                         .>> Phone: (206) 667-5791
>         <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
>                  <tel:%28206%29%20667-5791>
>                         .>> Fax: (206) 667-1319
>         <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
>                  <tel:%28206%29%20667-1319>
>
>                         .>>
>                         .>
>                         .>         [[alternative HTML version deleted]]
>                         .>
>                         .>
>                         .>
>         ___________________________________________________
>
>                         .> Bioconductor mailing list
>                         .> Bioconductor at r-project.org
>         <mailto:Bioconductor at r-project.org>
>                  <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>                  <mailto:Bioconductor at r-____project.org
>         <mailto:Bioconductor at r-__project.org>
>                  <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>>
>
>                         .>
>         https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                         .> Search the archives:
>         http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>                         .
>                         .___________________________________________________
>
>                         .Bioconductor mailing list
>                         .Bioconductor at r-project.org
>         <mailto:Bioconductor at r-project.org>
>                  <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>                  <mailto:Bioconductor at r-____project.org
>         <mailto:Bioconductor at r-__project.org>
>                  <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>>
>
>
>           .https://stat.ethz.ch/mailman/____listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>                  <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                         .Search the archives:
>         http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>
>         <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>              --
>              Hervé Pagès
>
>              Program in Computational Biology
>              Division of Public Health Sciences
>              Fred Hutchinson Cancer Research Center
>              1100 Fairview Ave. N, M1-B514
>              P.O. Box 19024
>              Seattle, WA 98109-1024
>
>              E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>              Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>         <tel:%28206%29%20667-5791>
>              Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>         <tel:%28206%29%20667-1319>
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319