[BioC] countMatches() (was: table for GenomicRanges)

Tue Jan 8 18:20:12 CET 2013

Hi Tim,

I could add the %ov% operator as a replacement for the old %in%. So you
would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
be a convenience wrapper for 'overlapsAny(peaks, genes)'.

Cheers,
H.

On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
> So why not leave %in% as it was and transition everything forward to
> explicitly using {  `%within%`, `%overlaps%`|`%overlapping%`, `%equals%`
> } such that
>
>    identical( x %within% table, countOverlaps(x, table, type='within') >
> 0 ) == TRUE
>    identical( x %overlaps% table, countOverlaps(x, table, type='any') >
> 0 ) == TRUE
>    identical( x %equals% table, countOverlaps(x, table, type='equal') >
> 0 ) == TRUE
>
> and for the time being,
>
>    identical( x %overlaps% table, countOverlaps(x, table, type='any') >
> 0 ) == TRUE ## but with a noisy nastygram that will halt if
> options("warn"=2)
> No breakage for %in% methods until such time as a full deprecation cycle
> has passed, and if the maintainers can't be arsed to do anything at all
> about the warnings by the second full release, then perhaps they don't
> really care that much after all.  Just a thought?
>
>  From someone (me) who has their own issues with keeping everything up
> to date and should know better.  If you want to use %in% for
>
>    peaks %in% genes (why on earth would you do this rather than peaks
> %in% promoters(genes), anyways?)
>
> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
> or less) happy.
>
>
>
> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence
> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
>
>
>
>
>     On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at fhcrc.org
>     <mailto:hpages at fhcrc.org>> wrote:
>
>         Hi Michael,
>
>         I don't think "match" (the word) always has to mean "equality"
>         either.
>         However having match() (the function) do "whole exact matching" (aka
>         "equality") for any kind of vector-like object has the advantage of:
>
>            (a) making it consistent with base::match() (?base::match is
>         pretty
>                explicit about what the contract of match() is)
>
>
>     (a) alone is obviously not enough.  We have many methods, like the
>     set operations, that treat ranges specially.  Are we going to start
>     moving everything toward the base behavior? And have rangeIntersect,
>     rangeSetdiff, etc?
>
>            (b) preserving its relationship with ==, duplicated(), unique(),
>                etc...
>
>
>     So it becomes consistent with duplicated/unique, but we lose
>     consistency with the set operations.
>
>            (c) not frustrating the user who needs something to do exact
>                matching on ranges (as I mentioned previously, if you take
>                match() away from him/her, s/he'll be left with nothing).
>
>
>     No one has ever asked for match() to behave this way. There was a
>     request for a way to tabulate identical ranges. It was a nice idea
>     to extract the general "outer equal" findMatches function. But the
>     changes seem to be snow-balling.  These types of changes mean a lot
>     of maintenance work for the users. A deprecation cycle does not
>     circumvent that.
>
>
>         IMO those advantages counterbalance *by far* the very little
>         convenience you get from having 'match(query, subject)' do
>         'findOverlaps(query, subject, select="first")' on
>         IRanges/GRanges objects. If you need to do that, just use the
>         latter, or, if you think that's still too much typing, define
>         a wrapper e.g. 'ovmatch(query, subject)'.
>
>         There are plenty of specialized tools around for doing
>         inexact/fuzzy/partial/overlap matching for many particular types
>         of vector-like objects: grep() and family, pmatch(), charmatch(),
>         agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
>         family, findIntervals(), etc... For the reasons I mentioned
>         above, none of them should hijack match() to make it do some
>         particular type of inexact matching on some particular type of
>         objects. Even if, for that particular type of objects, doing that
>         particular type of inexact matching is more common than doing
>         exact matching.
>
>         H.
>
>
>
>         On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>
>             I think having overlapsAny is a nice addition and helps make
>             the API
>             more complete and explicit. Are you sure we need to change
>             the behavior
>             of the match method for this relatively uncommon use case?
>
>
>         Yes because otherwise users with a use case of doing match()
>
>         even if it's uncommon,
>
>
>             I don't think
>             "match" always has to mean "equality". It is a more general
>             concept in
>             my mind. The most common use case for matching ranges is
>             overlap.
>
>
>         Of course "match" doesn't always have to mean equality. But of base
>
>
>             Michael
>
>
>             On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès
>             <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
>                  Yes 'peaks %in% genes' is cute and was probably doing
>             the right thing
>                  for most users (although not all). But 'exons %in%
>             genes' is cute too
>                   and was probably doing the wrong thing for all users.
>             Advanced users
>                  like you guys would have no problem switching to
>
>                     !is.na <http://is.na>
>             <http://is.na>(findOverlaps(__peaks, genes, type="within",
>                  select="any"))
>
>                  or
>
>                     !is.na <http://is.na>
>             <http://is.na>(findOverlaps(__peaks, genes, type="equal",
>
>                  select="any"))
>
>                  in case 'peaks %in% genes' was not doing exactly what
>             you wanted,
>                  but most users would not find this particularly
>             friendly. Even
>                  worse, some users probably didn't realize that 'peaks
>             %in% genes'
>                  was not doing exactly what they thought it did because
>             "peaks in
>                  genes" in English suggests that the peaks are within
>             the genes,
>                  but it's not what 'peaks %in% genes' does.
>
>                  Having overlapsAny(), with exactly the same extra
>             arguments as
>                  countOverlaps() and subsetByOverlaps() (i.e. 'maxgap',
>             'minoverlap',
>                  'type', 'ignore.strand'), all of them documented (and
>             with most
>                  users more or less familiar with them already) has the
>             virtue to
>                  expose the user to all the options from the very start,
>             and to
>                  help him/her make the right choice. Of course there
>             will be users
>                  that don't want or don't have the time to read/think
>             about all the
>                  options. Not a big deal: they'll just do
>             'overlapsAny(query, subject)',
>                  which is not a lot more typing than 'query %in%
>             subject', especially
>                  if they use tab completion.
>
>                  It's true that it's more common to ask questions about
>             overlap than
>                  about equality but there are some use cases for the
>             latter (as the
>                  original thread shows). Until now, when you had such a
>             use case, you
>                  could not use match() or %in%, which would have been
>             the natural things
>                  to use, because they got hijacked to do something else,
>             and you were
>                  left with nothing. Not a satisfying situation. So at a
>             minimum, we
>                  needed to restore the true/real/original semantic of
>             match() to do
>                  "equality" instead of "overlap". But it's hard to do
>             this for match()
>                  and not do it for %in% too. For more than 99% of R
>             users, %in% is
>                  just a simple wrapper for 'match(x, table, nomatch = 0)
>              > 0' (this
>                  is how it has been documented and implemented in base R
>             for many
>                  years). Not maintaining this relationship between %in%
>             and match()
>                  would only cause grief and frustration to newcomers to
>             Bioconductor.
>
>                  H.
>
>
>
>                  On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>
>                      Hiya again,
>
>                      I am definitely a late comer to BioC, so I
>             definitely easily
>                      defer to
>                      the tide of history.
>
>                      But I do think you miss my point Michael about the
>             proposed change
>                      making the relationship between %in% and match for
>             {G,I}Ranges{List}
>                      mimic that between other vectors, and I do think
>             that changing
>                      the API
>                      would make other late-comers take to BioC
>             easier/faster.
>
>                      That said, I NEVER use %in% so I really have no
>             stake in the
>                      matter, and
>                      I DEFINITELY appreciate the argument to not
>             changing the API
>                      just for
>                      sematic sweetness.
>
>                      That that said, Herve is _/so good/_ about
>             deprecations and warnings
>
>                      that make such changes fairly easily digestible.
>
>                      That that that.... enough.... I bow out of this
>             one....!!!!
>
>                      Always learning and Happy New Year to all lurkers,
>
>                      ~Malcolm
>
>                      *From:*Michael Lawrence
>             [mailto:lawrence.michael at gene.
>             <mailto:lawrence.michael at gene.>____com
>
>                      <mailto:lawrence.michael at gene.__com
>             <mailto:lawrence.michael at gene.com>>]
>                      *Sent:* Friday, January 04, 2013 5:11 PM
>                      *To:* Cook, Malcolm
>                      *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès
>                      (hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>); Tim
>
>
>                      Triche, Jr.; Vedran Franke;
>             bioconductor at r-project.org <mailto:bioconductor at r-project.org>
>                      <mailto:bioconductor at r-__project.org
>             <mailto:bioconductor at r-project.org>>
>                      *Subject:* Re: [BioC] countMatches() (was: table
>             for GenomicRanges)
>
>
>                      On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm
>             <MEC at stowers.org <mailto:MEC at stowers.org>
>                      <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>
>                      <mailto:MEC at stowers.org <mailto:MEC at stowers.org>
>             <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>>> wrote:
>
>                      Hiya,
>
>                      For what it is worth...
>
>                      I think the change to %in% is warranted.
>
>                      If I understand correctly, this change restores the
>             relationship
>                      between
>                      the semantics of `%in` and the semantics of `match`.
>
>                        From the docs:
>
>                          '"%in%" <- function(x, table) match(x, table,
>             nomatch = 0) > 0'
>
>                      Herve's change restores this relationship.
>
>
>                      match and %in% were initially consistent (both
>             considering any
>                      overlap);
>                      Herve has changed both of them together. The whole
>             idea behind
>                      IRanges
>                      is that ranges are special data types with special
>             semantics. We
>                      have
>                      reimplemented much of the existing R vector API
>             using those
>                      semantics;
>                      this extends beyond match/%in%. I am hesitant about
>             making such
>                      sweeping
>                      changes to the API so late in the life-cycle of the
>             package.
>                      There was a
>                      feature request for a way to count identical ranges
>             in a set of
>                      ranges.
>                      Let's please not get carried away and start
>             redesigning the API
>                      for this
>                      one, albeit useful, request. There are all sorts of
>                      inconsistencies in
>                      the API, and many of them were conscious decisions
>             that considered
>                      practical use cases.
>
>                      Michael
>
>
>                           Herve, I suspect you were you as a result able to
>                      completely drop
>                           all the `%in%,BiocClass1,BiocClass2`
>             definitions and depend
>                      upon
>                           base::%in%
>
>                           Am I right?
>
>                           If so, may I suggest that Herve stay the
>             course, with the
>                      addition of
>                              '"%ol%" <- function(a, b) findOverlaps(a,
>             b, maxgap=0L,
>                           minoverlap=1L, type='any', select='all') > 0'
>
>                           This would provide a perspicacious idiom, thereby
>                      optimizing the API
>                           for Michaels observed common use case.
>
>                           Just sayin'
>
>                           ~Malcolm
>
>
>                             .-----Original Message-----
>                             .From:
>             bioconductor-bounces at r-____project.org
>             <mailto:bioconductor-bounces at r-__project.org>
>                      <mailto:bioconductor-bounces at __r-project.org
>             <mailto:bioconductor-bounces at r-project.org>>
>                           <mailto:bioconductor-bounces@
>             <mailto:bioconductor-bounces@>____r-project.org
>             <http://r-project.org>
>                      <mailto:bioconductor-bounces at __r-project.org
>             <mailto:bioconductor-bounces at r-project.org>>>
>                           [mailto:bioconductor-bounces@
>             <mailto:bioconductor-bounces@>____r-project.org
>             <http://r-project.org>
>                      <mailto:bioconductor-bounces at __r-project.org
>             <mailto:bioconductor-bounces at r-project.org>>
>
>                           <mailto:bioconductor-bounces@
>             <mailto:bioconductor-bounces@>____r-project.org
>             <http://r-project.org>
>
>                      <mailto:bioconductor-bounces at __r-project.org
>             <mailto:bioconductor-bounces at r-project.org>>>] On Behalf Of Sean
>                      Davis
>                             .Sent: Friday, January 04, 2013 3:37 PM
>                             .To: Michael Lawrence
>                             .Cc: Tim Triche, Jr.; Vedran Franke;
>             bioconductor at r-project.org
>             <mailto:bioconductor at r-project.org>
>             <mailto:bioconductor at r-__project.org
>             <mailto:bioconductor at r-project.org>>
>                           <mailto:bioconductor at r-____project.org
>             <mailto:bioconductor at r-__project.org>
>
>                      <mailto:bioconductor at r-__project.org
>             <mailto:bioconductor at r-project.org>>>
>
>                             .Subject: Re: [BioC] countMatches() (was:
>             table for
>                      GenomicRanges)
>                             .
>                             .On Fri, Jan 4, 2013 at 4:32 PM, Michael
>             Lawrence
>                             .<lawrence.michael at gene.com
>             <mailto:lawrence.michael at gene.com>
>                      <mailto:lawrence.michael at gene.__com
>             <mailto:lawrence.michael at gene.com>>
>                      <mailto:lawrence.michael at gene.
>             <mailto:lawrence.michael at gene.>____com
>
>                      <mailto:lawrence.michael at gene.__com
>             <mailto:lawrence.michael at gene.com>>>> wrote:
>                             .> The change to the behavior of %in% is a
>             pretty big
>                      one. Are you
>                           thinking
>                             .> that all set-based operations should
>             behave this way? For
>                           example, setdiff
>                             .> and intersect? I really liked the syntax
>             of "peaks
>                      %in% genes".
>                           In my
>                             .> experience, it's way more common to ask
>             questions
>                      about overlap
>                           than about
>                             .> equality, so I'd rather optimize the API
>             for that use
>                      case. But
>                           again,
>                             .> that's just my personal bias.
>                             .
>                             .For what it is worth, I share Michael's
>             personal bias here.
>                             .
>                             .Sean
>                             .
>                             .
>                             .> Michael
>                             .>
>                             .>
>                             .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès
>                      <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>                           <mailto:hpages at fhcrc.org
>             <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
>             <mailto:hpages at fhcrc.org>>>> wrote:
>                             .>
>                             .>> Hi,
>                             .>>
>                             .>> I added findMatches() and countMatches()
>             to the
>                      latest IRanges /
>                             .>> GenomicRanges packages (in BioC devel only).
>                             .>>
>                             .>>   findMatches(x, table): An enhanced
>             version of
>                      ‘match’ that
>                             .>>           returns all the matches in a
>             Hits object.
>                             .>>
>                             .>>   countMatches(x, table): Returns an
>             integer vector
>                      of the length
>                             .>>           of ‘x’, containing the number
>             of matches in
>                      ‘table’ for
>                             .>>           each element in ‘x’.
>                             .>>
>
>                             .>> countMatches() is what you can use to
>                      tally/count/tabulate
>                           (choose your
>
>                             .>> preferred term) the unique elements in a
>             GRanges object:
>                             .>>
>                             .>>   library(GenomicRanges)
>                             .>>   set.seed(33)
>                             .>>   gr <- GRanges("chr1",
>                      IRanges(sample(15,20,replace=*____*TRUE),
>
>                           width=5))
>                             .>>
>                             .>> Then:
>                             .>>
>                             .>>   > gr_levels <- sort(unique(gr))
>                             .>>   > countMatches(gr_levels, gr)
>                             .>>    [1] 1 1 1 2 4 2 2 1 2 2 2
>                             .>>
>                             .>> Note that findMatches() and
>             countMatches() also work on
>                           IRanges and
>                             .>> DNAStringSet objects, as well as on
>             ordinary atomic
>                      vectors:
>                             .>>
>                             .>>   library(hgu95av2probe)
>                             .>>   library(Biostrings)
>                             .>>   probes <- DNAStringSet(hgu95av2probe)
>                             .>>   unique_probes <- unique(probes)
>                             .>>   count <- countMatches(unique_probes,
>             probes)
>                             .>>   max(count)  # 7
>                             .>>
>                             .>> I made other changes in
>             IRanges/GenomicRanges so that
>                      the notion
>                             .>> of "match" between elements of a
>             vector-like object now
>                           consistently
>                             .>> means "equality" instead of "overlap",
>             even for
>                      range-based
>                           objects
>                             .>> like IRanges or GRanges objects. This
>             notion of
>                      "equality" is the
>                             .>> same that is used by ==. The most
>             visible consequence
>                      of those
>                             .>> changes is that using %in% between 2
>             IRanges or
>                      GRanges objects
>                             .>> 'query' and 'subject' in order to do
>             overlaps was
>                      replaced by
>                             .>> overlapsAny(query, subject).
>                             .>>
>                             .>>   overlapsAny(query, subject): Finds the
>             ranges in
>                      ‘query’ that
>                             .>>      overlap any of the ranges in ‘subject’.
>                             .>>
>
>                             .>> There are warnings and deprecation
>             messages in place
>                      to help
>                           smooth
>
>                             .>> the transition.
>                             .>>
>                             .>> Cheers,
>                             .>> H.
>                             .>>
>                             .>> --
>                             .>> Hervé Pagès
>                             .>>
>                             .>> Program in Computational Biology
>                             .>> Division of Public Health Sciences
>                             .>> Fred Hutchinson Cancer Research Center
>                             .>> 1100 Fairview Ave. N, M1-B514
>                             .>> P.O. Box 19024
>                             .>> Seattle, WA 98109-1024
>                             .>>
>                             .>> E-mail: hpages at fhcrc.org
>             <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
>             <mailto:hpages at fhcrc.org>>
>                      <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>>
>
>                             .>> Phone: (206) 667-5791
>             <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
>                      <tel:%28206%29%20667-5791>
>                             .>> Fax: (206) 667-1319
>             <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
>                      <tel:%28206%29%20667-1319>
>
>                             .>>
>                             .>
>                             .>         [[alternative HTML version deleted]]
>                             .>
>                             .>
>                             .>
>             ___________________________________________________
>
>                             .> Bioconductor mailing list
>                             .> Bioconductor at r-project.org
>             <mailto:Bioconductor at r-project.org>
>                      <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>
>                      <mailto:Bioconductor at r-____project.org
>             <mailto:Bioconductor at r-__project.org>
>                      <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>>
>
>                             .>
>             https://stat.ethz.ch/mailman/____listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                             .> Search the archives:
>             http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>                             .
>
>               .___________________________________________________
>
>                             .Bioconductor mailing list
>                             .Bioconductor at r-project.org
>             <mailto:Bioconductor at r-project.org>
>                      <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>
>                      <mailto:Bioconductor at r-____project.org
>             <mailto:Bioconductor at r-__project.org>
>                      <mailto:Bioconductor at r-__project.org
>             <mailto:Bioconductor at r-project.org>>>
>
>
>               .https://stat.ethz.ch/mailman/____listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>
>             <https://stat.ethz.ch/mailman/__listinfo/bioconductor
>             <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
>                             .Search the archives:
>             http://news.gmane.org/gmane.____science.biology.informatics.____conductor
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>
>             <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>             <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
>                  --
>                  Hervé Pagès
>
>                  Program in Computational Biology
>                  Division of Public Health Sciences
>                  Fred Hutchinson Cancer Research Center
>                  1100 Fairview Ave. N, M1-B514
>                  P.O. Box 19024
>                  Seattle, WA 98109-1024
>
>                  E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>                  Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>             <tel:%28206%29%20667-5791>
>                  Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>             <tel:%28206%29%20667-1319>
>
>
>
>         --
>         Hervé Pagès
>
>         Program in Computational Biology
>         Division of Public Health Sciences
>         Fred Hutchinson Cancer Research Center
>         1100 Fairview Ave. N, M1-B514
>         P.O. Box 19024
>         Seattle, WA 98109-1024
>
>         E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>         Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319