[BioC] countMatches() (was: table for GenomicRanges)
Hervé Pagès
hpages at fhcrc.org
Tue Jan 8 18:20:12 CET 2013
Hi Tim,
I could add the %ov% operator as a replacement for the old %in%. So you
would write 'peaks %ov% genes' instead of 'peaks %in% genes'. Would just
be a convenience wrapper for 'overlapsAny(peaks, genes)'.
Cheers,
H.
On 01/07/2013 11:45 AM, Tim Triche, Jr. wrote:
> So why not leave %in% as it was and transition everything forward to
> explicitly using { `%within%`, `%overlaps%`|`%overlapping%`, `%equals%`
> } such that
>
> identical( x %within% table, countOverlaps(x, table, type='within') >
> 0 ) == TRUE
> identical( x %overlaps% table, countOverlaps(x, table, type='any') >
> 0 ) == TRUE
> identical( x %equals% table, countOverlaps(x, table, type='equal') >
> 0 ) == TRUE
>
> and for the time being,
>
> identical( x %overlaps% table, countOverlaps(x, table, type='any') >
> 0 ) == TRUE ## but with a noisy nastygram that will halt if
> options("warn"=2)
> No breakage for %in% methods until such time as a full deprecation cycle
> has passed, and if the maintainers can't be arsed to do anything at all
> about the warnings by the second full release, then perhaps they don't
> really care that much after all. Just a thought?
>
> From someone (me) who has their own issues with keeping everything up
> to date and should know better. If you want to use %in% for
>
> peaks %in% genes (why on earth would you do this rather than peaks
> %in% promoters(genes), anyways?)
>
> then a nastygram could be emitted "WARNING: YOUR SHORTHAND NOTATION IS
> DOOMED AFTER BIOC 2.13, YOU WILL BE ASSIMILATED" and everyone is (more
> or less) happy.
>
>
>
> On Mon, Jan 7, 2013 at 11:33 AM, Michael Lawrence
> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
>
>
>
>
> On Mon, Jan 7, 2013 at 11:00 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Hi Michael,
>
> I don't think "match" (the word) always has to mean "equality"
> either.
> However having match() (the function) do "whole exact matching" (aka
> "equality") for any kind of vector-like object has the advantage of:
>
> (a) making it consistent with base::match() (?base::match is
> pretty
> explicit about what the contract of match() is)
>
>
> (a) alone is obviously not enough. We have many methods, like the
> set operations, that treat ranges specially. Are we going to start
> moving everything toward the base behavior? And have rangeIntersect,
> rangeSetdiff, etc?
>
> (b) preserving its relationship with ==, duplicated(), unique(),
> etc...
>
>
> So it becomes consistent with duplicated/unique, but we lose
> consistency with the set operations.
>
> (c) not frustrating the user who needs something to do exact
> matching on ranges (as I mentioned previously, if you take
> match() away from him/her, s/he'll be left with nothing).
>
>
> No one has ever asked for match() to behave this way. There was a
> request for a way to tabulate identical ranges. It was a nice idea
> to extract the general "outer equal" findMatches function. But the
> changes seem to be snow-balling. These types of changes mean a lot
> of maintenance work for the users. A deprecation cycle does not
> circumvent that.
>
>
> IMO those advantages counterbalance *by far* the very little
> convenience you get from having 'match(query, subject)' do
> 'findOverlaps(query, subject, select="first")' on
> IRanges/GRanges objects. If you need to do that, just use the
> latter, or, if you think that's still too much typing, define
> a wrapper e.g. 'ovmatch(query, subject)'.
>
> There are plenty of specialized tools around for doing
> inexact/fuzzy/partial/overlap matching for many particular types
> of vector-like objects: grep() and family, pmatch(), charmatch(),
> agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
> family, findIntervals(), etc... For the reasons I mentioned
> above, none of them should hijack match() to make it do some
> particular type of inexact matching on some particular type of
> objects. Even if, for that particular type of objects, doing that
> particular type of inexact matching is more common than doing
> exact matching.
>
> H.
>
>
>
> On 01/06/2013 05:39 PM, Michael Lawrence wrote:
>
> I think having overlapsAny is a nice addition and helps make
> the API
> more complete and explicit. Are you sure we need to change
> the behavior
> of the match method for this relatively uncommon use case?
>
>
> Yes because otherwise users with a use case of doing match()
>
> even if it's uncommon,
>
>
> I don't think
> "match" always has to mean "equality". It is a more general
> concept in
> my mind. The most common use case for matching ranges is
> overlap.
>
>
> Of course "match" doesn't always have to mean equality. But of base
>
>
> Michael
>
>
> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès
> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
> Yes 'peaks %in% genes' is cute and was probably doing
> the right thing
> for most users (although not all). But 'exons %in%
> genes' is cute too
> and was probably doing the wrong thing for all users.
> Advanced users
> like you guys would have no problem switching to
>
> !is.na <http://is.na>
> <http://is.na>(findOverlaps(__peaks, genes, type="within",
> select="any"))
>
> or
>
> !is.na <http://is.na>
> <http://is.na>(findOverlaps(__peaks, genes, type="equal",
>
> select="any"))
>
> in case 'peaks %in% genes' was not doing exactly what
> you wanted,
> but most users would not find this particularly
> friendly. Even
> worse, some users probably didn't realize that 'peaks
> %in% genes'
> was not doing exactly what they thought it did because
> "peaks in
> genes" in English suggests that the peaks are within
> the genes,
> but it's not what 'peaks %in% genes' does.
>
> Having overlapsAny(), with exactly the same extra
> arguments as
> countOverlaps() and subsetByOverlaps() (i.e. 'maxgap',
> 'minoverlap',
> 'type', 'ignore.strand'), all of them documented (and
> with most
> users more or less familiar with them already) has the
> virtue to
> expose the user to all the options from the very start,
> and to
> help him/her make the right choice. Of course there
> will be users
> that don't want or don't have the time to read/think
> about all the
> options. Not a big deal: they'll just do
> 'overlapsAny(query, subject)',
> which is not a lot more typing than 'query %in%
> subject', especially
> if they use tab completion.
>
> It's true that it's more common to ask questions about
> overlap than
> about equality but there are some use cases for the
> latter (as the
> original thread shows). Until now, when you had such a
> use case, you
> could not use match() or %in%, which would have been
> the natural things
> to use, because they got hijacked to do something else,
> and you were
> left with nothing. Not a satisfying situation. So at a
> minimum, we
> needed to restore the true/real/original semantic of
> match() to do
> "equality" instead of "overlap". But it's hard to do
> this for match()
> and not do it for %in% too. For more than 99% of R
> users, %in% is
> just a simple wrapper for 'match(x, table, nomatch = 0)
> > 0' (this
> is how it has been documented and implemented in base R
> for many
> years). Not maintaining this relationship between %in%
> and match()
> would only cause grief and frustration to newcomers to
> Bioconductor.
>
> H.
>
>
>
> On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>
> Hiya again,
>
> I am definitely a late comer to BioC, so I
> definitely easily
> defer to
> the tide of history.
>
> But I do think you miss my point Michael about the
> proposed change
> making the relationship between %in% and match for
> {G,I}Ranges{List}
> mimic that between other vectors, and I do think
> that changing
> the API
> would make other late-comers take to BioC
> easier/faster.
>
> That said, I NEVER use %in% so I really have no
> stake in the
> matter, and
> I DEFINITELY appreciate the argument to not
> changing the API
> just for
> sematic sweetness.
>
> That that said, Herve is _/so good/_ about
> deprecations and warnings
>
> that make such changes fairly easily digestible.
>
> That that that.... enough.... I bow out of this
> one....!!!!
>
> Always learning and Happy New Year to all lurkers,
>
> ~Malcolm
>
> *From:*Michael Lawrence
> [mailto:lawrence.michael at gene.
> <mailto:lawrence.michael at gene.>____com
>
> <mailto:lawrence.michael at gene.__com
> <mailto:lawrence.michael at gene.com>>]
> *Sent:* Friday, January 04, 2013 5:11 PM
> *To:* Cook, Malcolm
> *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès
> (hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>); Tim
>
>
> Triche, Jr.; Vedran Franke;
> bioconductor at r-project.org <mailto:bioconductor at r-project.org>
> <mailto:bioconductor at r-__project.org
> <mailto:bioconductor at r-project.org>>
> *Subject:* Re: [BioC] countMatches() (was: table
> for GenomicRanges)
>
>
> On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm
> <MEC at stowers.org <mailto:MEC at stowers.org>
> <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>
> <mailto:MEC at stowers.org <mailto:MEC at stowers.org>
> <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>>> wrote:
>
> Hiya,
>
> For what it is worth...
>
> I think the change to %in% is warranted.
>
> If I understand correctly, this change restores the
> relationship
> between
> the semantics of `%in` and the semantics of `match`.
>
> From the docs:
>
> '"%in%" <- function(x, table) match(x, table,
> nomatch = 0) > 0'
>
> Herve's change restores this relationship.
>
>
> match and %in% were initially consistent (both
> considering any
> overlap);
> Herve has changed both of them together. The whole
> idea behind
> IRanges
> is that ranges are special data types with special
> semantics. We
> have
> reimplemented much of the existing R vector API
> using those
> semantics;
> this extends beyond match/%in%. I am hesitant about
> making such
> sweeping
> changes to the API so late in the life-cycle of the
> package.
> There was a
> feature request for a way to count identical ranges
> in a set of
> ranges.
> Let's please not get carried away and start
> redesigning the API
> for this
> one, albeit useful, request. There are all sorts of
> inconsistencies in
> the API, and many of them were conscious decisions
> that considered
> practical use cases.
>
> Michael
>
>
> Herve, I suspect you were you as a result able to
> completely drop
> all the `%in%,BiocClass1,BiocClass2`
> definitions and depend
> upon
> base::%in%
>
> Am I right?
>
> If so, may I suggest that Herve stay the
> course, with the
> addition of
> '"%ol%" <- function(a, b) findOverlaps(a,
> b, maxgap=0L,
> minoverlap=1L, type='any', select='all') > 0'
>
> This would provide a perspicacious idiom, thereby
> optimizing the API
> for Michaels observed common use case.
>
> Just sayin'
>
> ~Malcolm
>
>
> .-----Original Message-----
> .From:
> bioconductor-bounces at r-____project.org
> <mailto:bioconductor-bounces at r-__project.org>
> <mailto:bioconductor-bounces at __r-project.org
> <mailto:bioconductor-bounces at r-project.org>>
> <mailto:bioconductor-bounces@
> <mailto:bioconductor-bounces@>____r-project.org
> <http://r-project.org>
> <mailto:bioconductor-bounces at __r-project.org
> <mailto:bioconductor-bounces at r-project.org>>>
> [mailto:bioconductor-bounces@
> <mailto:bioconductor-bounces@>____r-project.org
> <http://r-project.org>
> <mailto:bioconductor-bounces at __r-project.org
> <mailto:bioconductor-bounces at r-project.org>>
>
> <mailto:bioconductor-bounces@
> <mailto:bioconductor-bounces@>____r-project.org
> <http://r-project.org>
>
> <mailto:bioconductor-bounces at __r-project.org
> <mailto:bioconductor-bounces at r-project.org>>>] On Behalf Of Sean
> Davis
> .Sent: Friday, January 04, 2013 3:37 PM
> .To: Michael Lawrence
> .Cc: Tim Triche, Jr.; Vedran Franke;
> bioconductor at r-project.org
> <mailto:bioconductor at r-project.org>
> <mailto:bioconductor at r-__project.org
> <mailto:bioconductor at r-project.org>>
> <mailto:bioconductor at r-____project.org
> <mailto:bioconductor at r-__project.org>
>
> <mailto:bioconductor at r-__project.org
> <mailto:bioconductor at r-project.org>>>
>
> .Subject: Re: [BioC] countMatches() (was:
> table for
> GenomicRanges)
> .
> .On Fri, Jan 4, 2013 at 4:32 PM, Michael
> Lawrence
> .<lawrence.michael at gene.com
> <mailto:lawrence.michael at gene.com>
> <mailto:lawrence.michael at gene.__com
> <mailto:lawrence.michael at gene.com>>
> <mailto:lawrence.michael at gene.
> <mailto:lawrence.michael at gene.>____com
>
> <mailto:lawrence.michael at gene.__com
> <mailto:lawrence.michael at gene.com>>>> wrote:
> .> The change to the behavior of %in% is a
> pretty big
> one. Are you
> thinking
> .> that all set-based operations should
> behave this way? For
> example, setdiff
> .> and intersect? I really liked the syntax
> of "peaks
> %in% genes".
> In my
> .> experience, it's way more common to ask
> questions
> about overlap
> than about
> .> equality, so I'd rather optimize the API
> for that use
> case. But
> again,
> .> that's just my personal bias.
> .
> .For what it is worth, I share Michael's
> personal bias here.
> .
> .Sean
> .
> .
> .> Michael
> .>
> .>
> .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès
> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>>> wrote:
> .>
> .>> Hi,
> .>>
> .>> I added findMatches() and countMatches()
> to the
> latest IRanges /
> .>> GenomicRanges packages (in BioC devel only).
> .>>
> .>> findMatches(x, table): An enhanced
> version of
> ‘match’ that
> .>> returns all the matches in a
> Hits object.
> .>>
> .>> countMatches(x, table): Returns an
> integer vector
> of the length
> .>> of ‘x’, containing the number
> of matches in
> ‘table’ for
> .>> each element in ‘x’.
> .>>
>
> .>> countMatches() is what you can use to
> tally/count/tabulate
> (choose your
>
> .>> preferred term) the unique elements in a
> GRanges object:
> .>>
> .>> library(GenomicRanges)
> .>> set.seed(33)
> .>> gr <- GRanges("chr1",
> IRanges(sample(15,20,replace=*____*TRUE),
>
> width=5))
> .>>
> .>> Then:
> .>>
> .>> > gr_levels <- sort(unique(gr))
> .>> > countMatches(gr_levels, gr)
> .>> [1] 1 1 1 2 4 2 2 1 2 2 2
> .>>
> .>> Note that findMatches() and
> countMatches() also work on
> IRanges and
> .>> DNAStringSet objects, as well as on
> ordinary atomic
> vectors:
> .>>
> .>> library(hgu95av2probe)
> .>> library(Biostrings)
> .>> probes <- DNAStringSet(hgu95av2probe)
> .>> unique_probes <- unique(probes)
> .>> count <- countMatches(unique_probes,
> probes)
> .>> max(count) # 7
> .>>
> .>> I made other changes in
> IRanges/GenomicRanges so that
> the notion
> .>> of "match" between elements of a
> vector-like object now
> consistently
> .>> means "equality" instead of "overlap",
> even for
> range-based
> objects
> .>> like IRanges or GRanges objects. This
> notion of
> "equality" is the
> .>> same that is used by ==. The most
> visible consequence
> of those
> .>> changes is that using %in% between 2
> IRanges or
> GRanges objects
> .>> 'query' and 'subject' in order to do
> overlaps was
> replaced by
> .>> overlapsAny(query, subject).
> .>>
> .>> overlapsAny(query, subject): Finds the
> ranges in
> ‘query’ that
> .>> overlap any of the ranges in ‘subject’.
> .>>
>
> .>> There are warnings and deprecation
> messages in place
> to help
> smooth
>
> .>> the transition.
> .>>
> .>> Cheers,
> .>> H.
> .>>
> .>> --
> .>> Hervé Pagès
> .>>
> .>> Program in Computational Biology
> .>> Division of Public Health Sciences
> .>> Fred Hutchinson Cancer Research Center
> .>> 1100 Fairview Ave. N, M1-B514
> .>> P.O. Box 19024
> .>> Seattle, WA 98109-1024
> .>>
> .>> E-mail: hpages at fhcrc.org
> <mailto:hpages at fhcrc.org> <mailto:hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>>
>
> .>> Phone: (206) 667-5791
> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
> .>> Fax: (206) 667-1319
> <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
>
> .>>
> .>
> .> [[alternative HTML version deleted]]
> .>
> .>
> .>
> ___________________________________________________
>
> .> Bioconductor mailing list
> .> Bioconductor at r-project.org
> <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> <mailto:Bioconductor at r-____project.org
> <mailto:Bioconductor at r-__project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>>
>
> .>
> https://stat.ethz.ch/mailman/____listinfo/bioconductor
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
> .> Search the archives:
> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
> .
>
> .___________________________________________________
>
> .Bioconductor mailing list
> .Bioconductor at r-project.org
> <mailto:Bioconductor at r-project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>
> <mailto:Bioconductor at r-____project.org
> <mailto:Bioconductor at r-__project.org>
> <mailto:Bioconductor at r-__project.org
> <mailto:Bioconductor at r-project.org>>>
>
>
> .https://stat.ethz.ch/mailman/____listinfo/bioconductor
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor>
>
>
> <https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>>
> .Search the archives:
> http://news.gmane.org/gmane.____science.biology.informatics.____conductor
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor>
>
>
> <http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list