[BioC] table for GenomicRanges
Hervé Pagès
hpages at fhcrc.org
Thu Dec 6 21:23:06 CET 2012
On 12/06/2012 12:20 PM, Hervé Pagès wrote:
> Thanks Tim for the feedback.
>
> I spent a little more time thinking about this and then realized that
> it was related to some other feature I've been thinking of a few months
> ago, then I forgot about. This other feature extends the capability
> of match() by finding *all* the matches (not just the first like
> match()). Surprisingly this is not in base R (but maybe it is, I'm
> just not aware of it). So I'd like to add the 2 following functions
> to IRanges:
>
> findMatches(query, subject)
> countMatches(query, subject)
>
> The parallel with findOverlaps/countOverlaps is obvious except that by
> "match" here we mean "exact match", so the concept is valid for any
> vector-like query and subject with elements that can be compared (i.e.
> query[i] == subject[j] is defined).
>
> Like findOverlaps(), findMatches() will return a Hits object.
> No 'select' arg for now but we could always add it later and
> findOverlaps(. , select="first") would be equivalent to match(.)
^^^^^^^^^^^
I meant findMatches here, sorry.
H.
>
> I already have initial implementations for those 2 functions (with
> room for some performance improvements).
>
> I mention this here because once we have countMatches(), implementing
> the count/tally feature becomes very simple. It's just:
>
> x_levels <- sort(unique(x))
> mcols(x_levels)$count <- countMatches(x_levels, x)
>
> Now it's so simple that I wonder if it's still worth providing a
> function for it. By using countMatches() the user can choose where
> to put the vector of counts, and how to name the metadata col in
> case s/he wants to stick it in 'x_levels'.
>
> I'll start by adding findMatches()/countMatches(). I think they are
> valuable per-se.
>
> Cheers,
> H.
>
>
> On 12/05/2012 07:36 PM, Tim Triche, Jr. wrote:
>> Now that I am rereading what I wrote, I am wondering whether a column
>> labeled $count for a given combination of (seqname, start, end, strand)
>> is more immediately obvious to a user, i.e. "there are mcols(GR)$count
>> ranges that match this exact description in this here GR".
>>
>> Perhaps count() really is the way to go, especially from the standpoint
>> of making it intuitive. There are 74465 segments exactly matching the
>> state "WeakEnhancer" in the HMM that I summarized (don't bother looking
>> for it in ENCODE or Roadmap data, it is a bespoke model with additional
>> marks). Similarly, a "table" of unique reads placements that share a
>> common range would summarize them by how many times we saw each. And
>> then the user would see counts accordingly.
>>
>> -> I respectfully withdraw my proposal regarding $tally and now feel
>> that $count is closer to what people would expect. Does that make sense?
>>
>> Example:
>>
>> mimatGR <- subsetByOverlaps(readsGR, tx.mimat)
>> count(mimatGR) # table of appearances of unique reads, i.e., a library
>> complexity surrogate
>>
>> or more interestingly, let's say there's an ncRNA and a coding RNA
>> emerging from a bidirectional promoter. Then you could have
>>
>> maybeNcRNA <- reduce(c(tx.ncRNA, tx.mRNA)) # forward and reverse, maybe
>> lots of exons, perhaps even a little trans splicing
>> bidirectionalGR <- subsetByOverlaps(readsGR, maybeNcRNA) # what's really
>> going on here? count() acts as a spot check
>> count(bidirectionalGR)
>>
>> the latter, especially with a strand-specific RNAseq protocol, could
>> tell you about some really interesting biology if conducted along a
>> "theme".
>>
>> IMHO of course, since I'm not the one implementing it, but more thinking
>> about the use cases I encounter.
>>
>> YMMV and I hope other people will point out the many ways in which I am
>> heavily constipated. :-)
>>
>> Thanks Herve for your ninja programming skills, you code faster than I
>> can think (at least clearly).
>>
>> --t
>>
>>
>>
>> On Wed, Dec 5, 2012 at 7:11 PM, Tim Triche, Jr. <tim.triche at gmail.com
>> <mailto:tim.triche at gmail.com>> wrote:
>>
>> maybe (.)$count for tallying overlaps, and (.$)tally for counting
>> equalities, by default?
>>
>> ha ha only serious. if the default is to countOverlaps then $count
>> is a handy shortcut. if you want to tally up matching ranges $tally
>> is more evocative
>>
>>
>>
>> On Wed, Dec 5, 2012 at 7:06 PM, Hervé Pagès <hpages at fhcrc.org
>> <mailto:hpages at fhcrc.org>> wrote:
>>
>> Hi Tim,
>>
>>
>> On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote:
>>
>> that's cool, then it's consistent with score() and
>> friends... this
>> sounds like a grand scheme
>>
>>
>> It's not clear to me if you are voting for tally() +
>> mcols(.)$tally
>> or for count(.) + mcols()$count?
>>
>> Note that this functionality doesn't have to be restricted to
>> GenomicRanges and can be extended to any Vector object 'x' for
>> which unique(), sort() and match() work and "do the right thing".
>> Then the implementation is simply:
>>
>>
>> ans <- sort(unique(x))
>> names(ans) <- mcols(ans) <- NULL
>> y <- match(x, ans)
>> mcols(ans)$count <- tabulate(y, nbins=length(ans))
>> ans
>>
>> Unfortunately right now match() on GenomicRanges and Ranges
>> objects
>> reports a match in case of *overlap* instead of *equality*
>> (this is
>> why I needed to use IRanges:::matchIntegerQuads in the
>> implementation
>> of tableGenomicRanges2() I showed previously). Just a heads-up
>> that
>> I'd like to change this but with a transition plan e.g. with an
>> extra
>> argument to the match() generic for letting the user control what
>> "match" means (default would be "overlaps" for now but should
>> probably
>> be changed to "equality" in the future).
>>
>> Cheers,
>> H.
>>
>>
>>
>>
>> On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès
>> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>>
>> So with tally, what would be the name of the metadata
>> col? Would "tally"
>> be OK?
>>
>> It's kind of neat to use the same name for the metadata
>> col as for the
>> function itself. That makes the code more readable:
>>
>> mcols(count(gr))$count
>>
>> Thanks for the feedback,
>> H.
>>
>>
>>
>> On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote:
>>
>> that goes along nicely with BamTally in gmapR,
>> which is a damn
>> useful
>> function IMHO... actually I think I wrote a tally()
>> function or
>> something like it for Rsubread, can't remember
>> whether I sent it
>> in with
>> the patch though. Anyways, a running tally is a
>> good mental
>> image for
>> this functionality
>>
>>
>>
>>
>> On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.____com
>> <mailto:mailinglist.honeypot at __gmail.com
>> <mailto:mailinglist.honeypot at gmail.com>>
>> <mailto:mailinglist.honeypot@
>> <mailto:mailinglist.honeypot@>____gmail.com
>> <http://gmail.com>
>>
>> <mailto:mailinglist.honeypot at __gmail.com
>> <mailto:mailinglist.honeypot at gmail.com>>>>
>>
>> wrote:
>>
>> Hi,
>>
>> On Wed, Dec 5, 2012 at 5:50 PM, Michael
>> Lawrence
>> <lawrence.michael at gene.com
>> <mailto:lawrence.michael at gene.com>
>> <mailto:lawrence.michael at gene.__com
>> <mailto:lawrence.michael at gene.com>>
>> <mailto:lawrence.michael at gene.
>> <mailto:lawrence.michael at gene.>____com
>>
>> <mailto:lawrence.michael at gene.__com
>> <mailto:lawrence.michael at gene.com>>>> wrote:
>> > The question is whether there is ever a use
>> case to have
>> a simple
>> table.
>> > This is analogous to base R's table and
>> data.frame. For
>> example,
>> if you
>> > call xtabs(), you get a table, then you
>> have to call
>> as.data.frame to get
>> > back into the data.frame context. This is
>> sort of clean,
>> and we could
>> > create an extension of table that for
>> efficiency stores the
>> associated
>> > GRanges along with the counts in the .Data
>> slot. Then as(x,
>> "GRanges") on
>> > that would generate the GRanges with the
>> count column.
>> That would be
>> > complicated though.
>> >
>> > Another issue is that table() cannot be
>> used in the
>> general way,
>> due to
>> > restrictions on dispatch with "...".
>> >
>> > So I think I'm in favor of a new "count"
>> generic. That
>> naming is
>> consistent
>> > with countOverlaps, countSubjects,
>> countQueries, etc.
>>
>> Or maybe `tally`? Somehow I have a mental
>> association w/
>> that being
>> closer to `tabulate`, but I guess it's really
>> not and maybe
>> it's just
>> me that has a mind map that puts `tally`
>> closer to `table` than
>> `count` is .
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info:
>> http://cbio.mskcc.org/~lianos/____contact
>> <http://cbio.mskcc.org/~lianos/__contact>
>>
>> <http://cbio.mskcc.org/~__lianos/contact
>> <http://cbio.mskcc.org/~lianos/contact>>
>>
>>
>>
>>
>> --
>> /A model is a lie that helps you see the truth./
>> /
>> /
>> Howard Skipper
>>
>>
>> <http://cancerres.__aacrjourna__ls.org/content/31/9/__1173.__full.pdf
>> <http://aacrjournals.org/content/31/9/__1173.full.pdf>
>>
>> <http://cancerres.__aacrjournals.org/content/31/9/__1173.full.pdf
>>
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>>>
>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> <tel:%28206%29%20667-5791>
>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>> <tel:%28206%29%20667-1319>
>>
>>
>>
>>
>>
>> --
>> /A model is a lie that helps you see the truth./
>> /
>> /
>> Howard Skipper
>>
>> <http://cancerres.__aacrjournals.org/content/31/9/__1173.full.pdf
>>
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>>
>>
>> --
>> /A model is a lie that helps you see the truth./
>> /
>> /
>> Howard Skipper
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>>
>>
>>
>> --
>> /A model is a lie that helps you see the truth./
>> /
>> /
>> Howard Skipper
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list