[BioC] table for GenomicRanges

Hervé Pagès hpages at fhcrc.org
Thu Dec 6 21:23:06 CET 2012



On 12/06/2012 12:20 PM, Hervé Pagès wrote:
> Thanks Tim for the feedback.
>
> I spent a little more time thinking about this and then realized that
> it was related to some other feature I've been thinking of a few months
> ago, then I forgot about. This other feature extends the capability
> of match() by finding *all* the matches (not just the first like
> match()). Surprisingly this is not in base R (but maybe it is, I'm
> just not aware of it). So I'd like to add the 2 following functions
> to IRanges:
>
>    findMatches(query, subject)
>    countMatches(query, subject)
>
> The parallel with findOverlaps/countOverlaps is obvious except that by
> "match" here we mean "exact match", so the concept is valid for any
> vector-like query and subject with elements that can be compared (i.e.
> query[i] == subject[j] is defined).
>
> Like findOverlaps(), findMatches() will return a Hits object.
> No 'select' arg for now but we could always add it later and
> findOverlaps(. , select="first") would be equivalent to match(.)
   ^^^^^^^^^^^
I meant findMatches here, sorry.

H.

>
> I already have initial implementations for those 2 functions (with
> room for some performance improvements).
>
> I mention this here because once we have countMatches(), implementing
> the count/tally feature becomes very simple. It's just:
>
>    x_levels <- sort(unique(x))
>    mcols(x_levels)$count <- countMatches(x_levels, x)
>
> Now it's so simple that I wonder if it's still worth providing a
> function for it. By using countMatches() the user can choose where
> to put the vector of counts, and how to name the metadata col in
> case s/he wants to stick it in 'x_levels'.
>
> I'll start by adding findMatches()/countMatches(). I think they are
> valuable per-se.
>
> Cheers,
> H.
>
>
> On 12/05/2012 07:36 PM, Tim Triche, Jr. wrote:
>> Now that I am rereading what I wrote, I am wondering whether a column
>> labeled $count for a given combination of (seqname, start, end, strand)
>> is more immediately obvious to a user, i.e. "there are mcols(GR)$count
>> ranges that match this exact description in this here GR".
>>
>> Perhaps count() really is the way to go, especially from the standpoint
>> of making it intuitive.  There are 74465 segments exactly matching the
>> state "WeakEnhancer" in the HMM that I summarized (don't bother looking
>> for it in ENCODE or Roadmap data, it is a bespoke model with additional
>> marks).  Similarly, a "table" of unique reads placements that share a
>> common range would summarize them by how many times we saw each.  And
>> then the user would see counts accordingly.
>>
>> ->  I respectfully withdraw my proposal regarding $tally and now feel
>> that $count is closer to what people would expect.  Does that make sense?
>>
>> Example:
>>
>> mimatGR <- subsetByOverlaps(readsGR, tx.mimat)
>> count(mimatGR)  # table of appearances of unique reads, i.e., a library
>> complexity surrogate
>>
>> or more interestingly, let's say there's an ncRNA and a coding RNA
>> emerging from a bidirectional promoter.  Then you could have
>>
>> maybeNcRNA <- reduce(c(tx.ncRNA, tx.mRNA)) # forward and reverse, maybe
>> lots of exons, perhaps even a little trans splicing
>> bidirectionalGR <- subsetByOverlaps(readsGR, maybeNcRNA) # what's really
>> going on here?  count() acts as a spot check
>> count(bidirectionalGR)
>>
>> the latter, especially with a strand-specific RNAseq protocol, could
>> tell you about some really interesting biology if conducted along a
>> "theme".
>>
>> IMHO of course, since I'm not the one implementing it, but more thinking
>> about the use cases I encounter.
>>
>> YMMV and I hope other people will point out the many ways in which I am
>> heavily constipated.  :-)
>>
>> Thanks Herve for your ninja programming skills, you code faster than I
>> can think (at least clearly).
>>
>> --t
>>
>>
>>
>> On Wed, Dec 5, 2012 at 7:11 PM, Tim Triche, Jr. <tim.triche at gmail.com
>> <mailto:tim.triche at gmail.com>> wrote:
>>
>>     maybe (.)$count for tallying overlaps, and (.$)tally for counting
>>     equalities, by default?
>>
>>     ha ha only serious.  if the default is to countOverlaps then $count
>>     is a handy shortcut.  if you want to tally up matching ranges $tally
>>     is more evocative
>>
>>
>>
>>     On Wed, Dec 5, 2012 at 7:06 PM, Hervé Pagès <hpages at fhcrc.org
>>     <mailto:hpages at fhcrc.org>> wrote:
>>
>>         Hi Tim,
>>
>>
>>         On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote:
>>
>>             that's cool, then it's consistent with score() and
>>             friends... this
>>             sounds like a grand scheme
>>
>>
>>         It's not clear to me if you are voting for tally() +
>> mcols(.)$tally
>>         or for count(.) + mcols()$count?
>>
>>         Note that this functionality doesn't have to be restricted to
>>         GenomicRanges and can be extended to any Vector object 'x' for
>>         which unique(), sort() and match() work and "do the right thing".
>>         Then the implementation is simply:
>>
>>
>>              ans <- sort(unique(x))
>>              names(ans) <- mcols(ans) <- NULL
>>              y <- match(x, ans)
>>              mcols(ans)$count <- tabulate(y, nbins=length(ans))
>>              ans
>>
>>         Unfortunately right now match() on GenomicRanges and Ranges
>> objects
>>         reports a match in case of *overlap* instead of *equality*
>> (this is
>>         why I needed to use IRanges:::matchIntegerQuads in the
>>         implementation
>>         of tableGenomicRanges2() I showed previously). Just a heads-up
>> that
>>         I'd like to change this but with a transition plan e.g. with an
>>         extra
>>         argument to the match() generic for letting the user control what
>>         "match" means (default would be "overlaps" for now but should
>>         probably
>>         be changed to "equality" in the future).
>>
>>         Cheers,
>>         H.
>>
>>
>>
>>
>>             On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès
>>             <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>>
>>                  So with tally, what would be the name of the metadata
>>             col? Would "tally"
>>                  be OK?
>>
>>                  It's kind of neat to use the same name for the metadata
>>             col as for the
>>                  function itself. That makes the code more readable:
>>
>>                     mcols(count(gr))$count
>>
>>                  Thanks for the feedback,
>>                  H.
>>
>>
>>
>>                  On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote:
>>
>>                      that goes along nicely with BamTally in gmapR,
>>             which is a damn
>>                      useful
>>                      function IMHO... actually I think I wrote a tally()
>>             function or
>>                      something like it for Rsubread, can't remember
>>             whether I sent it
>>                      in with
>>                      the patch though.  Anyways, a running tally is a
>>             good mental
>>                      image for
>>                      this functionality
>>
>>
>>
>>
>>                      On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou
>>                      <mailinglist.honeypot at gmail.____com
>>                      <mailto:mailinglist.honeypot at __gmail.com
>>             <mailto:mailinglist.honeypot at gmail.com>>
>>                      <mailto:mailinglist.honeypot@
>>             <mailto:mailinglist.honeypot@>____gmail.com
>> <http://gmail.com>
>>
>>                      <mailto:mailinglist.honeypot at __gmail.com
>>             <mailto:mailinglist.honeypot at gmail.com>>>>
>>
>>                      wrote:
>>
>>                           Hi,
>>
>>                           On Wed, Dec 5, 2012 at 5:50 PM, Michael
>> Lawrence
>>                           <lawrence.michael at gene.com
>>             <mailto:lawrence.michael at gene.com>
>>                      <mailto:lawrence.michael at gene.__com
>>             <mailto:lawrence.michael at gene.com>>
>>                      <mailto:lawrence.michael at gene.
>>             <mailto:lawrence.michael at gene.>____com
>>
>>                      <mailto:lawrence.michael at gene.__com
>>             <mailto:lawrence.michael at gene.com>>>> wrote:
>>                            > The question is whether there is ever a use
>>             case to have
>>                      a simple
>>                           table.
>>                            > This is analogous to base R's table and
>>             data.frame. For
>>                      example,
>>                           if you
>>                            > call xtabs(), you get a table, then you
>>             have to call
>>                           as.data.frame to get
>>                            > back into the data.frame context. This is
>>             sort of clean,
>>                      and we could
>>                            > create an extension of table that for
>>             efficiency stores the
>>                           associated
>>                            > GRanges along with the counts in the .Data
>>             slot. Then as(x,
>>                           "GRanges") on
>>                            > that would generate the GRanges with the
>>             count column.
>>                      That would be
>>                            > complicated though.
>>                            >
>>                            > Another issue is that table() cannot be
>>             used in the
>>                      general way,
>>                           due to
>>                            > restrictions on dispatch with "...".
>>                            >
>>                            > So I think I'm in favor of a new "count"
>>             generic. That
>>                      naming is
>>                           consistent
>>                            > with countOverlaps, countSubjects,
>>             countQueries, etc.
>>
>>                           Or maybe `tally`? Somehow I have a mental
>>             association w/
>>                      that being
>>                           closer to `tabulate`, but I guess it's really
>>             not and maybe
>>                      it's just
>>                           me that has a mind map that puts `tally`
>>             closer to `table` than
>>                           `count` is .
>>
>>                           --
>>                           Steve Lianoglou
>>                           Graduate Student: Computational Systems Biology
>>                             | Memorial Sloan-Kettering Cancer Center
>>                             | Weill Medical College of Cornell University
>>                           Contact Info:
>>             http://cbio.mskcc.org/~lianos/____contact
>>             <http://cbio.mskcc.org/~lianos/__contact>
>>
>>                      <http://cbio.mskcc.org/~__lianos/contact
>>             <http://cbio.mskcc.org/~lianos/contact>>
>>
>>
>>
>>
>>                      --
>>                      /A model is a lie that helps you see the truth./
>>                      /
>>                      /
>>                      Howard Skipper
>>
>>
>> <http://cancerres.__aacrjourna__ls.org/content/31/9/__1173.__full.pdf
>>             <http://aacrjournals.org/content/31/9/__1173.full.pdf>
>>
>> <http://cancerres.__aacrjournals.org/content/31/9/__1173.full.pdf
>>
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>>>
>>
>>
>>
>>                  --
>>                  Hervé Pagès
>>
>>                  Program in Computational Biology
>>                  Division of Public Health Sciences
>>                  Fred Hutchinson Cancer Research Center
>>                  1100 Fairview Ave. N, M1-B514
>>                  P.O. Box 19024
>>                  Seattle, WA 98109-1024
>>
>>                  E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>             <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>>                  Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>             <tel:%28206%29%20667-5791>
>>                  Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>             <tel:%28206%29%20667-1319>
>>
>>
>>
>>
>>
>>             --
>>             /A model is a lie that helps you see the truth./
>>             /
>>             /
>>             Howard Skipper
>>
>> <http://cancerres.__aacrjournals.org/content/31/9/__1173.full.pdf
>>
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>>
>>
>>
>>         --
>>         Hervé Pagès
>>
>>         Program in Computational Biology
>>         Division of Public Health Sciences
>>         Fred Hutchinson Cancer Research Center
>>         1100 Fairview Ave. N, M1-B514
>>         P.O. Box 19024
>>         Seattle, WA 98109-1024
>>
>>         E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>         Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>         Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>>
>>
>>     --
>>     /A model is a lie that helps you see the truth./
>>     /
>>     /
>>     Howard Skipper
>>     <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>>
>>
>>
>> --
>> /A model is a lie that helps you see the truth./
>> /
>> /
>> Howard Skipper
>> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list