[BioC] table for GenomicRanges
Hervé Pagès
hpages at fhcrc.org
Thu Dec 6 21:20:23 CET 2012
Thanks Tim for the feedback.
I spent a little more time thinking about this and then realized that
it was related to some other feature I've been thinking of a few months
ago, then I forgot about. This other feature extends the capability
of match() by finding *all* the matches (not just the first like
match()). Surprisingly this is not in base R (but maybe it is, I'm
just not aware of it). So I'd like to add the 2 following functions
to IRanges:
findMatches(query, subject)
countMatches(query, subject)
The parallel with findOverlaps/countOverlaps is obvious except that by
"match" here we mean "exact match", so the concept is valid for any
vector-like query and subject with elements that can be compared (i.e.
query[i] == subject[j] is defined).
Like findOverlaps(), findMatches() will return a Hits object.
No 'select' arg for now but we could always add it later and
findOverlaps(. , select="first") would be equivalent to match(.)
I already have initial implementations for those 2 functions (with
room for some performance improvements).
I mention this here because once we have countMatches(), implementing
the count/tally feature becomes very simple. It's just:
x_levels <- sort(unique(x))
mcols(x_levels)$count <- countMatches(x_levels, x)
Now it's so simple that I wonder if it's still worth providing a
function for it. By using countMatches() the user can choose where
to put the vector of counts, and how to name the metadata col in
case s/he wants to stick it in 'x_levels'.
I'll start by adding findMatches()/countMatches(). I think they are
valuable per-se.
Cheers,
H.
On 12/05/2012 07:36 PM, Tim Triche, Jr. wrote:
> Now that I am rereading what I wrote, I am wondering whether a column
> labeled $count for a given combination of (seqname, start, end, strand)
> is more immediately obvious to a user, i.e. "there are mcols(GR)$count
> ranges that match this exact description in this here GR".
>
> Perhaps count() really is the way to go, especially from the standpoint
> of making it intuitive. There are 74465 segments exactly matching the
> state "WeakEnhancer" in the HMM that I summarized (don't bother looking
> for it in ENCODE or Roadmap data, it is a bespoke model with additional
> marks). Similarly, a "table" of unique reads placements that share a
> common range would summarize them by how many times we saw each. And
> then the user would see counts accordingly.
>
> -> I respectfully withdraw my proposal regarding $tally and now feel
> that $count is closer to what people would expect. Does that make sense?
>
> Example:
>
> mimatGR <- subsetByOverlaps(readsGR, tx.mimat)
> count(mimatGR) # table of appearances of unique reads, i.e., a library
> complexity surrogate
>
> or more interestingly, let's say there's an ncRNA and a coding RNA
> emerging from a bidirectional promoter. Then you could have
>
> maybeNcRNA <- reduce(c(tx.ncRNA, tx.mRNA)) # forward and reverse, maybe
> lots of exons, perhaps even a little trans splicing
> bidirectionalGR <- subsetByOverlaps(readsGR, maybeNcRNA) # what's really
> going on here? count() acts as a spot check
> count(bidirectionalGR)
>
> the latter, especially with a strand-specific RNAseq protocol, could
> tell you about some really interesting biology if conducted along a "theme".
>
> IMHO of course, since I'm not the one implementing it, but more thinking
> about the use cases I encounter.
>
> YMMV and I hope other people will point out the many ways in which I am
> heavily constipated. :-)
>
> Thanks Herve for your ninja programming skills, you code faster than I
> can think (at least clearly).
>
> --t
>
>
>
> On Wed, Dec 5, 2012 at 7:11 PM, Tim Triche, Jr. <tim.triche at gmail.com
> <mailto:tim.triche at gmail.com>> wrote:
>
> maybe (.)$count for tallying overlaps, and (.$)tally for counting
> equalities, by default?
>
> ha ha only serious. if the default is to countOverlaps then $count
> is a handy shortcut. if you want to tally up matching ranges $tally
> is more evocative
>
>
>
> On Wed, Dec 5, 2012 at 7:06 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Hi Tim,
>
>
> On 12/05/2012 04:55 PM, Tim Triche, Jr. wrote:
>
> that's cool, then it's consistent with score() and
> friends... this
> sounds like a grand scheme
>
>
> It's not clear to me if you are voting for tally() + mcols(.)$tally
> or for count(.) + mcols()$count?
>
> Note that this functionality doesn't have to be restricted to
> GenomicRanges and can be extended to any Vector object 'x' for
> which unique(), sort() and match() work and "do the right thing".
> Then the implementation is simply:
>
>
> ans <- sort(unique(x))
> names(ans) <- mcols(ans) <- NULL
> y <- match(x, ans)
> mcols(ans)$count <- tabulate(y, nbins=length(ans))
> ans
>
> Unfortunately right now match() on GenomicRanges and Ranges objects
> reports a match in case of *overlap* instead of *equality* (this is
> why I needed to use IRanges:::matchIntegerQuads in the
> implementation
> of tableGenomicRanges2() I showed previously). Just a heads-up that
> I'd like to change this but with a transition plan e.g. with an
> extra
> argument to the match() generic for letting the user control what
> "match" means (default would be "overlaps" for now but should
> probably
> be changed to "equality" in the future).
>
> Cheers,
> H.
>
>
>
>
> On Wed, Dec 5, 2012 at 4:31 PM, Hervé Pagès
> <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>
> So with tally, what would be the name of the metadata
> col? Would "tally"
> be OK?
>
> It's kind of neat to use the same name for the metadata
> col as for the
> function itself. That makes the code more readable:
>
> mcols(count(gr))$count
>
> Thanks for the feedback,
> H.
>
>
>
> On 12/05/2012 03:05 PM, Tim Triche, Jr. wrote:
>
> that goes along nicely with BamTally in gmapR,
> which is a damn
> useful
> function IMHO... actually I think I wrote a tally()
> function or
> something like it for Rsubread, can't remember
> whether I sent it
> in with
> the patch though. Anyways, a running tally is a
> good mental
> image for
> this functionality
>
>
>
>
> On Wed, Dec 5, 2012 at 2:59 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.____com
> <mailto:mailinglist.honeypot at __gmail.com
> <mailto:mailinglist.honeypot at gmail.com>>
> <mailto:mailinglist.honeypot@
> <mailto:mailinglist.honeypot@>____gmail.com <http://gmail.com>
>
> <mailto:mailinglist.honeypot at __gmail.com
> <mailto:mailinglist.honeypot at gmail.com>>>>
>
> wrote:
>
> Hi,
>
> On Wed, Dec 5, 2012 at 5:50 PM, Michael Lawrence
> <lawrence.michael at gene.com
> <mailto:lawrence.michael at gene.com>
> <mailto:lawrence.michael at gene.__com
> <mailto:lawrence.michael at gene.com>>
> <mailto:lawrence.michael at gene.
> <mailto:lawrence.michael at gene.>____com
>
> <mailto:lawrence.michael at gene.__com
> <mailto:lawrence.michael at gene.com>>>> wrote:
> > The question is whether there is ever a use
> case to have
> a simple
> table.
> > This is analogous to base R's table and
> data.frame. For
> example,
> if you
> > call xtabs(), you get a table, then you
> have to call
> as.data.frame to get
> > back into the data.frame context. This is
> sort of clean,
> and we could
> > create an extension of table that for
> efficiency stores the
> associated
> > GRanges along with the counts in the .Data
> slot. Then as(x,
> "GRanges") on
> > that would generate the GRanges with the
> count column.
> That would be
> > complicated though.
> >
> > Another issue is that table() cannot be
> used in the
> general way,
> due to
> > restrictions on dispatch with "...".
> >
> > So I think I'm in favor of a new "count"
> generic. That
> naming is
> consistent
> > with countOverlaps, countSubjects,
> countQueries, etc.
>
> Or maybe `tally`? Somehow I have a mental
> association w/
> that being
> closer to `tabulate`, but I guess it's really
> not and maybe
> it's just
> me that has a mind map that puts `tally`
> closer to `table` than
> `count` is .
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info:
> http://cbio.mskcc.org/~lianos/____contact
> <http://cbio.mskcc.org/~lianos/__contact>
>
> <http://cbio.mskcc.org/~__lianos/contact
> <http://cbio.mskcc.org/~lianos/contact>>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
>
> <http://cancerres.__aacrjourna__ls.org/content/31/9/__1173.__full.pdf
> <http://aacrjournals.org/content/31/9/__1173.full.pdf>
> <http://cancerres.__aacrjournals.org/content/31/9/__1173.full.pdf
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>>>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
> <tel:%28206%29%20667-1319>
>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
> <http://cancerres.__aacrjournals.org/content/31/9/__1173.full.pdf
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
>
>
>
> --
> /A model is a lie that helps you see the truth./
> /
> /
> Howard Skipper
> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list