[Bioc-devel] mapping between original and reduced ranges

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Thu Mar 15 13:44:37 CET 2012


So the key question is how much keeping track of where the ranges
come from would slow down the reduce operation.  I am not familiar
enough with the algorithm to know, and given how fast IRanges is in
general, I would rather not guess.
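
To make the question concrete, here is a minimal base-R sketch of a
sort-and-sweep merge that also records which merged range each input
range ended up in. It is only an illustration under stated assumptions:
it works on plain start/end vectors for a single sequence, it is not the
IRanges implementation, and reduce_with_map and its return fields are
hypothetical names. The gap rule follows the documented meaning of
min.gapwidth (ranges separated by at least that many positions are not
merged).

```r
# Sketch only: NOT the IRanges/GenomicRanges code.
# reduce_with_map is a hypothetical helper for illustration.
reduce_with_map <- function(start, end, min.gapwidth = 1L) {
  n <- length(start)
  if (n == 0L)
    return(list(start = start, end = end, group = integer(0)))
  ord <- order(start, end)        # sort by start; this is the O(n log n) part
  s <- start[ord]
  e <- end[ord]
  run_end <- cummax(e)            # furthest end seen so far, in sorted order
  # A new merged range begins where the gap to everything seen so far
  # is at least min.gapwidth positions wide.
  new_run <- c(TRUE, s[-1L] - run_end[-n] - 1L >= min.gapwidth)
  grp <- cumsum(new_run)          # merged-range index, in sorted order
  last <- c(which(new_run)[-1L] - 1L, n)   # last input of each merged range
  group <- integer(n)
  group[ord] <- grp               # the mapping, back in input order
  list(start = s[new_run], end = run_end[last], group = group)
}

res <- reduce_with_map(start = c(1, 6, 12, 24, 27),
                       end   = c(5, 10, 16, 28, 31))
split(seq_along(res$group), res$group)   # input indices per merged range
```

With the ranges from Florian's example further down (starts 1, 6, 12,
24, 27, width 5) this gives merged ranges 1-10, 12-16 and 24-31, and
group is 1, 1, 2, 3, 3. In this sketch the extra bookkeeping is one
cumsum and one permutation, both linear, so the sort dominates. Whether
the same holds inside the real reduce is exactly the open question.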

I agree with Florian that this is a very typical use case.

Kasper

On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian
<florian.hahne at novartis.com> wrote:
> Hi all,
> It is true that this is not terribly slow, even when you deal with fairly
> large range objects:
>
> foo <- GRanges(seqnames=sample(1:4, 1e6, TRUE),
> ranges=IRanges(start=as.integer(runif(min=1, max=1e7, n=1e6)), width=50))
> system.time(bar <- reduce(foo))
>   user  system elapsed
>  0.918   0.174   1.091
>
> system.time(foobar <- findOverlaps(foo, bar))
>   user  system elapsed
>  2.051   0.402   2.453
>
>
> However the whole process does take about 3x the time of just the reduce
> operation, and in my use case I want this to happen interactively, where
> waiting 3 seconds compared to 1 makes a huge difference...
>
> I wouldn't push this high up on the development agenda, but it seems to be
> something that is already 95% there and could easily be added. But
> maybe I am wrong...
>
> Florian
>
> Florian Hahne
> Novartis Institute For Biomedical Research
> Translational Sciences / Preclinical Safety / PCS Informatics
> Expert Data Integration and Modeling Bioinformatics
> CHBS, WKL-135.2.26
> Novartis Institute For Biomedical Research, Werk Klybeck
> Klybeckstrasse 141
> CH-4057 Basel
> Switzerland
> Phone: +41 61 6967127
> Email : florian.hahne at novartis.com
>
> On 3/14/12 9:40 PM, "Kasper Daniel Hansen" <kasperdanielhansen at gmail.com>
> wrote:
>
>>We have discussed this a couple of times.  I routinely use the reduce
>>followed by findOverlaps paradigm.  As Malcolm says, it feels wrong,
>>but from a practical point of view it is pretty fast, so I stopped
>>worrying about it.  I think there is only a reason to do this if it
>>is substantially faster.
>>
>>Kasper
>>
>>On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm <MEC at stowers.org> wrote:
>>> Chiming in....
>>>
>>> on a similar note....
>>>
>>> A version of `disjoin` which returns a Hits/RangesMapping in addition to
>>> the GRanges result would be most useful and would probably not require
>>> much additional effort (assuming `disjoin` computes this internally).
>>>
>>> Of course, it is easy to live without, since I can just perform the
>>> findOverlaps myself after the disjoin.... it just "feels wrong" (tm)
>>>
>>> Ahoy!
>>>
>>> ~Malcolm
>>>
>>>
>>>> -----Original Message-----
>>>> From: bioc-devel-bounces at r-project.org
>>>> [mailto:bioc-devel-bounces at r-project.org] On Behalf Of Hahne, Florian
>>>> Sent: Wednesday, March 14, 2012 2:22 PM
>>>> To: bioc-devel at r-project.org
>>>> Subject: [Bioc-devel] mapping between original and reduced ranges
>>>>
>>>> This bounced before; I guess the mailing list does not like HTML mails.
>>>> So one more try:
>>>>
>>>> I had the following offline discussion with Michael about how one could
>>>> retain a mapping of the ranges in a GRanges object before and after
>>>> reduce. He suggested taking it to the list. Is that something that
>>>> could be added to GenomicRanges/IRanges?
>>>> Florian
>>>>
>>>> I have a slightly tricky application for which I need to reduce a
>>>> GRanges object, but I would like to be able to process some of the
>>>> original elementMetadata of the merged ranges later. The only way I was
>>>> able to figure out which of the original ranges correspond to the merged
>>>> ranges was to perform a findOverlaps operation, but of course that is
>>>> rather costly. Is there a way to get the merge information out of the
>>>> original reduce call?
>>>> Here is a brief example:
>>>>
>>>> gr <- GRanges(seqnames="chr1", ranges=IRanges(start=c(1,6,12,24,27),
>>>> width=5), foo=1:5, bar=letters[1:5])
>>>> gr2 <- reduce(gr, min.gapwidth=1)
>>>> ind <- queryHits(findOverlaps(gr2, gr))
>>>> split(values(gr), ind)
>>>>
>>>>
>>>> Unfortunately, this is the idiom. I could see an improvement where
>>>> reduce or a similarly named function would return a Hits object (in
>>>> addition to the actual reduce result) that would indicate the mapping
>>>> between the input and reduced ranges. The RangesMapping structure would
>>>> be really close to what we would need.
>>>>
>>>> Michael
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>


