[Bioc-devel] mapping between original and reduced ranges

Hervé Pagès hpages at fhcrc.org
Thu Mar 15 19:55:25 CET 2012


Hi reducers,

I agree it "feels wrong" to use findOverlaps() to extract the mapping
from original to reduced ranges. Even if it can be computed very easily
with:

   findOverlaps(gr, reduce(gr), select="first")

(Note that using 'queryHits(findOverlaps(reduce(gr), gr))' only produces
the correct result if 'gr' is already sorted by increasing order.)

I think it would be easy for reduce() internal code to produce this
mapping. The question is: how do we give it back to the user?

Is it OK to use an attribute for this? reduce() already uses this
for returning some extra information about the reduction:

   > ir
   IRanges of length 5
       start end width
   [1]     1   5     5
   [2]     6  10     5
   [3]    12  16     5
   [4]    24  28     5
   [5]    27  31     5
   > ir2 <- reduce(ir, with.inframe.attrib=TRUE)
   > ir2
   IRanges of length 3
       start end width
   [1]     1  10    10
   [2]    12  16     5
   [3]    24  31     8
   > attr(ir2, "inframe")
   IRanges of length 5
       start end width
   [1]     1   5     5
   [2]     6  10     5
   [3]    11  15     5
   [4]    16  20     5
   [5]    19  23     5

We could to the same thing for the mapping from original to reduced
ranges with e.g. an argument called 'with.mapping.attrib'.
Would that work?

Cheers,
H.


On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
> So the key question is to what extent keeping track of where the
> ranges comes from would slow down the reduce operation.  I am not
> familiar enough with the algorithm to know this, but given how fast
> IRanges is in general, I am not one for guessing on this.
>
> I agree with Florian that this is a very typical use case.
>
> Kasper
>
> On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian
> <florian.hahne at novartis.com>  wrote:
>> Hi all,
>> It is true that this is not terribly slow when you deal with fairly large
>> range objects:
>>
>> foo<- GRanges(seqnames=sample(1:4, 1e6, TRUE),
>> ranges=IRanges(start=as.integer(runif(min=1, max=1e7, n=1e6)), width=50))
>> system.time(bar<- reduce(foo))
>>    user  system elapsed
>>   0.918   0.174   1.091
>>
>> system.time(foobar<- findOverlaps(foo, bar))
>>    user  system elapsed
>>   2.051   0.402   2.453
>>
>>
>> However the whole process does take about 3x the time of just the reduce
>> operation, and in my use case I want this to happen interactively, where
>> waiting 3 seconds compared to 1 makes a huge difference...
>>
>> I wouldn't push this high up on the development agenda, but it seems to be
>> something that is already 95% existing and could easily be added. But
>> maybe I am wrong...
>>
>> Florian
>>
>>
>>
>>
>> Florian Hahne
>> Novartis Institute For Biomedical Research
>> Translational Sciences / Preclinical Safety / PCS Informatics
>> Expert Data Integration and Modeling Bioinformatics
>> CHBS, WKL-135.2.26
>> Novartis Institute For Biomedical Research, Werk Klybeck
>> Klybeckstrasse 141
>> CH-4057 Basel
>> Switzerland
>> Phone: +41 61 6967127
>> Email : florian.hahne at novartis.com
>>
>>
>>
>>
>>
>>
>>
>> On 3/14/12 9:40 PM, "Kasper Daniel Hansen"<kasperdanielhansen at gmail.com>
>> wrote:
>>
>>> We have discussed this a couple of times.  I routinely uses the reduce
>>> followed by findOverlaps paradigm.  As Malcolm says it feels wrong,
>>> but from a practical point of view it is pretty fast, so I stopped
>>> worrying about it.  I only think there is a reason to do this, if it
>>> is substantially faster.
>>>
>>> Kasper
>>>
>>> On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm<MEC at stowers.org>  wrote:
>>>> Chiming in....
>>>>
>>>> on a similar note....
>>>>
>>>> A version of `disjoin` which returns a Hits/RangesMapping additional to
>>>> the GRanges result would be most useful  and probably not require much
>>>> additional effort (assuming `disjoin` computes this internally)
>>>>
>>>> Of course, it is easy to live without since I can just perform the
>>>> findOverlaps myself after the disjoin.... it just "feels wrong" (tm)
>>>>
>>>> Ahoy!
>>>>
>>>> ~Malcolm
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-bounces at r-
>>>>> project.org] On Behalf Of Hahne, Florian
>>>>> Sent: Wednesday, March 14, 2012 2:22 PM
>>>>> To: bioc-devel at r-project.org
>>>>> Subject: [Bioc-devel] mapping between original and reduced ranges
>>>>>
>>>>> This bounced before, guess the mailing list does not like HTML mails.
>>>>> So
>>>>> one more try:
>>>>>
>>>>> I had the following offline discussion with Michael about how one could
>>>>> retain a mapping of the ranges in a GRanges object before and after
>>>>> reduce. He suggested to take it to the list. Is that something that
>>>>> could
>>>>> be added to GenomicRanges/IRanges?
>>>>> Florian
>>>>>
>>>>> I have a slightly tricky application for which I need to reduce a
>>>>> GRanges
>>>>> object, but I would like to be able to process some of the original
>>>>> elementMetadata of the merged ranges later. The only way I was able to
>>>>> figure out which of the original ranges correspond to the merged ranges
>>>>> was to perform a findOverlaps operation, but of course that is rather
>>>>> costly. Is there a way to get the merge information out of the original
>>>>> reduce call?
>>>>> Here is a brief example:
>>>>>
>>>>> gr<- GRanges(seqnames="chr1", ranges=IRanges(start=c(1,6,12,24,27),
>>>>> width=5), foo=1:5, bar=letters[1:5])
>>>>> gr2<- reduce(gr, min.gapwidth=1)
>>>>> ind<- queryHits(findOverlaps(gr2, gr))
>>>>> split(values(gr), ind)
>>>>>
>>>>>
>>>>> Unfortunately, this is the idiom. I could see an improvement where
>>>>> reduce
>>>>> or a similarly named function would return a Hits object (in addition
>>>>> to
>>>>> the actual reduce result) that would indicate the mapping between the
>>>>> input and reduced ranges. The RangesMapping structure would be really
>>>>> close to what we would need.
>>>>>
>>>>> Michael
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list