[Bioc-devel] mapping between original and reduced ranges
Hervé Pagès
hpages at fhcrc.org
Thu Mar 15 19:55:25 CET 2012
Hi reducers,
I agree it "feels wrong" to use findOverlaps() to extract the mapping
from original to reduced ranges. Even if it can be computed very easily
with:
findOverlaps(gr, reduce(gr), select="first")
(Note that using 'queryHits(findOverlaps(reduce(gr), gr))' only produces
the correct result if 'gr' is already sorted by increasing order.)
I think it would be easy for reduce() internal code to produce this
mapping. The question is: how do we give it back to the user?
Is it OK to use an attribute for this? reduce() already uses this
for returning some extra information about the reduction:
> ir
IRanges of length 5
start end width
[1] 1 5 5
[2] 6 10 5
[3] 12 16 5
[4] 24 28 5
[5] 27 31 5
> ir2 <- reduce(ir, with.inframe.attrib=TRUE)
> ir2
IRanges of length 3
start end width
[1] 1 10 10
[2] 12 16 5
[3] 24 31 8
> attr(ir2, "inframe")
IRanges of length 5
start end width
[1] 1 5 5
[2] 6 10 5
[3] 11 15 5
[4] 16 20 5
[5] 19 23 5
We could to the same thing for the mapping from original to reduced
ranges with e.g. an argument called 'with.mapping.attrib'.
Would that work?
Cheers,
H.
On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
> So the key question is to what extent keeping track of where the
> ranges comes from would slow down the reduce operation. I am not
> familiar enough with the algorithm to know this, but given how fast
> IRanges is in general, I am not one for guessing on this.
>
> I agree with Florian that this is a very typical use case.
>
> Kasper
>
> On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian
> <florian.hahne at novartis.com> wrote:
>> Hi all,
>> It is true that this is not terribly slow when you deal with fairly large
>> range objects:
>>
>> foo<- GRanges(seqnames=sample(1:4, 1e6, TRUE),
>> ranges=IRanges(start=as.integer(runif(min=1, max=1e7, n=1e6)), width=50))
>> system.time(bar<- reduce(foo))
>> user system elapsed
>> 0.918 0.174 1.091
>>
>> system.time(foobar<- findOverlaps(foo, bar))
>> user system elapsed
>> 2.051 0.402 2.453
>>
>>
>> However the whole process does take about 3x the time of just the reduce
>> operation, and in my use case I want this to happen interactively, where
>> waiting 3 seconds compared to 1 makes a huge difference...
>>
>> I wouldn't push this high up on the development agenda, but it seems to be
>> something that is already 95% existing and could easily be added. But
>> maybe I am wrong...
>>
>> Florian
>>
>>
>>
>>
>> Florian Hahne
>> Novartis Institute For Biomedical Research
>> Translational Sciences / Preclinical Safety / PCS Informatics
>> Expert Data Integration and Modeling Bioinformatics
>> CHBS, WKL-135.2.26
>> Novartis Institute For Biomedical Research, Werk Klybeck
>> Klybeckstrasse 141
>> CH-4057 Basel
>> Switzerland
>> Phone: +41 61 6967127
>> Email : florian.hahne at novartis.com
>>
>>
>>
>>
>>
>>
>>
>> On 3/14/12 9:40 PM, "Kasper Daniel Hansen"<kasperdanielhansen at gmail.com>
>> wrote:
>>
>>> We have discussed this a couple of times. I routinely uses the reduce
>>> followed by findOverlaps paradigm. As Malcolm says it feels wrong,
>>> but from a practical point of view it is pretty fast, so I stopped
>>> worrying about it. I only think there is a reason to do this, if it
>>> is substantially faster.
>>>
>>> Kasper
>>>
>>> On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm<MEC at stowers.org> wrote:
>>>> Chiming in....
>>>>
>>>> on a similar note....
>>>>
>>>> A version of `disjoin` which returns a Hits/RangesMapping additional to
>>>> the GRanges result would be most useful and probably not require much
>>>> additional effort (assuming `disjoin` computes this internally)
>>>>
>>>> Of course, it is easy to live without since I can just perform the
>>>> findOverlaps myself after the disjoin.... it just "feels wrong" (tm)
>>>>
>>>> Ahoy!
>>>>
>>>> ~Malcolm
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-bounces at r-
>>>>> project.org] On Behalf Of Hahne, Florian
>>>>> Sent: Wednesday, March 14, 2012 2:22 PM
>>>>> To: bioc-devel at r-project.org
>>>>> Subject: [Bioc-devel] mapping between original and reduced ranges
>>>>>
>>>>> This bounced before, guess the mailing list does not like HTML mails.
>>>>> So
>>>>> one more try:
>>>>>
>>>>> I had the following offline discussion with Michael about how one could
>>>>> retain a mapping of the ranges in a GRanges object before and after
>>>>> reduce. He suggested to take it to the list. Is that something that
>>>>> could
>>>>> be added to GenomicRanges/IRanges?
>>>>> Florian
>>>>>
>>>>> I have a slightly tricky application for which I need to reduce a
>>>>> GRanges
>>>>> object, but I would like to be able to process some of the original
>>>>> elementMetadata of the merged ranges later. The only way I was able to
>>>>> figure out which of the original ranges correspond to the merged ranges
>>>>> was to perform a findOverlaps operation, but of course that is rather
>>>>> costly. Is there a way to get the merge information out of the original
>>>>> reduce call?
>>>>> Here is a brief example:
>>>>>
>>>>> gr<- GRanges(seqnames="chr1", ranges=IRanges(start=c(1,6,12,24,27),
>>>>> width=5), foo=1:5, bar=letters[1:5])
>>>>> gr2<- reduce(gr, min.gapwidth=1)
>>>>> ind<- queryHits(findOverlaps(gr2, gr))
>>>>> split(values(gr), ind)
>>>>>
>>>>>
>>>>> Unfortunately, this is the idiom. I could see an improvement where
>>>>> reduce
>>>>> or a similarly named function would return a Hits object (in addition
>>>>> to
>>>>> the actual reduce result) that would indicate the mapping between the
>>>>> input and reduced ranges. The RangesMapping structure would be really
>>>>> close to what we would need.
>>>>>
>>>>> Michael
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list