[Bioc-devel] mapping between original and reduced ranges
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Thu Mar 15 22:40:24 CET 2012
I'll vote against the attribute solution and for a solution where the
type of return object gets changed, for example into a list.
Kasper
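
To make the list idea concrete, here is a minimal base R sketch. It is not
IRanges code: the helper name reduce_with_hits is hypothetical, the element
names value and hits are borrowed from Malcolm's suggestion quoted below,
and ranges are plain integer start/end vectors. Adjacent ranges are merged,
as in the quoted examples.

```r
# Hypothetical helper (not part of IRanges): merge overlapping or adjacent
# integer ranges and return the merged ranges plus the input-to-output mapping.
reduce_with_hits <- function(start, end) {
  stopifnot(length(start) == length(end), length(start) > 0)
  ord <- order(start, end)          # process ranges from left to right
  s <- start[ord]
  e <- end[ord]
  n <- length(s)
  run_end <- cummax(e)              # furthest end seen so far
  # A new merged range starts wherever a gap of at least one position opens.
  is_new <- c(TRUE, s[-1] > run_end[-n] + 1)
  is_last <- c(is_new[-1], TRUE)    # last input range of each merged range
  hits <- integer(n)
  hits[ord] <- cumsum(is_new)       # mapping, in the original input order
  list(value = data.frame(start = s[is_new], end = run_end[is_last]),
       hits = hits)
}

# Unsorted ranges from the quoted example.
res <- reduce_with_hits(c(24, 27, 1, 6, 12), c(28, 31, 5, 10, 16))
res$value   # merged ranges: 1-10, 12-16, 24-31
res$hits    # 3 3 1 1 2
```

With a return value of this shape, per-range metadata could be grouped with
split(meta, res$hits) without a second overlap search.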
2012/3/15 Hervé Pagès <hpages at fhcrc.org>:
> On 03/15/2012 12:45 PM, Cook, Malcolm wrote:
>>
>> Hi Herve,
>>
>> I've not used attributes to return values before.
>>
>> I guess it would work, and I won't object further if you do it this way,
>> but, since you asked
>>
>> Again, it "feels wrong" in violating functional
>>
>> I suspect there may be issues with memory management. When does the
>> attribute get gc-ed? When the object does? If so, then, retaining the
>> attribute in memory when not needed _could_ be a burden, no?
>>
>> Back in my lisp days, this is when I would use `values` and
>> `multiple-value-bind` (and friends) when I wanted a function to (optionally)
>> return multiple values.
>>
>> But this is R.
>>
>> Would you consider returning instead a list of values, keyed by `value`
>> and `hits`, but only when with.hits
>>
>> BTW: with.inframe.attrib is documented as 'For internal use'. What does
>> it return in the attr?
>
>
> AFAIK, it's only supported by the "reduce" methods for IRanges objects.
>
> The "inframe" attribute contains an IRanges object of the same length as
> the input. For each range in the input it tells you the position of
> that range with respect to the "frame" i.e. the space obtained by
> pasting together the ranges in the reduce object:
>
>
> > ir
> IRanges of length 5
>     start end width
> [1]    24  28     5
> [2]    27  31     5
> [3]     1   5     5
> [4]     6  10     5
> [5]    12  16     5
>
> > ir2 <- reduce(ir, with.inframe.attrib=TRUE)
> > ir2
> IRanges of length 3
>     start end width
> [1]     1  10    10
> [2]    12  16     5
> [3]    24  31     8
> > attr(ir2, "inframe")
> IRanges of length 5
>     start end width
> [1]    16  20     5
> [2]    19  23     5
> [3]     1   5     5
> [4]     6  10     5
> [5]    11  15     5
>
>
>                 1    1    2    2    3
>        1...5....0....5....0....5....0.  <- standard coordinate system
> ir[1]                         xxxxx
> ir[2]                            xxxxx
> ir[3]  xxxxx
> ir[4]       xxxxx
> ir[5]             xxxxx
>
> ir2:   xxxxxxxxxx xxxxx       xxxxxxxx
>
>        1...5....1 ....1       ....2...  <- "frame" coordinate system
>                 0     5           0
>
> I'll document this.
>
> H.
>
>
>>
>> Thanks for listening!
>>
>> ~Malcolm
>>
>>
>>> -----Original Message-----
>>> From: bioc-devel-bounces at r-project.org
>>> [mailto:bioc-devel-bounces at r-project.org] On Behalf Of Hervé Pagès
>>> Sent: Thursday, March 15, 2012 1:55 PM
>>> To: Kasper Daniel Hansen
>>> Cc: bioc-devel at r-project.org
>>> Subject: Re: [Bioc-devel] mapping between original and reduced ranges
>>>
>>> Hi reducers,
>>>
>>> I agree it "feels wrong" to use findOverlaps() to extract the mapping
>>> from original to reduced ranges. Even if it can be computed very easily
>>> with:
>>>
>>> findOverlaps(gr, reduce(gr), select="first")
>>>
>>> (Note that using 'queryHits(findOverlaps(reduce(gr), gr))' only produces
>>> the correct result if 'gr' is already sorted by increasing order.)
>>>
>>> I think it would be easy for reduce() internal code to produce this
>>> mapping. The question is: how do we give it back to the user?
>>>
>>> Is it OK to use an attribute for this? reduce() already uses this
>>> for returning some extra information about the reduction:
>>>
>>> > ir
>>> IRanges of length 5
>>>     start end width
>>> [1]     1   5     5
>>> [2]     6  10     5
>>> [3]    12  16     5
>>> [4]    24  28     5
>>> [5]    27  31     5
>>> > ir2 <- reduce(ir, with.inframe.attrib=TRUE)
>>> > ir2
>>> IRanges of length 3
>>>     start end width
>>> [1]     1  10    10
>>> [2]    12  16     5
>>> [3]    24  31     8
>>> > attr(ir2, "inframe")
>>> IRanges of length 5
>>>     start end width
>>> [1]     1   5     5
>>> [2]     6  10     5
>>> [3]    11  15     5
>>> [4]    16  20     5
>>> [5]    19  23     5
>>>
>>> We could do the same thing for the mapping from original to reduced
>>> ranges with e.g. an argument called 'with.mapping.attrib'.
>>> Would that work?
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
>>>>
>>>> So the key question is to what extent keeping track of where the
>>>> ranges come from would slow down the reduce operation. I am not
>>>> familiar enough with the algorithm to know this, but given how fast
>>>> IRanges is in general, I am not one for guessing on this.
>>>>
>>>> I agree with Florian that this is a very typical use case.
>>>>
>>>> Kasper
>>>>
>>>> On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian
>>>> <florian.hahne at novartis.com> wrote:
>>>>>
>>>>> Hi all,
>>>>> It is true that this is not terribly slow when you deal with fairly
>>>>> large range objects:
>>>>>
>>>>> foo <- GRanges(seqnames=sample(1:4, 1e6, TRUE),
>>>>>                ranges=IRanges(start=as.integer(runif(min=1, max=1e7, n=1e6)),
>>>>>                               width=50))
>>>>>
>>>>> system.time(bar <- reduce(foo))
>>>>>    user  system elapsed
>>>>>   0.918   0.174   1.091
>>>>>
>>>>> system.time(foobar <- findOverlaps(foo, bar))
>>>>>    user  system elapsed
>>>>>   2.051   0.402   2.453
>>>>>
>>>>>
>>>>> However the whole process does take about 3x the time of just the
>>>>> reduce operation, and in my use case I want this to happen
>>>>> interactively, where waiting 3 seconds compared to 1 makes a huge
>>>>> difference...
>>>>>
>>>>> I wouldn't push this high up on the development agenda, but it seems
>>>>> to be something that is already 95% existing and could easily be
>>>>> added. But maybe I am wrong...
>>>>>
>>>>> Florian
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Florian Hahne
>>>>> Novartis Institute For Biomedical Research
>>>>> Translational Sciences / Preclinical Safety / PCS Informatics
>>>>> Expert Data Integration and Modeling Bioinformatics
>>>>> CHBS, WKL-135.2.26
>>>>> Novartis Institute For Biomedical Research, Werk Klybeck
>>>>> Klybeckstrasse 141
>>>>> CH-4057 Basel
>>>>> Switzerland
>>>>> Phone: +41 61 6967127
>>>>> Email : florian.hahne at novartis.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 3/14/12 9:40 PM, "Kasper Daniel Hansen"
>>>>> <kasperdanielhansen at gmail.com> wrote:
>>>>>
>>>>>> We have discussed this a couple of times. I routinely use the reduce
>>>>>> followed by findOverlaps paradigm. As Malcolm says it feels wrong,
>>>>>> but from a practical point of view it is pretty fast, so I stopped
>>>>>> worrying about it. I only think there is a reason to do this, if it
>>>>>> is substantially faster.
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>> On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm <MEC at stowers.org> wrote:
>>>>>>>
>>>>>>> Chiming in....
>>>>>>>
>>>>>>> on a similar note....
>>>>>>>
>>>>>>> A version of `disjoin` which returns a Hits/RangesMapping in addition
>>>>>>> to the GRanges result would be most useful and probably not require
>>>>>>> much additional effort (assuming `disjoin` computes this internally)
>>>>>>>
>>>>>>> Of course, it is easy to live without since I can just perform the
>>>>>>> findOverlaps myself after the disjoin.... it just "feels wrong" (tm)
>>>>>>>
>>>>>>> Ahoy!
>>>>>>>
>>>>>>> ~Malcolm
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: bioc-devel-bounces at r-project.org
>>>>>>>> [mailto:bioc-devel-bounces at r-project.org] On Behalf Of Hahne, Florian
>>>>>>>> Sent: Wednesday, March 14, 2012 2:22 PM
>>>>>>>> To: bioc-devel at r-project.org
>>>>>>>> Subject: [Bioc-devel] mapping between original and reduced ranges
>>>>>>>>
>>>>>>>> This bounced before, guess the mailing list does not like HTML
>>>>>>>> mails. So one more try:
>>>>>>>>
>>>>>>>> I had the following offline discussion with Michael about how one
>>>>>>>> could retain a mapping of the ranges in a GRanges object before and
>>>>>>>> after reduce. He suggested taking it to the list. Is that something
>>>>>>>> that could be added to GenomicRanges/IRanges?
>>>>>>>> Florian
>>>>>>>>
>>>>>>>> I have a slightly tricky application for which I need to reduce a
>>>>>>>> GRanges object, but I would like to be able to process some of the
>>>>>>>> original elementMetadata of the merged ranges later. The only way I
>>>>>>>> was able to figure out which of the original ranges correspond to
>>>>>>>> the merged ranges was to perform a findOverlaps operation, but of
>>>>>>>> course that is rather costly. Is there a way to get the merge
>>>>>>>> information out of the original reduce call?
>>>>>>>> Here is a brief example:
>>>>>>>>
>>>>>>>> gr <- GRanges(seqnames="chr1",
>>>>>>>>               ranges=IRanges(start=c(1,6,12,24,27), width=5),
>>>>>>>>               foo=1:5, bar=letters[1:5])
>>>>>>>> gr2 <- reduce(gr, min.gapwidth=1)
>>>>>>>> ind <- queryHits(findOverlaps(gr2, gr))
>>>>>>>> split(values(gr), ind)
>>>>>>>>
>>>>>>>>
>>>>>>>> Unfortunately, this is the idiom. I could see an improvement where
>>>>>>>> reduce or a similarly named function would return a Hits object (in
>>>>>>>> addition to the actual reduce result) that would indicate the
>>>>>>>> mapping between the input and reduced ranges. The RangesMapping
>>>>>>>> structure would be really close to what we would need.
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
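
The "inframe" coordinates in the quoted example can be reproduced by hand.
The following is a minimal base R sketch, not the IRanges implementation:
the variable names are made up for illustration, and the reduced ranges and
the input-to-reduced mapping (hits) are typed in from the example rather
than computed.

```r
# Unsorted input ranges from the quoted example.
start <- c(24, 27, 1, 6, 12)
end   <- c(28, 31, 5, 10, 16)
# Reduced ranges from the example, and the reduced range each input falls in
# (derived by hand; this is an assumption of the sketch, not IRanges output).
red_start <- c(1, 12, 24)
red_end   <- c(10, 16, 31)
hits      <- c(3, 3, 1, 1, 2)

# Offset of each reduced range in the "frame": the total width of the
# reduced ranges that precede it.
red_width <- red_end - red_start + 1
offset <- cumsum(c(0, red_width))[seq_along(red_width)]

# Shift every input range into frame coordinates, keeping its width.
frame_start <- start - red_start[hits] + offset[hits] + 1
frame_end   <- frame_start + (end - start)
inframe <- data.frame(start = frame_start, end = frame_end)
inframe   # starts 16 19 1 6 11, ends 20 23 5 10 15
```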