[Bioc-devel] mapping between original and reduced ranges
Hervé Pagès
hpages at fhcrc.org
Thu Mar 15 22:01:20 CET 2012
Hi Malcolm,
On 03/15/2012 12:45 PM, Cook, Malcolm wrote:
> Hi Herve,
>
> I've not used attributes to return values before.
>
> I guess it would work, and I won't object further if you do it this way, but, since you asked
>
> Again, it "feels wrong" in violating functional
>
> I suspect there may be issues with memory management. When does the attribute get gc-ed? When the object does?
I don't think attributes are treated differently than e.g. slots or list
elements. They are part of the object and I've never worried before
that gc would take parts of my objects away.
> If so, then, retaining the attribute in memory when not needed _could_ be a burden, no?
You would get the attribute attached to the returned object only if you
asked for it i.e. when calling reduce() with 'with.mapping.attrib=TRUE'.
>
> Back in my lisp days, this is when I would use `values` and `multiple-value-bind` (and friends) when I wanted a function to (optionally) return multiple values.
>
> But this is R.
>
> Would you consider returning instead a list of values, keyed by `value` and `hits`, but only when with.hits
I realize that using an attribute is not ideal. We try to avoid using
attributes for returning extra information as much as we can (and
reduce(..., 'with.inframe.attrib=TRUE') is the only place that I know
in the IRanges/GenomicRanges infrastructure where we are doing it).
Yes there are other options, like returning a list, or returning an
extension of IRanges/GenomicRanges that holds the mapping in an extra
slot, or putting the mapping in the metadata slot of the returned
object. Also we could have a separate function like reduceWithMapping()
to return those things. I guess all those solutions have pros and cons.
BTW, "using or not using attributes?" was asked recently on R-devel:
https://stat.ethz.ch/pipermail/r-devel/2012-January/062930.html
I don't like the attribute solution but amongst all the solutions I
can think of, it's the one I dislike less ;-) My 2nd less disliked
solution would be to use the metadata slot.
Would be good to hear some feedback from other reducers. Thanks!
H.
>
> BTW: with.inframe.attrib is documented as 'For internal use'. What does it return in the attr?
>
> Thanks for listening!
>
> ~Malcolm
>
>
>> -----Original Message-----
>> From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-bounces at r-
>> project.org] On Behalf Of Hervé Pagès
>> Sent: Thursday, March 15, 2012 1:55 PM
>> To: Kasper Daniel Hansen
>> Cc: bioc-devel at r-project.org
>> Subject: Re: [Bioc-devel] mapping between original and reduced ranges
>>
>> Hi reducers,
>>
>> I agree it "feels wrong" to use findOverlaps() to extract the mapping
>> from original to reduced ranges. Even if it can be computed very easily
>> with:
>>
>> findOverlaps(gr, reduce(gr), select="first")
>>
>> (Note that using 'queryHits(findOverlaps(reduce(gr), gr))' only produces
>> the correct result if 'gr' is already sorted by increasing order.)
>>
>> I think it would be easy for reduce() internal code to produce this
>> mapping. The question is: how do we give it back to the user?
>>
>> Is it OK to use an attribute for this? reduce() already uses this
>> for returning some extra information about the reduction:
>>
>> > ir
>> IRanges of length 5
>> start end width
>> [1] 1 5 5
>> [2] 6 10 5
>> [3] 12 16 5
>> [4] 24 28 5
>> [5] 27 31 5
>> > ir2<- reduce(ir, with.inframe.attrib=TRUE)
>> > ir2
>> IRanges of length 3
>> start end width
>> [1] 1 10 10
>> [2] 12 16 5
>> [3] 24 31 8
>> > attr(ir2, "inframe")
>> IRanges of length 5
>> start end width
>> [1] 1 5 5
>> [2] 6 10 5
>> [3] 11 15 5
>> [4] 16 20 5
>> [5] 19 23 5
>>
>> We could to the same thing for the mapping from original to reduced
>> ranges with e.g. an argument called 'with.mapping.attrib'.
>> Would that work?
>>
>> Cheers,
>> H.
>>
>>
>> On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
>>> So the key question is to what extent keeping track of where the
>>> ranges comes from would slow down the reduce operation. I am not
>>> familiar enough with the algorithm to know this, but given how fast
>>> IRanges is in general, I am not one for guessing on this.
>>>
>>> I agree with Florian that this is a very typical use case.
>>>
>>> Kasper
>>>
>>> On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian
>>> <florian.hahne at novartis.com> wrote:
>>>> Hi all,
>>>> It is true that this is not terribly slow when you deal with fairly large
>>>> range objects:
>>>>
>>>> foo<- GRanges(seqnames=sample(1:4, 1e6, TRUE),
>>>> ranges=IRanges(start=as.integer(runif(min=1, max=1e7, n=1e6)),
>> width=50))
>>>> system.time(bar<- reduce(foo))
>>>> user system elapsed
>>>> 0.918 0.174 1.091
>>>>
>>>> system.time(foobar<- findOverlaps(foo, bar))
>>>> user system elapsed
>>>> 2.051 0.402 2.453
>>>>
>>>>
>>>> However the whole process does take about 3x the time of just the
>> reduce
>>>> operation, and in my use case I want this to happen interactively, where
>>>> waiting 3 seconds compared to 1 makes a huge difference...
>>>>
>>>> I wouldn't push this high up on the development agenda, but it seems to
>> be
>>>> something that is already 95% existing and could easily be added. But
>>>> maybe I am wrong...
>>>>
>>>> Florian
>>>>
>>>>
>>>>
>>>>
>>>> Florian Hahne
>>>> Novartis Institute For Biomedical Research
>>>> Translational Sciences / Preclinical Safety / PCS Informatics
>>>> Expert Data Integration and Modeling Bioinformatics
>>>> CHBS, WKL-135.2.26
>>>> Novartis Institute For Biomedical Research, Werk Klybeck
>>>> Klybeckstrasse 141
>>>> CH-4057 Basel
>>>> Switzerland
>>>> Phone: +41 61 6967127
>>>> Email : florian.hahne at novartis.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 3/14/12 9:40 PM, "Kasper Daniel
>> Hansen"<kasperdanielhansen at gmail.com>
>>>> wrote:
>>>>
>>>>> We have discussed this a couple of times. I routinely uses the reduce
>>>>> followed by findOverlaps paradigm. As Malcolm says it feels wrong,
>>>>> but from a practical point of view it is pretty fast, so I stopped
>>>>> worrying about it. I only think there is a reason to do this, if it
>>>>> is substantially faster.
>>>>>
>>>>> Kasper
>>>>>
>>>>> On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm<MEC at stowers.org>
>> wrote:
>>>>>> Chiming in....
>>>>>>
>>>>>> on a similar note....
>>>>>>
>>>>>> A version of `disjoin` which returns a Hits/RangesMapping additional to
>>>>>> the GRanges result would be most useful and probably not require
>> much
>>>>>> additional effort (assuming `disjoin` computes this internally)
>>>>>>
>>>>>> Of course, it is easy to live without since I can just perform the
>>>>>> findOverlaps myself after the disjoin.... it just "feels wrong" (tm)
>>>>>>
>>>>>> Ahoy!
>>>>>>
>>>>>> ~Malcolm
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-
>> bounces at r-
>>>>>>> project.org] On Behalf Of Hahne, Florian
>>>>>>> Sent: Wednesday, March 14, 2012 2:22 PM
>>>>>>> To: bioc-devel at r-project.org
>>>>>>> Subject: [Bioc-devel] mapping between original and reduced ranges
>>>>>>>
>>>>>>> This bounced before, guess the mailing list does not like HTML mails.
>>>>>>> So
>>>>>>> one more try:
>>>>>>>
>>>>>>> I had the following offline discussion with Michael about how one
>> could
>>>>>>> retain a mapping of the ranges in a GRanges object before and after
>>>>>>> reduce. He suggested to take it to the list. Is that something that
>>>>>>> could
>>>>>>> be added to GenomicRanges/IRanges?
>>>>>>> Florian
>>>>>>>
>>>>>>> I have a slightly tricky application for which I need to reduce a
>>>>>>> GRanges
>>>>>>> object, but I would like to be able to process some of the original
>>>>>>> elementMetadata of the merged ranges later. The only way I was
>> able to
>>>>>>> figure out which of the original ranges correspond to the merged
>> ranges
>>>>>>> was to perform a findOverlaps operation, but of course that is rather
>>>>>>> costly. Is there a way to get the merge information out of the original
>>>>>>> reduce call?
>>>>>>> Here is a brief example:
>>>>>>>
>>>>>>> gr<- GRanges(seqnames="chr1",
>> ranges=IRanges(start=c(1,6,12,24,27),
>>>>>>> width=5), foo=1:5, bar=letters[1:5])
>>>>>>> gr2<- reduce(gr, min.gapwidth=1)
>>>>>>> ind<- queryHits(findOverlaps(gr2, gr))
>>>>>>> split(values(gr), ind)
>>>>>>>
>>>>>>>
>>>>>>> Unfortunately, this is the idiom. I could see an improvement where
>>>>>>> reduce
>>>>>>> or a similarly named function would return a Hits object (in addition
>>>>>>> to
>>>>>>> the actual reduce result) that would indicate the mapping between
>> the
>>>>>>> input and reduced ranges. The RangesMapping structure would be
>> really
>>>>>>> close to what we would need.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list