[Bioc-devel] mapping between original and reduced ranges

Hahne, Florian florian.hahne at novartis.com
Thu Mar 15 23:52:02 CET 2012


I kind of like the attribute solution. After all, the mapping is some sort
of supplementary information that may be useful later for certain
operations, but the main purpose of reduce is to get a modified ranges
object. It may not be the most puristic R, but changing the return value
depending on argument settings always feels fishy...
Having another function to do the job is ok for me, but the burden on the
user to master the already huge number of functions and methods is also
something we should worry about.
As you said before, lots of pros and cons for every solution :-(
Florian


Florian Hahne
Novartis Institute For Biomedical Research
Translational Sciences / Preclinical Safety / PCS Informatics
Expert Data Integration and Modeling Bioinformatics
CHBS, WKL-135.2.26
Novartis Institute For Biomedical Research, Werk Klybeck
Klybeckstrasse 141
CH-4057 Basel
Switzerland
Phone: +41 61 6967127
Email : florian.hahne at novartis.com







On 3/15/12 11:26 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:

>On 03/15/2012 02:40 PM, Kasper Daniel Hansen wrote:
>> I'll vote against the attribute solution and for a solution where the
>> type of return object gets changed, for example into a list.
>
>Thanks for voting!
>
>Problem with this is how you handle 'with.mapping=TRUE' when the input
>is GRangesList. Do you return
>
>   (1) a list of the same length as the input 'x', where the i-th
>       top-level element is itself the 2-element list returned
>       by reduce(x[[i]], with.mapping=TRUE)
>
>   (2) or a 2-element list where 1 element is the reduced GRangesList
>       and the other element is an IntegerList representing the
>       list of mappings?
>
>(1) would be very inefficient because the returned object would need
>to be populated with hundreds of thousands of S4 instances.
>
>(2) disrupts too much how reduce() is expected to behave on a
>GRangesList object i.e. it's expected to operate in a "vectorized"
>fashion, that is, each top-level element in the input is reduced
>independently of the others and all the results are stored in a
>list-like object of the *same length* as the input. So we have nice
>properties like
>
>     reduce(x, ...)[[i]] is identical to reduce(x[[i]], ...)
>
>Here that would not be the case anymore :-/
>
>More generally speaking, I would not give up on the "endomorphism"
>nature of reduce() so easily. It gives us good things like for
>example its behaviour on a GRangesList object can be explained
>as easily as with
>
>     endoapply(x, reduce, ....)
>
>*whatever* arguments/parameters/toggles are passed to it. This
>makes the documentation *much* easier to write and also it makes
>writing unit test much easier.
>
>So if we really want to go for the list solution, I would suggest
>that this is done outside reduce() e.g. in reduceAndMap() or
>something like that.
>
>Cheers,
>H.
>
>>
>> Kasper
>>
>> 2012/3/15 Hervé Pagès<hpages at fhcrc.org>:
>>> On 03/15/2012 12:45 PM, Cook, Malcolm wrote:
>>>>
>>>> Hi Herve,
>>>>
>>>> I've not used attributes to return values before.
>>>>
>>>> I guess it would work, and I won't object further if you do it this
>>>>way,
>>>> but, since you asked
>>>>
>>>> Again, it "feels wrong" in violating functional
>>>>
>>>> I suspect there may be issues with memory management.  When does the
>>>> attribute get gc-ed?  When the object does?  If so, then, retaining
>>>>the
>>>> attribute in memory when not needed _could_ be a burden, no?
>>>>
>>>> Back in my lisp days, this is when I would use `values` and
>>>> `multiple-value-bind` (and friends) when I wanted a function to
>>>>(optionally)
>>>> return multiple values.
>>>>
>>>> But this is R.
>>>>
>>>> Would you consider returning instead a list of values, keyed by
>>>>`value`
>>>> and `hits`, but only when with.hits
>>>>
>>>> BTW: with.inframe.attrib is documented as 'For internal use'.  What
>>>>does
>>>> it return in the attr?
>>>
>>>
>>> AFAIK, it's only supported by the "reduce" methods for IRanges objects.
>>>
>>> The "inframe" attribute contains an IRanges object of the same length
>>>as
>>> the input. For each range in the input it tells you the position of
>>> that range with respect to the "frame" i.e. the space obtained by
>>> pasting together the ranges in the reduce object:
>>>
>>>
>>>   >  ir
>>>   IRanges of length 5
>>>       start end width
>>>   [1]    24  28     5
>>>   [2]    27  31     5
>>>   [3]     1   5     5
>>>   [4]     6  10     5
>>>   [5]    12  16     5
>>>
>>>   >  ir2<- reduce(ir, with.inframe.attrib=TRUE)
>>>   >  ir2
>>>   IRanges of length 3
>>>       start end width
>>>   [1]     1  10    10
>>>   [2]    12  16     5
>>>   [3]    24  31     8
>>>   >  attr(ir2, "inframe")
>>>   IRanges of length 5
>>>       start end width
>>>   [1]    16  20     5
>>>   [2]    19  23     5
>>>   [3]     1   5     5
>>>   [4]     6  10     5
>>>   [5]    11  15     5
>>>
>>>
>>>                   1    1    2    2    3
>>>          1...5....0....5....0....5....0.<- standard coordinate system
>>>   ir[1]                         xxxxx
>>>   ir[2]                            xxxxx
>>>   ir[3]  xxxxx
>>>   ir[4]       xxxxx
>>>   ir[5]             xxxxx
>>>
>>>   ir2:   xxxxxxxxxx xxxxx       xxxxxxxx
>>>
>>>          1...5....1 ....1       ....2...<- "frame" coordinate system
>>>                   0     5           0
>>>
>>> I'll document this.
>>>
>>> H.
>>>
>>>
>>>>
>>>> Thanks for listening!
>>>>
>>>> ~Malcolm
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-bounces at r-
>>>>> project.org] On Behalf Of Hervé Pagès
>>>>> Sent: Thursday, March 15, 2012 1:55 PM
>>>>> To: Kasper Daniel Hansen
>>>>> Cc: bioc-devel at r-project.org
>>>>> Subject: Re: [Bioc-devel] mapping between original and reduced ranges
>>>>>
>>>>> Hi reducers,
>>>>>
>>>>> I agree it "feels wrong" to use findOverlaps() to extract the mapping
>>>>> from original to reduced ranges. Even if it can be computed very
>>>>>easily
>>>>> with:
>>>>>
>>>>>     findOverlaps(gr, reduce(gr), select="first")
>>>>>
>>>>> (Note that using 'queryHits(findOverlaps(reduce(gr), gr))' only
>>>>>produces
>>>>> the correct result if 'gr' is already sorted by increasing order.)
>>>>>
>>>>> I think it would be easy for reduce() internal code to produce this
>>>>> mapping. The question is: how do we give it back to the user?
>>>>>
>>>>> Is it OK to use an attribute for this? reduce() already uses this
>>>>> for returning some extra information about the reduction:
>>>>>
>>>>>     >    ir
>>>>>     IRanges of length 5
>>>>>         start end width
>>>>>     [1]     1   5     5
>>>>>     [2]     6  10     5
>>>>>     [3]    12  16     5
>>>>>     [4]    24  28     5
>>>>>     [5]    27  31     5
>>>>>     >    ir2<- reduce(ir, with.inframe.attrib=TRUE)
>>>>>     >    ir2
>>>>>     IRanges of length 3
>>>>>         start end width
>>>>>     [1]     1  10    10
>>>>>     [2]    12  16     5
>>>>>     [3]    24  31     8
>>>>>     >    attr(ir2, "inframe")
>>>>>     IRanges of length 5
>>>>>         start end width
>>>>>     [1]     1   5     5
>>>>>     [2]     6  10     5
>>>>>     [3]    11  15     5
>>>>>     [4]    16  20     5
>>>>>     [5]    19  23     5
>>>>>
>>>>> We could to the same thing for the mapping from original to reduced
>>>>> ranges with e.g. an argument called 'with.mapping.attrib'.
>>>>> Would that work?
>>>>>
>>>>> Cheers,
>>>>> H.
>>>>>
>>>>>
>>>>> On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
>>>>>>
>>>>>> So the key question is to what extent keeping track of where the
>>>>>> ranges comes from would slow down the reduce operation.  I am not
>>>>>> familiar enough with the algorithm to know this, but given how fast
>>>>>> IRanges is in general, I am not one for guessing on this.
>>>>>>
>>>>>> I agree with Florian that this is a very typical use case.
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>> On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian
>>>>>> <florian.hahne at novartis.com>     wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>> It is true that this is not terribly slow when you deal with fairly
>>>>>>> large
>>>>>>> range objects:
>>>>>>>
>>>>>>> foo<- GRanges(seqnames=sample(1:4, 1e6, TRUE),
>>>>>>> ranges=IRanges(start=as.integer(runif(min=1, max=1e7, n=1e6)),
>>>>>
>>>>> width=50))
>>>>>>>
>>>>>>> system.time(bar<- reduce(foo))
>>>>>>>     user  system elapsed
>>>>>>>    0.918   0.174   1.091
>>>>>>>
>>>>>>> system.time(foobar<- findOverlaps(foo, bar))
>>>>>>>     user  system elapsed
>>>>>>>    2.051   0.402   2.453
>>>>>>>
>>>>>>>
>>>>>>> However the whole process does take about 3x the time of just the
>>>>>
>>>>> reduce
>>>>>>>
>>>>>>> operation, and in my use case I want this to happen interactively,
>>>>>>> where
>>>>>>> waiting 3 seconds compared to 1 makes a huge difference...
>>>>>>>
>>>>>>> I wouldn't push this high up on the development agenda, but it
>>>>>>>seems to
>>>>>
>>>>> be
>>>>>>>
>>>>>>> something that is already 95% existing and could easily be added.
>>>>>>>But
>>>>>>> maybe I am wrong...
>>>>>>>
>>>>>>> Florian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Florian Hahne
>>>>>>> Novartis Institute For Biomedical Research
>>>>>>> Translational Sciences / Preclinical Safety / PCS Informatics
>>>>>>> Expert Data Integration and Modeling Bioinformatics
>>>>>>> CHBS, WKL-135.2.26
>>>>>>> Novartis Institute For Biomedical Research, Werk Klybeck
>>>>>>> Klybeckstrasse 141
>>>>>>> CH-4057 Basel
>>>>>>> Switzerland
>>>>>>> Phone: +41 61 6967127
>>>>>>> Email : florian.hahne at novartis.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3/14/12 9:40 PM, "Kasper Daniel
>>>>>
>>>>> Hansen"<kasperdanielhansen at gmail.com>
>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We have discussed this a couple of times.  I routinely uses the
>>>>>>>>reduce
>>>>>>>> followed by findOverlaps paradigm.  As Malcolm says it feels
>>>>>>>>wrong,
>>>>>>>> but from a practical point of view it is pretty fast, so I stopped
>>>>>>>> worrying about it.  I only think there is a reason to do this, if
>>>>>>>>it
>>>>>>>> is substantially faster.
>>>>>>>>
>>>>>>>> Kasper
>>>>>>>>
>>>>>>>> On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm<MEC at stowers.org>
>>>>>
>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Chiming in....
>>>>>>>>>
>>>>>>>>> on a similar note....
>>>>>>>>>
>>>>>>>>> A version of `disjoin` which returns a Hits/RangesMapping
>>>>>>>>>additional
>>>>>>>>> to
>>>>>>>>> the GRanges result would be most useful  and probably not require
>>>>>
>>>>> much
>>>>>>>>>
>>>>>>>>> additional effort (assuming `disjoin` computes this internally)
>>>>>>>>>
>>>>>>>>> Of course, it is easy to live without since I can just perform
>>>>>>>>>the
>>>>>>>>> findOverlaps myself after the disjoin.... it just "feels wrong"
>>>>>>>>>(tm)
>>>>>>>>>
>>>>>>>>> Ahoy!
>>>>>>>>>
>>>>>>>>> ~Malcolm
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: bioc-devel-bounces at r-project.org [mailto:bioc-devel-
>>>>>
>>>>> bounces at r-
>>>>>>>>>>
>>>>>>>>>> project.org] On Behalf Of Hahne, Florian
>>>>>>>>>> Sent: Wednesday, March 14, 2012 2:22 PM
>>>>>>>>>> To: bioc-devel at r-project.org
>>>>>>>>>> Subject: [Bioc-devel] mapping between original and reduced
>>>>>>>>>>ranges
>>>>>>>>>>
>>>>>>>>>> This bounced before, guess the mailing list does not like HTML
>>>>>>>>>> mails.
>>>>>>>>>> So
>>>>>>>>>> one more try:
>>>>>>>>>>
>>>>>>>>>> I had the following offline discussion with Michael about how
>>>>>>>>>>one
>>>>>
>>>>> could
>>>>>>>>>>
>>>>>>>>>> retain a mapping of the ranges in a GRanges object before and
>>>>>>>>>>after
>>>>>>>>>> reduce. He suggested to take it to the list. Is that something
>>>>>>>>>>that
>>>>>>>>>> could
>>>>>>>>>> be added to GenomicRanges/IRanges?
>>>>>>>>>> Florian
>>>>>>>>>>
>>>>>>>>>> I have a slightly tricky application for which I need to reduce
>>>>>>>>>>a
>>>>>>>>>> GRanges
>>>>>>>>>> object, but I would like to be able to process some of the
>>>>>>>>>>original
>>>>>>>>>> elementMetadata of the merged ranges later. The only way I was
>>>>>
>>>>> able to
>>>>>>>>>>
>>>>>>>>>> figure out which of the original ranges correspond to the merged
>>>>>
>>>>> ranges
>>>>>>>>>>
>>>>>>>>>> was to perform a findOverlaps operation, but of course that is
>>>>>>>>>> rather
>>>>>>>>>> costly. Is there a way to get the merge information out of the
>>>>>>>>>> original
>>>>>>>>>> reduce call?
>>>>>>>>>> Here is a brief example:
>>>>>>>>>>
>>>>>>>>>> gr<- GRanges(seqnames="chr1",
>>>>>
>>>>> ranges=IRanges(start=c(1,6,12,24,27),
>>>>>>>>>>
>>>>>>>>>> width=5), foo=1:5, bar=letters[1:5])
>>>>>>>>>> gr2<- reduce(gr, min.gapwidth=1)
>>>>>>>>>> ind<- queryHits(findOverlaps(gr2, gr))
>>>>>>>>>> split(values(gr), ind)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Unfortunately, this is the idiom. I could see an improvement
>>>>>>>>>>where
>>>>>>>>>> reduce
>>>>>>>>>> or a similarly named function would return a Hits object (in
>>>>>>>>>> addition
>>>>>>>>>> to
>>>>>>>>>> the actual reduce result) that would indicate the mapping
>>>>>>>>>>between
>>>>>
>>>>> the
>>>>>>>>>>
>>>>>>>>>> input and reduced ranges. The RangesMapping structure would be
>>>>>
>>>>> really
>>>>>>>>>>
>>>>>>>>>> close to what we would need.
>>>>>>>>>>
>>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>
>
>-- 
>Hervé Pagès
>
>Program in Computational Biology
>Division of Public Health Sciences
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N, M1-B514
>P.O. Box 19024
>Seattle, WA 98109-1024
>
>E-mail: hpages at fhcrc.org
>Phone:  (206) 667-5791
>Fax:    (206) 667-1319
>
>_______________________________________________
>Bioc-devel at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list