[Bioc-devel] mapping between original and reduced ranges

Hahne, Florian florian.hahne at novartis.com
Fri Mar 16 00:13:56 CET 2012


Would such a solution also allow to keep the original elementMetadata in
the respective list representation? I assume that creates about the same
overhead as keeping the index?


Florian Hahne
Novartis Institute For Biomedical Research
Translational Sciences / Preclinical Safety / PCS Informatics
Expert Data Integration and Modeling Bioinformatics
CHBS, WKL-135.2.26
Novartis Institute For Biomedical Research, Werk Klybeck
Klybeckstrasse 141
CH-4057 Basel
Switzerland
Phone: +41 61 6967127
Email : florian.hahne at novartis.com







On 3/15/12 11:58 PM, "Michael Lawrence" <lawrence.michael at gene.com> wrote:

>I would be in favor of either the attribute or metadata solution. I could
>see having an IntegerList element in the element metadata that indicates
>the original ranges that were reduced into the returned range, or a Hits
>object in the top-level metadata. A plus and minus of the metadata
>approach
>is that it is more familiar to the users than hiding stuff in attributes,
>which is pretty low-level. However, using the metadata will increase the
>probabilty of "getting in the way". The user does need to explicitly
>request it though.
>
>Michael
>
>On Thu, Mar 15, 2012 at 3:26 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>> On 03/15/2012 02:40 PM, Kasper Daniel Hansen wrote:
>>
>>> I'll vote against the attribute solution and for a solution where the
>>> type of return object gets changed, for example into a list.
>>>
>>
>> Thanks for voting!
>>
>> Problem with this is how you handle 'with.mapping=TRUE' when the input
>> is GRangesList. Do you return
>>
>>  (1) a list of the same length as the input 'x', where the i-th
>>      top-level element is itself the 2-element list returned
>>      by reduce(x[[i]], with.mapping=TRUE)
>>
>>  (2) or a 2-element list where 1 element is the reduced GRangesList
>>      and the other element is an IntegerList representing the
>>      list of mappings?
>>
>> (1) would be very inefficient because the returned object would need
>> to be populated with hundreds of thousands of S4 instances.
>>
>> (2) disrupts too much how reduce() is expected to behave on a
>> GRangesList object i.e. it's expected to operate in a "vectorized"
>> fashion, that is, each top-level element in the input is reduced
>> independently of the others and all the results are stored in a
>> list-like object of the *same length* as the input. So we have nice
>> properties like
>>
>>    reduce(x, ...)[[i]] is identical to reduce(x[[i]], ...)
>>
>> Here that would not be the case anymore :-/
>>
>> More generally speaking, I would not give up on the "endomorphism"
>> nature of reduce() so easily. It gives us good things like for
>> example its behaviour on a GRangesList object can be explained
>> as easily as with
>>
>>    endoapply(x, reduce, ....)
>>
>> *whatever* arguments/parameters/toggles are passed to it. This
>> makes the documentation *much* easier to write and also it makes
>> writing unit test much easier.
>>
>> So if we really want to go for the list solution, I would suggest
>> that this is done outside reduce() e.g. in reduceAndMap() or
>> something like that.
>>
>> Cheers,
>>
>> H.
>>
>>
>>> Kasper
>>>
>>> 2012/3/15 Hervé Pagès<hpages at fhcrc.org>:
>>>
>>>> On 03/15/2012 12:45 PM, Cook, Malcolm wrote:
>>>>
>>>>>
>>>>> Hi Herve,
>>>>>
>>>>> I've not used attributes to return values before.
>>>>>
>>>>> I guess it would work, and I won't object further if you do it this
>>>>>way,
>>>>> but, since you asked
>>>>>
>>>>> Again, it "feels wrong" in violating functional
>>>>>
>>>>> I suspect there may be issues with memory management.  When does the
>>>>> attribute get gc-ed?  When the object does?  If so, then, retaining
>>>>>the
>>>>> attribute in memory when not needed _could_ be a burden, no?
>>>>>
>>>>> Back in my lisp days, this is when I would use `values` and
>>>>> `multiple-value-bind` (and friends) when I wanted a function to
>>>>> (optionally)
>>>>> return multiple values.
>>>>>
>>>>> But this is R.
>>>>>
>>>>> Would you consider returning instead a list of values, keyed by
>>>>>`value`
>>>>> and `hits`, but only when with.hits
>>>>>
>>>>> BTW: with.inframe.attrib is documented as 'For internal use'.  What
>>>>>does
>>>>> it return in the attr?
>>>>>
>>>>
>>>>
>>>> AFAIK, it's only supported by the "reduce" methods for IRanges
>>>>objects.
>>>>
>>>> The "inframe" attribute contains an IRanges object of the same length
>>>>as
>>>> the input. For each range in the input it tells you the position of
>>>> that range with respect to the "frame" i.e. the space obtained by
>>>> pasting together the ranges in the reduce object:
>>>>
>>>>
>>>>  >  ir
>>>>  IRanges of length 5
>>>>      start end width
>>>>  [1]    24  28     5
>>>>  [2]    27  31     5
>>>>  [3]     1   5     5
>>>>  [4]     6  10     5
>>>>  [5]    12  16     5
>>>>
>>>>  >  ir2<- reduce(ir, with.inframe.attrib=TRUE)
>>>>  >  ir2
>>>>  IRanges of length 3
>>>>      start end width
>>>>  [1]     1  10    10
>>>>  [2]    12  16     5
>>>>  [3]    24  31     8
>>>>  >  attr(ir2, "inframe")
>>>>  IRanges of length 5
>>>>      start end width
>>>>  [1]    16  20     5
>>>>  [2]    19  23     5
>>>>  [3]     1   5     5
>>>>  [4]     6  10     5
>>>>  [5]    11  15     5
>>>>
>>>>
>>>>                  1    1    2    2    3
>>>>         1...5....0....5....0....5....**0.<- standard coordinate system
>>>>  ir[1]                         xxxxx
>>>>  ir[2]                            xxxxx
>>>>  ir[3]  xxxxx
>>>>  ir[4]       xxxxx
>>>>  ir[5]             xxxxx
>>>>
>>>>  ir2:   xxxxxxxxxx xxxxx       xxxxxxxx
>>>>
>>>>         1...5....1 ....1       ....2...<- "frame" coordinate system
>>>>                  0     5           0
>>>>
>>>> I'll document this.
>>>>
>>>> H.
>>>>
>>>>
>>>>
>>>>> Thanks for listening!
>>>>>
>>>>> ~Malcolm
>>>>>
>>>>>
>>>>>  -----Original Message-----
>>>>>> From: 
>>>>>>bioc-devel-bounces at r-project.**org<bioc-devel-bounces at r-project.org>[
>>>>>>mailto:
>>>>>> bioc-devel-bounces at r-
>>>>>> project.org] On Behalf Of Hervé Pagès
>>>>>> Sent: Thursday, March 15, 2012 1:55 PM
>>>>>> To: Kasper Daniel Hansen
>>>>>> Cc: bioc-devel at r-project.org
>>>>>> Subject: Re: [Bioc-devel] mapping between original and reduced
>>>>>>ranges
>>>>>>
>>>>>> Hi reducers,
>>>>>>
>>>>>> I agree it "feels wrong" to use findOverlaps() to extract the
>>>>>>mapping
>>>>>> from original to reduced ranges. Even if it can be computed very
>>>>>>easily
>>>>>> with:
>>>>>>
>>>>>>    findOverlaps(gr, reduce(gr), select="first")
>>>>>>
>>>>>> (Note that using 'queryHits(findOverlaps(**reduce(gr), gr))' only
>>>>>> produces
>>>>>> the correct result if 'gr' is already sorted by increasing order.)
>>>>>>
>>>>>> I think it would be easy for reduce() internal code to produce this
>>>>>> mapping. The question is: how do we give it back to the user?
>>>>>>
>>>>>> Is it OK to use an attribute for this? reduce() already uses this
>>>>>> for returning some extra information about the reduction:
>>>>>>
>>>>>>    >    ir
>>>>>>    IRanges of length 5
>>>>>>        start end width
>>>>>>    [1]     1   5     5
>>>>>>    [2]     6  10     5
>>>>>>    [3]    12  16     5
>>>>>>    [4]    24  28     5
>>>>>>    [5]    27  31     5
>>>>>>    >    ir2<- reduce(ir, with.inframe.attrib=TRUE)
>>>>>>    >    ir2
>>>>>>    IRanges of length 3
>>>>>>        start end width
>>>>>>    [1]     1  10    10
>>>>>>    [2]    12  16     5
>>>>>>    [3]    24  31     8
>>>>>>    >    attr(ir2, "inframe")
>>>>>>    IRanges of length 5
>>>>>>        start end width
>>>>>>    [1]     1   5     5
>>>>>>    [2]     6  10     5
>>>>>>    [3]    11  15     5
>>>>>>    [4]    16  20     5
>>>>>>    [5]    19  23     5
>>>>>>
>>>>>> We could to the same thing for the mapping from original to reduced
>>>>>> ranges with e.g. an argument called 'with.mapping.attrib'.
>>>>>> Would that work?
>>>>>>
>>>>>> Cheers,
>>>>>> H.
>>>>>>
>>>>>>
>>>>>> On 03/15/2012 05:44 AM, Kasper Daniel Hansen wrote:
>>>>>>
>>>>>>>
>>>>>>> So the key question is to what extent keeping track of where the
>>>>>>> ranges comes from would slow down the reduce operation.  I am not
>>>>>>> familiar enough with the algorithm to know this, but given how fast
>>>>>>> IRanges is in general, I am not one for guessing on this.
>>>>>>>
>>>>>>> I agree with Florian that this is a very typical use case.
>>>>>>>
>>>>>>> Kasper
>>>>>>>
>>>>>>> On Thu, Mar 15, 2012 at 5:02 AM, Hahne, Florian
>>>>>>> <florian.hahne at novartis.com>     wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> It is true that this is not terribly slow when you deal with
>>>>>>>>fairly
>>>>>>>> large
>>>>>>>> range objects:
>>>>>>>>
>>>>>>>> foo<- GRanges(seqnames=sample(1:4, 1e6, TRUE),
>>>>>>>> ranges=IRanges(start=as.**integer(runif(min=1, max=1e7, n=1e6)),
>>>>>>>>
>>>>>>>
>>>>>> width=50))
>>>>>>
>>>>>>>
>>>>>>>> system.time(bar<- reduce(foo))
>>>>>>>>    user  system elapsed
>>>>>>>>   0.918   0.174   1.091
>>>>>>>>
>>>>>>>> system.time(foobar<- findOverlaps(foo, bar))
>>>>>>>>    user  system elapsed
>>>>>>>>   2.051   0.402   2.453
>>>>>>>>
>>>>>>>>
>>>>>>>> However the whole process does take about 3x the time of just the
>>>>>>>>
>>>>>>>
>>>>>> reduce
>>>>>>
>>>>>>>
>>>>>>>> operation, and in my use case I want this to happen interactively,
>>>>>>>> where
>>>>>>>> waiting 3 seconds compared to 1 makes a huge difference...
>>>>>>>>
>>>>>>>> I wouldn't push this high up on the development agenda, but it
>>>>>>>>seems
>>>>>>>> to
>>>>>>>>
>>>>>>>
>>>>>> be
>>>>>>
>>>>>>>
>>>>>>>> something that is already 95% existing and could easily be added.
>>>>>>>>But
>>>>>>>> maybe I am wrong...
>>>>>>>>
>>>>>>>> Florian
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Florian Hahne
>>>>>>>> Novartis Institute For Biomedical Research
>>>>>>>> Translational Sciences / Preclinical Safety / PCS Informatics
>>>>>>>> Expert Data Integration and Modeling Bioinformatics
>>>>>>>> CHBS, WKL-135.2.26
>>>>>>>> Novartis Institute For Biomedical Research, Werk Klybeck
>>>>>>>> Klybeckstrasse 141
>>>>>>>> CH-4057 Basel
>>>>>>>> Switzerland
>>>>>>>> Phone: +41 61 6967127
>>>>>>>> Email : florian.hahne at novartis.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/14/12 9:40 PM, "Kasper Daniel
>>>>>>>>
>>>>>>>
>>>>>> Hansen"<kasperdanielhansen@**gmail.com
>>>>>><kasperdanielhansen at gmail.com>>
>>>>>>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  We have discussed this a couple of times.  I routinely uses the
>>>>>>>>> reduce
>>>>>>>>> followed by findOverlaps paradigm.  As Malcolm says it feels
>>>>>>>>>wrong,
>>>>>>>>> but from a practical point of view it is pretty fast, so I
>>>>>>>>>stopped
>>>>>>>>> worrying about it.  I only think there is a reason to do this,
>>>>>>>>>if it
>>>>>>>>> is substantially faster.
>>>>>>>>>
>>>>>>>>> Kasper
>>>>>>>>>
>>>>>>>>> On Wed, Mar 14, 2012 at 3:46 PM, Cook, Malcolm<MEC at stowers.org>
>>>>>>>>>
>>>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>>>> Chiming in....
>>>>>>>>>>
>>>>>>>>>> on a similar note....
>>>>>>>>>>
>>>>>>>>>> A version of `disjoin` which returns a Hits/RangesMapping
>>>>>>>>>> additional
>>>>>>>>>> to
>>>>>>>>>> the GRanges result would be most useful  and probably not
>>>>>>>>>>require
>>>>>>>>>>
>>>>>>>>>
>>>>>> much
>>>>>>
>>>>>>>
>>>>>>>>>> additional effort (assuming `disjoin` computes this internally)
>>>>>>>>>>
>>>>>>>>>> Of course, it is easy to live without since I can just perform
>>>>>>>>>>the
>>>>>>>>>> findOverlaps myself after the disjoin.... it just "feels wrong"
>>>>>>>>>> (tm)
>>>>>>>>>>
>>>>>>>>>> Ahoy!
>>>>>>>>>>
>>>>>>>>>> ~Malcolm
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  -----Original Message-----
>>>>>>>>>>> From: 
>>>>>>>>>>>bioc-devel-bounces at r-project.**org<bioc-devel-bounces at r-project.
>>>>>>>>>>>org>[mailto:
>>>>>>>>>>> bioc-devel-
>>>>>>>>>>>
>>>>>>>>>>
>>>>>> bounces at r-
>>>>>>
>>>>>>>
>>>>>>>>>>> project.org] On Behalf Of Hahne, Florian
>>>>>>>>>>> Sent: Wednesday, March 14, 2012 2:22 PM
>>>>>>>>>>> To: bioc-devel at r-project.org
>>>>>>>>>>> Subject: [Bioc-devel] mapping between original and reduced
>>>>>>>>>>>ranges
>>>>>>>>>>>
>>>>>>>>>>> This bounced before, guess the mailing list does not like HTML
>>>>>>>>>>> mails.
>>>>>>>>>>> So
>>>>>>>>>>> one more try:
>>>>>>>>>>>
>>>>>>>>>>> I had the following offline discussion with Michael about how
>>>>>>>>>>>one
>>>>>>>>>>>
>>>>>>>>>>
>>>>>> could
>>>>>>
>>>>>>>
>>>>>>>>>>> retain a mapping of the ranges in a GRanges object before and
>>>>>>>>>>> after
>>>>>>>>>>> reduce. He suggested to take it to the list. Is that something
>>>>>>>>>>> that
>>>>>>>>>>> could
>>>>>>>>>>> be added to GenomicRanges/IRanges?
>>>>>>>>>>> Florian
>>>>>>>>>>>
>>>>>>>>>>> I have a slightly tricky application for which I need to
>>>>>>>>>>>reduce a
>>>>>>>>>>> GRanges
>>>>>>>>>>> object, but I would like to be able to process some of the
>>>>>>>>>>> original
>>>>>>>>>>> elementMetadata of the merged ranges later. The only way I was
>>>>>>>>>>>
>>>>>>>>>>
>>>>>> able to
>>>>>>
>>>>>>>
>>>>>>>>>>> figure out which of the original ranges correspond to the
>>>>>>>>>>>merged
>>>>>>>>>>>
>>>>>>>>>>
>>>>>> ranges
>>>>>>
>>>>>>>
>>>>>>>>>>> was to perform a findOverlaps operation, but of course that is
>>>>>>>>>>> rather
>>>>>>>>>>> costly. Is there a way to get the merge information out of the
>>>>>>>>>>> original
>>>>>>>>>>> reduce call?
>>>>>>>>>>> Here is a brief example:
>>>>>>>>>>>
>>>>>>>>>>> gr<- GRanges(seqnames="chr1",
>>>>>>>>>>>
>>>>>>>>>>
>>>>>> ranges=IRanges(start=c(1,6,12,**24,27),
>>>>>>
>>>>>>>
>>>>>>>>>>> width=5), foo=1:5, bar=letters[1:5])
>>>>>>>>>>> gr2<- reduce(gr, min.gapwidth=1)
>>>>>>>>>>> ind<- queryHits(findOverlaps(gr2, gr))
>>>>>>>>>>> split(values(gr), ind)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, this is the idiom. I could see an improvement
>>>>>>>>>>>where
>>>>>>>>>>> reduce
>>>>>>>>>>> or a similarly named function would return a Hits object (in
>>>>>>>>>>> addition
>>>>>>>>>>> to
>>>>>>>>>>> the actual reduce result) that would indicate the mapping
>>>>>>>>>>>between
>>>>>>>>>>>
>>>>>>>>>>
>>>>>> the
>>>>>>
>>>>>>>
>>>>>>>>>>> input and reduced ranges. The RangesMapping structure would be
>>>>>>>>>>>
>>>>>>>>>>
>>>>>> really
>>>>>>
>>>>>>>
>>>>>>>>>>> close to what we would need.
>>>>>>>>>>>
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>> ______________________________**_________________
>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>> 
>>>>>>>>>>>https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.
>>>>>>>>>>>ethz.ch/mailman/listinfo/bioc-devel>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ______________________________**_________________
>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>> 
>>>>>>>>>>https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.e
>>>>>>>>>>thz.ch/mailman/listinfo/bioc-devel>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> ______________________________**_________________
>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>> 
>>>>>>>https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz
>>>>>>>.ch/mailman/listinfo/bioc-devel>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpages at fhcrc.org
>>>>>> Phone:  (206) 667-5791
>>>>>> Fax:    (206) 667-1319
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> 
>>>>>>https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.
>>>>>>ch/mailman/listinfo/bioc-devel>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>>>
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>> ______________________________**_________________
>> Bioc-devel at r-project.org mailing list
>> 
>>https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/m
>>ailman/listinfo/bioc-devel>
>>
>
>	[[alternative HTML version deleted]]
>
>_______________________________________________
>Bioc-devel at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list