[Bioc-devel] RFC: IntervalTrees for GRanges objects

Martin Morgan mtmorgan at fhcrc.org
Wed Apr 3 20:12:40 CEST 2013


On 04/03/2013 10:28 AM, Kasper Daniel Hansen wrote:
> Making IntervalTree chromosome would also be a great addition for organisms
> with many sequences, like bee (due to an incomplete genome; 10,000s of
> sequences).  It does not matter for humans, but findOverlaps is
> excruciatingly slow for bee's.  I have a couple of posts on this in the
> archive.
>
> I am predicting this will be the case in the future for most non-model
> organisms; finishing a genome is expensive and time consuming.

A cheap (both computationally and philosophically) hack is along the lines of

gr <- GRanges(c("A", "B", "A"), IRanges(c(2000, 1000, 3000), width=100))

rng <- range(gr)
off0 <- (width(rng)[-length(rng)] + 1L)
offset <- setNames(c(1L, cumsum(off0)) - start(rng), seqnames(rng))
shift(ranges(gr), offset[as.character(seqnames(gr))])

which shifts the ranges to be non-overlapping by seqname, so findOverlaps can 
operate on the IRanges directly. This is a trick I used in selectMethod(disjoin, 
"CompressedIRangesList"). There's the risk of integer overflow (when 
sum(width(rng) + 1L) > .Machine$integer.max) but this could be handled (if it 
occurs!) by partitioning into .Machine$integer.max-sized ranges. Unless IRanges 
were updated to support longer integers...

Martin

>
> Kasper
>
>
> On Wed, Apr 3, 2013 at 12:45 PM, Michael Lawrence <lawrence.michael at gene.com
>> wrote:
>
>> Some ideas:
>>
>> - Turn the IntervalTree into a list/array of nodes that can be
>> subset/reordered with shallow copying (just copy the pointers to the
>> nodes), and the index would be secondary. The index in the array could be
>> stored in each node, for lookup during overlap queries. Right now, as far
>> as I can tell, GIntervalTree will get confused if the user reorders e.g.
>> via [.
>>
>> - Make IntervalTree aware of the sequence/chromosome, e.g., have a hash of
>> trees, which is trivial since seqnames is already a factor.
>>
>> Michael
>>
>>
>>
>> On Wed, Apr 3, 2013 at 9:29 AM, Hector Corrada Bravo <hcorrada at gmail.com
>>> wrote:
>>
>>> Yep, I didn't comment on that, but I agree that abstracting how
>>> GRanges stores ranges would make this more elegant. Right now
>>> ranges(GRanges) is specified to be of IRanges class instead of the
>>> abstract Ranges class.
>>>
>>> If it were the latter then GIntervalTree can be a subclass of
>>> GenomicRanges, in a similar way that IntervalTree is a subclass of
>>> Ranges.
>>>
>>> On Wed, Apr 3, 2013 at 12:23 PM, Michael Lawrence
>>> <lawrence.michael at gene.com> wrote:
>>>> Hi Hector,
>>>>
>>>> That's interesting, thanks for passing this along. I'm still wishing
>> that
>>>> somehow GRanges itself could abstract the way it stores ranges. I know
>>> that
>>>> Herve/Patrick had some reasons for depending specifically on GRanges.
>> One
>>>> reason was probably convenience at the C level, but it wouldn't be hard
>>> to
>>>> create a Ranges abstraction at the C level, as well.
>>>>
>>>> Michael
>>>>
>>>>
>>>>
>>>> On Tue, Apr 2, 2013 at 5:40 PM, Hector Corrada Bravo <
>> hcorrada at gmail.com
>>>>
>>>> wrote:
>>>>>
>>>>> Hello bioc-develers,
>>>>>
>>>>> I'm writing an application where lots findOverlap calls are made on
>>>>> static GRanges objects. For IRanges we can create persistent
>>>>> IntervalTree objects that would serve the multiple overlap query
>>>>> use-case. There is no equivalent for GenomicRanges objects, so I'm
>>>>> proposing an implementation for this.
>>>>>
>>>>> Please check
>>>>> http://github.com/hcorrada/GenomicIntervalTree
>>>>>
>>>>> There's a first cut implementation there you can test by installing
>>>>> this skeleton package. E.g,
>>>>>
>>>>>> library(devtools)
>>>>>> install_github("GenomicIntervalTree", username="hcorrada",
>>> subdir="pkg")
>>>>>> library(GenomicIntervalTree)
>>>>>
>>>>> Let me know what you think.
>>>>>
>>>>> Cheers,
>>>>> Hector
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>>
>>
>>          [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list