[Bioc-devel] GRanges constructor seems to be pretty sluggish

Martin Morgan mtmorgan at fhcrc.org
Wed Jan 12 19:50:57 CET 2011


On 1/12/2011 10:30 AM, florian.hahne at novartis.com wrote:
> Hi List, Patrick, Martin,
> I just realized that creating GRanges objects with many (millions of)
> ranges takes forever compared to the creation of an IRanges object.
> Here are some numbers:
>
> > system.time(ir <- IRanges(start=1:1e6, width=5))
>    user  system elapsed
>   0.053   0.000   0.052
> > system.time(gr <- GRanges(ranges=ir, seqnames=seq_len(length(ir))))
>    user  system elapsed
>   5.238   0.167   5.455

A big part of this is the number of unique 'seqnames': the more distinct
values, the slower the constructor, e.g.,

 > system.time(gr <- GRanges("chr1", ranges=ir, "*"))
    user  system elapsed
   0.044   0.000   0.041
 > seqnames <- sample(LETTERS, length(ir), TRUE)
 > system.time(gr <- GRanges(seqnames, ranges=ir, "*"))
    user  system elapsed
   1.045   0.000   1.043
 > system.time(gr <- GRanges(seq_len(length(ir)), ranges=ir, "*"))
    user  system elapsed
   6.412   0.004   6.422
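
The same pattern can be reproduced in base R without GenomicRanges. This is only a sketch, under the assumption that factor construction accounts for much of the cost (the GRanges internals are not shown here). The names `few` and `many` are made up for illustration, and no particular timings are claimed:

```r
# Illustrative only: compare building a factor from few vs. many unique labels.
n <- 1e6
few  <- sample(LETTERS, n, TRUE)      # at most 26 distinct labels
many <- as.character(seq_len(n))      # every label distinct

# factor() has to find, sort, and match the unique values, so the work
# grows with the number of distinct labels.
t_few  <- system.time(f_few  <- factor(few))[["elapsed"]]
t_many <- system.time(f_many <- factor(many))[["elapsed"]]

cat("few levels: ", t_few,  "s\n")
cat("many levels:", t_many, "s\n")
```

On most machines the second call should be noticeably slower, but the exact ratio depends on the R version and locale.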

It's possible to apply names to IRanges, so if the unique seqnames were only
for tracking purposes one might name the ranges instead:

 > ir <- IRanges(start=1:1e6, width=5, names=rev(seq_len(1e6)))
 > system.time(gr <- GRanges("chrA", ranges=ir, "*"))
    user  system elapsed
   0.860   0.000   0.862

Also, one does get a benefit here: the cost of creating a factor is paid
up-front and then amortized across later use.
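
To illustrate the amortization idea in base R (a rough sketch, not a statement about how GRanges uses its factor internally): `split()` converts a character grouping vector to a factor on every call, whereas a factor built once can be reused. The variables `labels`, `x`, and `f` are hypothetical:

```r
# Illustrative only: pay the factor cost once, then reuse it.
set.seed(1)
n <- 1e5
labels <- as.character(sample(seq_len(n / 10), n, TRUE))  # grouping labels
x <- runif(n)                                             # values to group

f <- factor(labels)   # up-front cost, paid a single time

# Character grouping: split() re-creates the factor on each call.
t_chr <- system.time(for (i in 1:5) s_chr <- split(x, labels))[["elapsed"]]
# Factor grouping: split() reuses the existing integer codes and levels.
t_fac <- system.time(for (i in 1:5) s_fac <- split(x, f))[["elapsed"]]

cat("character grouping:", t_chr, "s; factor grouping:", t_fac, "s\n")
```

Both calls return the same groups; only the repeated conversion work differs.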

Thanks for pointing to a bottleneck.

Martin
>
> For my application this is pretty much a killer. I could move things 
> over to IRanges objects whenever possible, but this is not the 
> cleanest solution because in all my S4 classes I would have to use 
> class unions since GRanges does not inherit from IRanges. All a big 
> mess... I tried to debug the issue a little and there seem to be a 
> bunch of very inefficient code lines in the extended constructor 
> newGRanges, coercions of Rles to factors and so on. Also there seems 
> to be some name checking of the seqnames argument going on, which is 
> also very slow. Maybe these are all necessary steps and there is no 
> chance for further optimization, but maybe these issues have just been 
> overlooked before because nobody ever needed so many regions in there. 
> Any ideas?
> Florian
>
> Best regards,
>
> Florian Hahne
> Novartis Institute For Biomedical Research
> Translational Sciences / PreClinical Safety / Investigative Toxicology (iTOX)
> Expert Data Integration and Modeling Bioinformatics
> CHBS, WKL-135.1.67
> Novartis Institute For Biomedical Research, Werk Klybeck
> Klybeckstrasse 141
> CH-4057 Basel
> Switzerland
> Phone: +41 61 6967127
> Email: florian.hahne at novartis.com
>
>
>


-- 
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109


