[Bioc-devel] GRanges constructor seems to be pretty sluggish
Martin Morgan
mtmorgan at fhcrc.org
Wed Jan 12 19:50:57 CET 2011
On 1/12/2011 10:30 AM, florian.hahne at novartis.com wrote:
> Hi List, Patrick, Martin,
> I just realized that creating GRanges objects with many (millions) of
> ranges takes forever compared to the creation of an IRanges object.
> Here are some numbers:
>
> > system.time(ir <- IRanges(start=1:1e6, width=5))
> user system elapsed
> 0.053 0.000 0.052
> > system.time(gr <- GRanges(ranges=ir, seqnames=seq_len(length(ir))))
> user system elapsed
> 5.238 0.167 5.455
A big part of this is the uniqueness of the 'seqnames', e.g.,
> system.time(gr <- GRanges("chr1", ranges=ir, "*"))
user system elapsed
0.044 0.000 0.041
> seqnames <- sample(LETTERS, length(ir), TRUE)
> system.time(gr <- GRanges(seqnames, ranges=ir, "*"))
user system elapsed
1.045 0.000 1.043
> system.time(gr <- GRanges(seq_len(length(ir)), ranges=ir, "*"))
user system elapsed
6.412 0.004 6.422
It's possible to attach names to an IRanges object, so if the unique
seqnames were only there for tracking purposes, one might instead do
> ir <- IRanges(start=1:1e6, width=5, names=rev(seq_len(1e6)))
> system.time(gr <- GRanges("chrA", ranges=ir, "*"))
user system elapsed
0.860 0.000 0.862
One does also get a benefit here: the cost of creating a factor is
paid once, up-front, and is then amortized across later use.
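A rough way to see where that up-front cost might come from, using
base R only. This is a sketch under the assumption that building a
factor with one level per unique seqname is a dominant cost; it does
not call GRanges and has not been checked against the constructor's
code. The names n, few_levels and many_levels are made up for the
illustration.

```r
# Base R only; no Bioconductor packages are needed for this sketch.
n <- 1e5  # smaller than the 1e6 used in the thread, to keep the run short

few_levels  <- rep(LETTERS, length.out = n)  # only 26 distinct values
many_levels <- seq_len(n)                    # every value is distinct

# factor() must find, sort and match the unique values, so its cost is
# expected to grow with the number of distinct levels.
t_few  <- system.time(f_few  <- factor(few_levels))
t_many <- system.time(f_many <- factor(many_levels))

# Elapsed seconds for each case; actual values depend on the machine.
print(c(few = t_few[["elapsed"]], many = t_many[["elapsed"]]))

# Once built, the factor's integer codes can be reused without
# recomputing the levels.
codes <- as.integer(f_many)
```

On most machines the many-level case should be noticeably slower, but
the size of the gap will vary.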
Thanks for pointing to a bottleneck.
Martin
>
> For my application this is pretty much a killer. I could move things
> over to IRanges objects whenever possible, but this is not the
> cleanest solution because in all my S4 classes I would have to use
> class unions since GRanges does not inherit from IRanges. All a big
> mess... I tried to debug the issue a little and there seem to be a
> bunch of very inefficient code lines in the extended constructor
> newGRanges, coercions of Rles to factors and so on. Also there seems
> to be some name checking of the seqnames argument going on, which is
> also very slow. Maybe these are all necessary steps and there is no
> chance for further optimization, but maybe these issues have just been
> overlooked before because nobody ever needed so many regions in there.
> Any ideas?
> Florian
>
> Best regards, Mit freundlichen Grüssen, Meilleures salutations,
>
> Florian Hahne
> Novartis Institute For Biomedical Research
> Translational Sciences / PreClinical Safety / investigativeToxicology
> (iTOX)
> Expert Data Integration and Modeling Bioinformatics
> CHBS, WKL-135.1.67
> Novartis Institute For Biomedical Research, Werk Klybeck
> Klybeckstrasse 141
> CH-4057 Basel
> Switzerland
> Phone: +41 61 6967127
> Email: florian.hahne at novartis.com
>
>
>
--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109