[BioC] GRanges performance issue
Hervé Pagès
hpages at fhcrc.org
Fri Jul 8 10:29:24 CEST 2011
Hi Arne,
On 11-07-07 08:45 AM, Mueller, Arne wrote:
> Hello,
>
> I realized there's a massive performance difference between subsetting GRanges objects by name and using the GRanges subset method.
>
> Example:
>
>> length(mm9.tiled)
> [1] 5309835
>> n = names(mm9.tiled)
>> rn = sample(n, 1000)
>> system.time(tmp<- subset(mm9.tiled, names(mm9.tiled) %in% rn))
> user system elapsed
> 1.610 0.131 1.741
>> system.time(tmp<- mm9.tiled[rn])
> user system elapsed
> 72.793 0.167 72.976
Note that subsetting with
mm9.tiled[rn] # A
is not the same as subsetting with
mm9.tiled[names(mm9.tiled) %in% rn] # B
because the latter does not reorder the elements: B keeps them in their
original order, whereas A returns them in the order given by rn.
An equivalent to A would rather be
mm9.tiled[match(rn, names(mm9.tiled))] # C
and yes, C is also much faster than A (50x faster on my machine
for a GRanges with 1 million elements). I agree that this can hardly
be justified: I don't see any reason why A couldn't be made as fast
as C (or almost). I believe the culprit is the call to
IRanges:::.bracket.Index() in the "[" method for "GRanges"
objects. I'll try to come up with a fix.
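To make the difference between A, B and C concrete, here is a minimal sketch that uses a plain named vector as a stand-in for mm9.tiled. It assumes that name-based subsetting of a GRanges follows the same ordering semantics as base R vectors, and that names are unique. The object x and the "tile" names are made up for illustration.

```r
# Hypothetical stand-in for mm9.tiled: a small named integer vector.
x <- setNames(seq_len(10), paste0("tile", seq_len(10)))

# Names to extract, deliberately NOT in the order they occur in x.
rn <- c("tile7", "tile2", "tile5")

res_a <- x[rn]                     # A: subset by name, result follows rn
res_b <- x[names(x) %in% rn]       # B: logical subset, keeps original order
res_c <- x[match(rn, names(x))]    # C: integer subset, result follows rn

names(res_a)   # "tile7" "tile2" "tile5"
names(res_b)   # "tile2" "tile5" "tile7"
identical(res_a, res_c)   # TRUE
```

On a real GRanges the same three expressions can be wrapped in system.time() to compare A and C directly. If names are not unique, match() returns only the first hit for each name, so C would pick the first matching element in that case.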
Thanks for reporting this.
H.
>>
>> sessionInfo()
> R version 2.14.0 Under development (unstable) (2011-06-01 r56028)
> Platform: x86_64-unknown-linux-gnu/x86_64 (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices datasets utils methods base
>
> other attached packages:
> [1] GenomicRanges_1.5.12 IRanges_1.11.10
>
> loaded via a namespace (and not attached):
> [1] tools_2.14.0
>
>
> Is this a known (wanted?) behavior?
>
> Regards,
>
> Arne
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319