[BioC] Getting the length of every element from a large CompressedIRangesList is slow
Hervé Pagès
hpages at fhcrc.org
Mon Jul 2 20:25:35 CEST 2012
Hi Nico,
Even faster:
> system.time(sizes <- elementLengths(exbytx))
   user  system elapsed
  0.000   0.000   0.001
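
A rough illustration of why this can be so much faster than sapply(): as far as I understand it, a compressed list keeps all its ranges in one flat object plus a set of breakpoints marking where each element ends, so the element lengths can be read off the breakpoints without extracting any element. The sketch below models that idea in base R only. The names flat_data and part_ends are made up for the example and are not the actual IRanges slots or internals.

```r
# Toy model of a compressed list: one flat vector plus the cumulative
# end position of each element (hypothetical names, not IRanges slots).
flat_data <- c(5L, 8L, 2L, 9L, 1L, 7L)   # all values, concatenated
part_ends <- c(2L, 2L, 5L, 6L)           # element i ends at part_ends[i]

# Fast path: element lengths come straight from the breakpoints,
# so no element is ever extracted.
elt_lens <- diff(c(0L, part_ends))
print(elt_lens)

# Slow path: materialize every element, then ask for its length.
starts <- c(0L, part_ends[-length(part_ends)]) + 1L
slow_lens <- mapply(
    function(s, e) length(flat_data[seq_len(e - s + 1L) + s - 1L]),
    starts, part_ends
)

# Both paths agree; only the amount of work differs.
stopifnot(identical(elt_lens, slow_lens))
```

With sapply(aln.ranges, length) every one of the 2518 elements has to be turned into a standalone object first, which is presumably where the time goes for the elements holding millions of ranges.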
Note that elementLengths works on any list-like object
("list-like" = an ordinary list, or an instance of the List class or
one of its subclasses):
> x <- rep(list(a=1:4, b=letters), 500000)
> length(x)
[1] 1000000
> system.time(x_eltlens <- sapply(x, length))
   user  system elapsed
  3.132   0.008   3.142
> system.time(x_eltlens2 <- elementLengths(x))
   user  system elapsed
  0.024   0.000   0.023
> identical(x_eltlens, x_eltlens2)
[1] TRUE
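
For readers on a more recent setup: base R (>= 3.2.0) provides lengths(), which gives the same result for ordinary lists without any Bioconductor package, and in later Bioconductor releases elementLengths was renamed elementNROWS (S4Vectors package). A minimal sketch using base R only, on a scaled-down version of the list above (the variable name x_eltlens3 is just for this example):

```r
# Same toy list as above, scaled down so it runs instantly.
x <- rep(list(a = 1:4, b = letters), 5000)

# Per-element loop, as in the sapply() timing.
x_eltlens <- sapply(x, length)

# Vectorized base R alternative (requires R >= 3.2.0).
x_eltlens3 <- lengths(x)

# Same values and same names either way.
stopifnot(identical(x_eltlens, x_eltlens3))
```

Both calls keep the element names, so the results can be compared with identical() directly.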
HTH,
H.
On 07/02/2012 10:18 AM, Nicolas Delhomme wrote:
> Hi,
>
> Just to extend on my previous message:
>
> Doing this instead is fast:
>
>> system.time(sizes <- sapply(width(aln.ranges),length))
>
>    user  system elapsed
>   1.109   0.144   1.254
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
> On Jul 2, 2012, at 7:02 PM, Nicolas Delhomme wrote:
>
>> Hej!
>>
>> I've a rather large CompressedIRangesList
>>
>>> print(object.size(aln.ranges), units="Mb")
>> 390.4 Mb
>>
>> that has 2518 elements, some of which have up to 6M ranges, for a total of 51M. The vast majority are small, though: the median is 2 and the 3rd quartile is 47, while the mean is ~20,000.
>>
>> Retrieving the element length is slow:
>>
>>> system.time(sizes <- sapply(aln.ranges,length))
>>
>>    user  system elapsed
>> 265.777 169.222 443.498
>>
>> compared to the performance of the IRanges package in general, which surprised me. Is there a faster way to get this information than the sapply I'm using? Note that the machine I'm using is not a limiting factor in terms of CPU/RAM/load.
>>
>>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] C/UTF-8/C/C/C/C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] IRanges_1.15.15 BiocGenerics_0.3.0
>>
>> loaded via a namespace (and not attached):
>> [1] stats4_2.15.1
>>
>> Nico
>>
>> P.S. If you need, I can send my aln.ranges object off-list.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319