[Rd] execution time of .packages
Romain Francois
romain.francois at dbmail.com
Tue Mar 3 11:48:18 CET 2009
Prof Brian Ripley wrote:
> The caching is in the disc system: you need to find and read the
> package metadata for every package. AFAIK it is not easy to flush the
> disc cache, but quite easy to overwrite it with later reads. (Google
> for more info.)
Thanks for the info, I'll try to find my way with these directions.
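One thing I may try, assuming a Linux machine with root access (an assumption about my own setup, not something from the thread), is to drop the kernel page cache between runs, e.g. from R:

# assumes Linux and sudo rights; writing 3 to /proc/sys/vm/drop_caches
# flushes the page cache, dentries and inodes after a sync
system("sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'")
system.time( packs <- .packages( all.available = TRUE ) )   # cold-cache timing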
> If you are not concerned about validity of the installed packages you
> could skip the tests and hence the reads.
>
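If skipping the validity checks is acceptable, something like the rough sketch below (my own approximation, not what .packages does; it only checks that a DESCRIPTION file exists rather than reading it, so it may report broken installs) might already be much cheaper:

fast_packages <- function( lib.loc = .libPaths() ){
    # hypothetical helper: directory names in each library that contain a DESCRIPTION
    unlist( lapply( lib.loc, function( lib ){
        dirs <- list.files( lib, full.names = TRUE )
        basename( dirs )[ file.exists( file.path( dirs, "DESCRIPTION" ) ) ]
    } ) )
}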
> Your times are quite a bit slower than mine, so a faster disc system
> might help. Since my server has just been rebooted (for a new
> kernel), with all of CRAN and most of BioC I get
>
>> system.time( packs <- .packages( all = T ) )
>    user  system elapsed
>   0.518   0.262  25.042
>> system.time( packs <- .packages( all = T ) )
>    user  system elapsed
>   0.442   0.080   0.522
>> length(packs)
> [1] 2096
>
> There's a similar issue when installing packages: the Perl code reads
> the indices from every visible package to resolve links, and that can
> be slow the first time.
>
>
> On Tue, 3 Mar 2009, Romain Francois wrote:
>
>> Hello,
>>
>> The first time in a session I call .packages( all.available = T ), it
>> takes a long time (I have many packages installed from CRAN):
>>
>>> system.time( packs <- .packages( all = T ) )
>>    user  system elapsed
>>   0.738   0.276  43.787
>>
>> When I call it again, the time is much reduced, so there must be
>> some caching somewhere. I would like to reduce the time it takes
>> the first time, but I have not been able to identify where the
>> caching takes place, and so how I can clear it in order to test the
>> running time without the caching. Otherwise I would have to restart my
>> computer each time to clear the cache before testing a new version of
>> the function (and that is not going to happen).
>>
>> Here is the .packages function. I am suspicious about this part:
>> "ans <- c(ans, nam)", which grows the ans vector each time a suitable
>> package is found; this does not sound right.
>
> It's OK as there are only going to be ca 2000 packages. Try profiling
> this: .readRDS and grepl take most of the time.
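For reference, the profiling run I have in mind looks like this (just the obvious sketch; the output file name is arbitrary):

Rprof( "packages.prof" )
packs <- .packages( all.available = TRUE )
Rprof( NULL )
summaryRprof( "packages.prof" )$by.self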
That said, I usually do not trust the result of the profiler when a for
loop is involved, as it tends to miss the point (or maybe I am the one
missing it).
Consider the script below: the profiler reports 0.22 seconds when the
actual time spent is about 6 seconds, and it would blame rnorm as the
bottleneck when the inefficiency really lies in growing the data structure.
Rprof( )
x <- numeric( )
for( i in 1:10000 ){
    x <- c( x, rnorm(10) )
}
Rprof( NULL )
print( summaryRprof( ) )
$ time Rscript --vanilla profexample.R
$by.self
        self.time self.pct total.time total.pct
"rnorm"      0.22      100       0.22       100

$by.total
        total.time total.pct self.time self.pct
"rnorm"       0.22       100      0.22       100

$sampling.time
[1] 0.22

real 0m6.164s
user 0m5.156s
sys 0m0.737s
$ time Rscript --vanilla -e "rnorm(10)"
[1] 0.836411851 1.762081444 1.076305644 2.063515383 0.643254750
[6] 1.698620443 -1.774479062 -0.432886214 -0.007949533 0.284089832
real 0m0.224s
user 0m0.187s
sys 0m0.024s
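As an aside, a preallocated version of the same loop (only a sketch, I have not timed it here) avoids the repeated copying that I believe is the real cost:

x <- numeric( 10000 * 10 )                 # preallocate the full result vector
for( i in 1:10000 ){
    # fill the next block of 10 values in place instead of growing x
    x[ (10 * (i - 1) + 1):(10 * i) ] <- rnorm(10)
}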
Now, if I replace the for loop with a similarly silly lapply construct,
the profiler tells me a rather different story:
Rprof( )
x <- numeric( )
y <- lapply( 1:10000, function(i){
    x <<- c( x, rnorm(10) )
    NULL
} )
Rprof( NULL )
print( summaryRprof( ) )
$ time Rscript --vanilla prof2.R
$by.self
         self.time self.pct total.time total.pct
"FUN"         6.48     96.1       6.68      99.1
"rnorm"       0.20      3.0       0.20       3.0
"lapply"      0.06      0.9       6.74     100.0

$by.total
         total.time total.pct self.time self.pct
"lapply"       6.74     100.0      0.06       0.9
"FUN"          6.68      99.1      6.48      96.1
"rnorm"        0.20       3.0      0.20       3.0

$sampling.time
[1] 6.74

real 0m8.352s
user 0m4.762s
sys 0m2.574s
Or let us wrap the for loop of the first example in a function:
Rprof( )
x <- numeric( )
ffor <- function(){
    for( i in 1:10000 ){
        x <- c( x, rnorm(10) )
    }
}
ffor()
Rprof( NULL )
print( summaryRprof( ) )
$ time Rscript --vanilla prof3.R
$by.self
        self.time self.pct total.time total.pct
"ffor"        5.4     96.4        5.6     100.0
"rnorm"       0.2      3.6        0.2       3.6

$by.total
        total.time total.pct self.time self.pct
"ffor"         5.6     100.0       5.4      96.4
"rnorm"        0.2       3.6       0.2       3.6

$sampling.time
[1] 5.6

real 0m6.379s
user 0m5.408s
sys 0m0.717s
Maybe I am getting this all wrong, and maybe the global assignment
operator is responsible for some of the time in the second example. But
how can I analyse the result of the profiler in the first example, when it
only seems interested in the 0.22 seconds and I want to know what is going
on during the rest of the time?
Is it possible to treat "for" as a function when writing the profiler
data, so that I can trust it more?
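In the meantime, the only workaround I see is to wrap the body of the loop in a named function, so that the copy has something on the call stack to be attributed to (a sketch of the idea; I would expect the time to show up under "grow"):

grow <- function( x ) c( x, rnorm(10) )    # the copy now happens inside a named closure

Rprof( )
x <- numeric( )
for( i in 1:10000 ){
    x <- grow( x )
}
Rprof( NULL )
print( summaryRprof( ) )                   # "grow" should now absorb the time spent in c()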
Romain
--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr