[Rd] execution time of .packages

Romain Francois romain.francois at dbmail.com
Tue Mar 3 11:48:18 CET 2009


Prof Brian Ripley wrote:
> The caching is in the disc system: you need to find and read the 
> package metadata for every package.  AFAIK it is not easy to flush the 
> disc cache, but quite easy to overwrite it with later reads.  (Google 
> for more info.)
Thanks for the info, I'll try to find my way with these directions.
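
For what it's worth, on Linux the kernel exposes a knob for exactly this; 
a sketch (Linux-specific, needs root, and entirely outside R -- this is 
the standard /proc interface, not anything package-related):

```sh
# Drop the page cache (plus dentries and inodes) between timing runs, so
# the next .packages(all.available = TRUE) has to read from disc again.
# Linux-only, requires root; "3" = page cache + dentries + inodes.
sync                                        # write dirty pages out first
echo 3 | sudo tee /proc/sys/vm/drop_caches
```

That would let one benchmark the cold-cache path repeatedly without 
rebooting.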

> If you are not concerned about validity of the installed packages you 
> could skip the tests and hence the reads.
>
> Your times are quite a bit slower than mine, so a faster disc system 
> might help.  Since my server has just been rebooted (for a new 
> kernel), with all of CRAN and most of BioC I get
>
>> system.time( packs <- .packages( all = T ) )
>    user  system elapsed
>   0.518   0.262  25.042
>> system.time( packs <- .packages( all = T ) )
>    user  system elapsed
>   0.442   0.080   0.522
>> length(packs)
> [1] 2096
>
> There's a similar issue when installing packages: the Perl code reads 
> the indices from every visible package to resolve links, and that can 
> be slow the first time.
>
>
> On Tue, 3 Mar 2009, Romain Francois wrote:
>
>> Hello,
>>
>> The first time in a session I call .packages( all.available = T ), it 
>> takes a long time (I have many packages installed from CRAN):
>>
>>> system.time( packs <- .packages( all = T ) )
>> user  system elapsed
>> 0.738   0.276  43.787
>>
>> When I call it again, the time is now much reduced, so there must be 
>> some caching somewhere. I would like to try to reduce the  time it 
>> takes the first time, but I have not been able to identify where the 
>> caching takes place, and so how I can remove it to try to improve the 
>> running time without the caching. Without this, I have to restart my 
>> computer each time to vanish the caching to test a new version of the 
>> function (this is not going to happen)
>>
>> Here is the .packages function, I am suspicious about this part : 
>> "ans <- c(ans, nam)" which grows the ans vector each time a suitable 
>> package is found, this does not sound right.
>
> It's OK as there are only going to be ca 2000 packages.  Try profiling 
> this: .readRDS and grepl take most of the time.
I usually do not trust the result of the profiler when a for loop is 
involved, as it tends to miss the point (or maybe I am missing it).

Consider the script below: the profiler reports 0.22 seconds when the 
actual time spent is about 6 seconds, and it would blame rnorm as the 
bottleneck when the inefficiency is in growing the data structure.

Rprof( )
x <- numeric( )
for( i in 1:10000){
  x <- c( x, rnorm(10) )
}
Rprof( NULL )
print( summaryRprof( ) )

$ time Rscript --vanilla profexample.R
$by.self
        self.time self.pct total.time total.pct
"rnorm"      0.22      100       0.22       100

$by.total
        total.time total.pct self.time self.pct
"rnorm"       0.22       100      0.22      100

$sampling.time
[1] 0.22

real    0m6.164s
user    0m5.156s
sys     0m0.737s

$ time Rscript --vanilla -e "rnorm(10)"
 [1]  0.836411851  1.762081444  1.076305644  2.063515383  0.643254750
 [6]  1.698620443 -1.774479062 -0.432886214 -0.007949533  0.284089832

real    0m0.224s
user    0m0.187s
sys     0m0.024s
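
Incidentally, the quadratic cost of growing x with c() can be avoided by 
preallocating and filling by index; a sketch using the same rnorm(10) 
body as above:

```r
# Growing x with c() copies the whole vector on every iteration
# (quadratic total cost); preallocating keeps each iteration O(1).
n <- 10000
x <- numeric(n * 10)
for (i in 1:n) {
  x[((i - 1) * 10 + 1):(i * 10)] <- rnorm(10)
}
# or, since the pieces are independent, simply: x <- rnorm(n * 10)
```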


Now, if I replace the for loop with a similarly silly lapply construct, 
the profiler tells me a rather different story:

Rprof( )
x <- numeric( )
y <- lapply( 1:10000, function(i){
    x <<- c( x, rnorm(10) )
    NULL
} )
Rprof( NULL )
print( summaryRprof( ) )

$ time Rscript --vanilla prof2.R
$by.self                                           
         self.time self.pct total.time total.pct   
"FUN"         6.48     96.1       6.68      99.1   
"rnorm"       0.20      3.0       0.20       3.0
"lapply"      0.06      0.9       6.74     100.0

$by.total
         total.time total.pct self.time self.pct
"lapply"       6.74     100.0      0.06      0.9
"FUN"          6.68      99.1      6.48     96.1
"rnorm"        0.20       3.0      0.20      3.0

$sampling.time
[1] 6.74

real    0m8.352s
user    0m4.762s
sys     0m2.574s

Or let us wrap the for loop of the first example in a function:

Rprof( )
x <- numeric( )
ffor <- function(){
    for( i in 1:10000){
      x <- c( x, rnorm(10) )
    }
}
ffor()
Rprof( NULL )
print( summaryRprof( ) )
  

$ time Rscript --vanilla prof3.R
$by.self
        self.time self.pct total.time total.pct
"ffor"        5.4     96.4        5.6     100.0
"rnorm"       0.2      3.6        0.2       3.6

$by.total
        total.time total.pct self.time self.pct
"ffor"         5.6     100.0       5.4     96.4
"rnorm"        0.2       3.6       0.2      3.6

$sampling.time
[1] 5.6

real    0m6.379s
user    0m5.408s
sys     0m0.717s



Maybe I am getting this all wrong; maybe the global assignment operator 
is responsible for some of the time in the second example. But how can I 
analyse the profiler's result in the first example, when it seems only 
interested in the 0.22 seconds and I want to know what is going on with 
the rest of the time?
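
As a quick, purely illustrative way to check whether <<- itself is to 
blame, one could time the growing loop with local assignment inside a 
function against one assigning to a global with <<- (the function names 
here are made up for the sketch):

```r
# Hypothetical check: is the time due to <<- or to growing the vector?
f_local <- function() {
  x <- numeric()
  for (i in 1:2000) x <- c(x, rnorm(10))   # local assignment
}
x <- numeric()
f_global <- function() {
  for (i in 1:2000) x <<- c(x, rnorm(10))  # global assignment via <<-
}
system.time(f_local())
system.time(f_global())
```

If the two timings are close, the cost is in the copying done by c(), 
not in <<-.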

Is it possible to treat "for" as a function when writing the profiler 
data, so that I can trust it more?

Romain

-- 
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr


