[Rd] execution time of .packages

Prof Brian Ripley ripley at stats.ox.ac.uk
Tue Mar 3 13:41:44 CET 2009


Let me repeat: what is happening for me in the equivalent of your 
35.589 - 1.18 seconds is that R is waiting for my OS to read its discs 
(and they can be heard chuntering away).  As the R process is not 
runniing at those times, the profiler is not running either (on a 
Unix-alike: on Windows the profiler does measure elapsed time).  I 
expect it will be the same explanation for you.

What I have already suggested is that if you want to save time, do not 
read and check the package.rds files.   As far as I can see they were 
checked at installation in any recent version of R.  Just check their 
existence.

On Tue, 3 Mar 2009, Romain Francois wrote:

> Prof Brian Ripley wrote:
>> On Tue, 3 Mar 2009, Romain Francois wrote:
>> 
>>> Prof Brian Ripley wrote:
>>>> The caching is in the disc system: you need to find and read the package 
>>>> metadata for every package.  AFAIK it is not easy to flush the disc 
>>>> cache, but quite easy to overwrite it with later reads.  (Google for more 
>>>> info.)
>>> Thanks for the info, I'll try to find my way with these directions.
>>> 
>>>> If you are not concerned about validity of the installed packages you 
>>>> could skip the tests and hence the reads.
>>>> 
>>>> Your times are quite a bit slower than mine, so a faster disc system 
>>>> might help.  Since my server has just been rebooted (for a new kernel), 
>>>> with all of CRAN and most of BioC I get
>>>> 
>>>>> system.time( packs <- .packages( all = T ) )
>>>>    user  system elapsed
>>>>   0.518   0.262  25.042
>>>>> system.time( packs <- .packages( all = T ) )
>>>>    user  system elapsed
>>>>   0.442   0.080   0.522
>>>>> length(packs)
>>>> [1] 2096
>>>> 
>>>> There's a similar issue when installing packages: the Perl code reads the 
>>>> indices from every visible package to resolve links, and that can be slow 
>>>> the first time.
>>>> 
>>>> 
>>>> On Tue, 3 Mar 2009, Romain Francois wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> The first time in a session I call .packages( all.available = T ), it 
>>>>> takes a long time (I have many packages installed from CRAN):
>>>>> 
>>>>>> system.time( packs <- .packages( all = T ) )
>>>>> user  system elapsed
>>>>> 0.738   0.276  43.787
>>>>> 
>>>>> When I call it again, the time is now much reduced, so there must be 
>>>>> some caching somewhere. I would like to try to reduce the  time it takes 
>>>>> the first time, but I have not been able to identify where the caching 
>>>>> takes place, and so how I can remove it to try to improve the running 
>>>>> time without the caching. Without this, I have to restart my computer 
>>>>> each time to vanish the caching to test a new version of the function 
>>>>> (this is not going to happen)
>>>>> 
>>>>> Here is the .packages function, I am suspicious about this part : "ans 
>>>>> <- c(ans, nam)" which grows the ans vector each time a suitable package 
>>>>> is found, this does not sound right.
>>>> 
>>>> It's OK as there are only going to be ca 2000 packages.  Try profiling 
>>>> this: .readRDS and grepl take most of the time.
>> 
>>> I usually do not trust the result of the profiler when a for loop is 
>>> involved, as it tends to miss the point (or maybe I am).
>> 
>> Here are the data for the actual example (repeated for this message):
>> 
>>> Rprof()
>>> system.time( packs <- .packages( all = T ) )
>>    user  system elapsed
>>   0.447   0.078   0.525
>>> Rprof(NULL)
>>> summaryRprof()
>> $by.self
>>                    self.time self.pct total.time total.pct
>> "grepl"                 0.18     34.6       0.18      34.6
>> ".readRDS"              0.12     23.1       0.20      38.5
>> ".packages"             0.08     15.4       0.50      96.2
>> "close.connection"      0.04      7.7       0.04       7.7
>> "close"                 0.02      3.8       0.06      11.5
>> "file.exists"           0.02      3.8       0.02       3.8
>> "gc"                    0.02      3.8       0.02       3.8
>> "gzfile"                0.02      3.8       0.02       3.8
>> "list"                  0.02      3.8       0.02       3.8
>> "system.time"           0.00      0.0       0.52     100.0
>> "file.path"             0.00      0.0       0.02       3.8
>> 
>> $by.total
>>                    total.time total.pct self.time self.pct
>> "system.time"            0.52     100.0      0.00      0.0
>> ".packages"              0.50      96.2      0.08     15.4
>> ".readRDS"               0.20      38.5      0.12     23.1
>> "grepl"                  0.18      34.6      0.18     34.6
>> "close"                  0.06      11.5      0.02      3.8
>> "close.connection"       0.04       7.7      0.04      7.7
>> "file.exists"            0.02       3.8      0.02      3.8
>> "gc"                     0.02       3.8      0.02      3.8
>> "gzfile"                 0.02       3.8      0.02      3.8
>> "list"                   0.02       3.8      0.02      3.8
>> "file.path"              0.02       3.8      0.00      0.0
>> 
>> $sampling.time
>> [1] 0.52
>> 
>> there is little tiime unaccounted for, and 0.38 sec is going in .readRDS 
>> and grepl.  Whereas
>> 
>> system.time({
>> ans <- character(0)
>> for(i in 1:2096) ans <- c(ans, "foo")
>> })
>> 
>> takes 0.024 secs, negligible here (one profiler tick).
> Here is what happens to me if I restart the computer:
>
>> Rprof( )
>> system.time( packs <- .packages( all = T ) )
>  user  system elapsed
> 0.888   0.342  35.589
>> Rprof(NULL)
>> summaryRprof()
> $by.self
>                  self.time self.pct total.time total.pct
> ".readRDS"              0.34     28.8       0.64      54.2
> ".packages"             0.14     11.9       1.16      98.3
> "file.exists"           0.14     11.9       0.14      11.9
> "gzfile"                0.12     10.2       0.16      13.6
> "close"                 0.10      8.5       0.14      11.9
> "grepl"                 0.08      6.8       0.10       8.5
> "$"                     0.08      6.8       0.08       6.8
> "file.path"             0.06      5.1       0.06       5.1
> "close.connection"      0.04      3.4       0.04       3.4
> "getOption"             0.02      1.7       0.04       3.4
> "as.character"          0.02      1.7       0.02       1.7
> "gc"                    0.02      1.7       0.02       1.7
> "options"               0.02      1.7       0.02       1.7
> "system.time"           0.00      0.0       1.18     100.0
>
> $by.total
>                  total.time total.pct self.time self.pct
> "system.time"            1.18     100.0      0.00      0.0
> ".packages"              1.16      98.3      0.14     11.9
> ".readRDS"               0.64      54.2      0.34     28.8
> "gzfile"                 0.16      13.6      0.12     10.2
> "file.exists"            0.14      11.9      0.14     11.9
> "close"                  0.14      11.9      0.10      8.5
> "grepl"                  0.10       8.5      0.08      6.8
> "$"                      0.08       6.8      0.08      6.8
> "file.path"              0.06       5.1      0.06      5.1
> "close.connection"       0.04       3.4      0.04      3.4
> "getOption"              0.04       3.4      0.02      1.7
> "as.character"           0.02       1.7      0.02      1.7
> "gc"                     0.02       1.7      0.02      1.7
> "options"                0.02       1.7      0.02      1.7
>
> $sampling.time
> [1] 1.18
>
> I'd like to know what is happening in the 35.589  - 1.18 seconds, and the 
> profiler won't tell me.
>
> About the time spent by grepl, we could take this down by calling it once 
> instead of many times
>
>> system.time( grepl( valid_package_version_regexp, versions ) )
>  user  system elapsed
> 0.003   0.000   0.009
>> system.time( for(v in versions) grepl( valid_package_version_regexp, v ) ) 
> user  system elapsed
> 0.100   0.000   0.136
>
>>> Consider this script below,
>> 
>> Whether profiling works in other examples is beside the point here.
> I'll make another thread on that.
>
> -- 
> Romain Francois
> Independent R Consultant
> +33(0) 6 28 91 30 30
> http://romainfrancois.blog.free.fr
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



More information about the R-devel mailing list