[Rd] execution time of .packages
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue Mar 3 13:41:44 CET 2009
Let me repeat: what is happening for me in the equivalent of your
35.589 - 1.18 seconds is that R is waiting for my OS to read its discs
(and they can be heard chuntering away). As the R process is not
runniing at those times, the profiler is not running either (on a
Unix-alike: on Windows the profiler does measure elapsed time). I
expect it will be the same explanation for you.
What I have already suggested is that if you want to save time, do not
read and check the package.rds files. As far as I can see they were
checked at installation in any recent version of R. Just check their
existence.
On Tue, 3 Mar 2009, Romain Francois wrote:
> Prof Brian Ripley wrote:
>> On Tue, 3 Mar 2009, Romain Francois wrote:
>>
>>> Prof Brian Ripley wrote:
>>>> The caching is in the disc system: you need to find and read the package
>>>> metadata for every package. AFAIK it is not easy to flush the disc
>>>> cache, but quite easy to overwrite it with later reads. (Google for more
>>>> info.)
>>> Thanks for the info, I'll try to find my way with these directions.
>>>
>>>> If you are not concerned about validity of the installed packages you
>>>> could skip the tests and hence the reads.
>>>>
>>>> Your times are quite a bit slower than mine, so a faster disc system
>>>> might help. Since my server has just been rebooted (for a new kernel),
>>>> with all of CRAN and most of BioC I get
>>>>
>>>>> system.time( packs <- .packages( all = T ) )
>>>> user system elapsed
>>>> 0.518 0.262 25.042
>>>>> system.time( packs <- .packages( all = T ) )
>>>> user system elapsed
>>>> 0.442 0.080 0.522
>>>>> length(packs)
>>>> [1] 2096
>>>>
>>>> There's a similar issue when installing packages: the Perl code reads the
>>>> indices from every visible package to resolve links, and that can be slow
>>>> the first time.
>>>>
>>>>
>>>> On Tue, 3 Mar 2009, Romain Francois wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> The first time in a session I call .packages( all.available = T ), it
>>>>> takes a long time (I have many packages installed from CRAN):
>>>>>
>>>>>> system.time( packs <- .packages( all = T ) )
>>>>> user system elapsed
>>>>> 0.738 0.276 43.787
>>>>>
>>>>> When I call it again, the time is now much reduced, so there must be
>>>>> some caching somewhere. I would like to try to reduce the time it takes
>>>>> the first time, but I have not been able to identify where the caching
>>>>> takes place, and so how I can remove it to try to improve the running
>>>>> time without the caching. Without this, I have to restart my computer
>>>>> each time to vanish the caching to test a new version of the function
>>>>> (this is not going to happen)
>>>>>
>>>>> Here is the .packages function, I am suspicious about this part : "ans
>>>>> <- c(ans, nam)" which grows the ans vector each time a suitable package
>>>>> is found, this does not sound right.
>>>>
>>>> It's OK as there are only going to be ca 2000 packages. Try profiling
>>>> this: .readRDS and grepl take most of the time.
>>
>>> I usually do not trust the result of the profiler when a for loop is
>>> involved, as it tends to miss the point (or maybe I am).
>>
>> Here are the data for the actual example (repeated for this message):
>>
>>> Rprof()
>>> system.time( packs <- .packages( all = T ) )
>> user system elapsed
>> 0.447 0.078 0.525
>>> Rprof(NULL)
>>> summaryRprof()
>> $by.self
>> self.time self.pct total.time total.pct
>> "grepl" 0.18 34.6 0.18 34.6
>> ".readRDS" 0.12 23.1 0.20 38.5
>> ".packages" 0.08 15.4 0.50 96.2
>> "close.connection" 0.04 7.7 0.04 7.7
>> "close" 0.02 3.8 0.06 11.5
>> "file.exists" 0.02 3.8 0.02 3.8
>> "gc" 0.02 3.8 0.02 3.8
>> "gzfile" 0.02 3.8 0.02 3.8
>> "list" 0.02 3.8 0.02 3.8
>> "system.time" 0.00 0.0 0.52 100.0
>> "file.path" 0.00 0.0 0.02 3.8
>>
>> $by.total
>> total.time total.pct self.time self.pct
>> "system.time" 0.52 100.0 0.00 0.0
>> ".packages" 0.50 96.2 0.08 15.4
>> ".readRDS" 0.20 38.5 0.12 23.1
>> "grepl" 0.18 34.6 0.18 34.6
>> "close" 0.06 11.5 0.02 3.8
>> "close.connection" 0.04 7.7 0.04 7.7
>> "file.exists" 0.02 3.8 0.02 3.8
>> "gc" 0.02 3.8 0.02 3.8
>> "gzfile" 0.02 3.8 0.02 3.8
>> "list" 0.02 3.8 0.02 3.8
>> "file.path" 0.02 3.8 0.00 0.0
>>
>> $sampling.time
>> [1] 0.52
>>
>> there is little tiime unaccounted for, and 0.38 sec is going in .readRDS
>> and grepl. Whereas
>>
>> system.time({
>> ans <- character(0)
>> for(i in 1:2096) ans <- c(ans, "foo")
>> })
>>
>> takes 0.024 secs, negligible here (one profiler tick).
> Here is what happens to me if I restart the computer:
>
>> Rprof( )
>> system.time( packs <- .packages( all = T ) )
> user system elapsed
> 0.888 0.342 35.589
>> Rprof(NULL)
>> summaryRprof()
> $by.self
> self.time self.pct total.time total.pct
> ".readRDS" 0.34 28.8 0.64 54.2
> ".packages" 0.14 11.9 1.16 98.3
> "file.exists" 0.14 11.9 0.14 11.9
> "gzfile" 0.12 10.2 0.16 13.6
> "close" 0.10 8.5 0.14 11.9
> "grepl" 0.08 6.8 0.10 8.5
> "$" 0.08 6.8 0.08 6.8
> "file.path" 0.06 5.1 0.06 5.1
> "close.connection" 0.04 3.4 0.04 3.4
> "getOption" 0.02 1.7 0.04 3.4
> "as.character" 0.02 1.7 0.02 1.7
> "gc" 0.02 1.7 0.02 1.7
> "options" 0.02 1.7 0.02 1.7
> "system.time" 0.00 0.0 1.18 100.0
>
> $by.total
> total.time total.pct self.time self.pct
> "system.time" 1.18 100.0 0.00 0.0
> ".packages" 1.16 98.3 0.14 11.9
> ".readRDS" 0.64 54.2 0.34 28.8
> "gzfile" 0.16 13.6 0.12 10.2
> "file.exists" 0.14 11.9 0.14 11.9
> "close" 0.14 11.9 0.10 8.5
> "grepl" 0.10 8.5 0.08 6.8
> "$" 0.08 6.8 0.08 6.8
> "file.path" 0.06 5.1 0.06 5.1
> "close.connection" 0.04 3.4 0.04 3.4
> "getOption" 0.04 3.4 0.02 1.7
> "as.character" 0.02 1.7 0.02 1.7
> "gc" 0.02 1.7 0.02 1.7
> "options" 0.02 1.7 0.02 1.7
>
> $sampling.time
> [1] 1.18
>
> I'd like to know what is happening in the 35.589 - 1.18 seconds, and the
> profiler won't tell me.
>
> About the time spent by grepl, we could take this down by calling it once
> instead of many times
>
>> system.time( grepl( valid_package_version_regexp, versions ) )
> user system elapsed
> 0.003 0.000 0.009
>> system.time( for(v in versions) grepl( valid_package_version_regexp, v ) )
> user system elapsed
> 0.100 0.000 0.136
>
>>> Consider this script below,
>>
>> Whether profiling works in other examples is beside the point here.
> I'll make another thread on that.
>
> --
> Romain Francois
> Independent R Consultant
> +33(0) 6 28 91 30 30
> http://romainfrancois.blog.free.fr
>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
More information about the R-devel
mailing list