[Rd] execution time of .packages
Romain Francois
romain.francois at dbmail.com
Tue Mar 3 13:39:10 CET 2009
Prof Brian Ripley wrote:
> Let me repeat: what is happening for me in the equivalent of your
> 35.589 - 1.18 seconds is that R is waiting for my OS to read its discs
> (and they can be heard chuntering away). As the R process is not
> runniing at those times, the profiler is not running either (on a
> Unix-alike: on Windows the profiler does measure elapsed time). I
> expect it will be the same explanation for you.
Thank you. I get it this time.
> What I have already suggested is that if you want to save time, do not
> read and check the package.rds files. As far as I can see they were
> checked at installation in any recent version of R. Just check their
> existence.
>
> On Tue, 3 Mar 2009, Romain Francois wrote:
>
>> Prof Brian Ripley wrote:
>>> On Tue, 3 Mar 2009, Romain Francois wrote:
>>>
>>>> Prof Brian Ripley wrote:
>>>>> The caching is in the disc system: you need to find and read the
>>>>> package metadata for every package. AFAIK it is not easy to flush
>>>>> the disc cache, but quite easy to overwrite it with later reads.
>>>>> (Google for more info.)
>>>> Thanks for the info, I'll try to find my way with these directions.
>>>>
>>>>> If you are not concerned about validity of the installed packages
>>>>> you could skip the tests and hence the reads.
>>>>>
>>>>> Your times are quite a bit slower than mine, so a faster disc
>>>>> system might help. Since my server has just been rebooted (for a
>>>>> new kernel), with all of CRAN and most of BioC I get
>>>>>
>>>>>> system.time( packs <- .packages( all = T ) )
>>>>> user system elapsed
>>>>> 0.518 0.262 25.042
>>>>>> system.time( packs <- .packages( all = T ) )
>>>>> user system elapsed
>>>>> 0.442 0.080 0.522
>>>>>> length(packs)
>>>>> [1] 2096
>>>>>
>>>>> There's a similar issue when installing packages: the Perl code
>>>>> reads the indices from every visible package to resolve links, and
>>>>> that can be slow the first time.
>>>>>
>>>>>
>>>>> On Tue, 3 Mar 2009, Romain Francois wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> The first time in a session I call .packages( all.available = T
>>>>>> ), it takes a long time (I have many packages installed from CRAN):
>>>>>>
>>>>>>> system.time( packs <- .packages( all = T ) )
>>>>>> user system elapsed
>>>>>> 0.738 0.276 43.787
>>>>>>
>>>>>> When I call it again, the time is now much reduced, so there must
>>>>>> be some caching somewhere. I would like to try to reduce the
>>>>>> time it takes the first time, but I have not been able to
>>>>>> identify where the caching takes place, and so how I can remove
>>>>>> it to try to improve the running time without the caching.
>>>>>> Without this, I have to restart my computer each time to vanish
>>>>>> the caching to test a new version of the function (this is not
>>>>>> going to happen)
>>>>>>
>>>>>> Here is the .packages function, I am suspicious about this part :
>>>>>> "ans <- c(ans, nam)" which grows the ans vector each time a
>>>>>> suitable package is found, this does not sound right.
>>>>>
>>>>> It's OK as there are only going to be ca 2000 packages. Try
>>>>> profiling this: .readRDS and grepl take most of the time.
>>>
>>>> I usually do not trust the result of the profiler when a for loop
>>>> is involved, as it tends to miss the point (or maybe I am).
>>>
>>> Here are the data for the actual example (repeated for this message):
>>>
>>>> Rprof()
>>>> system.time( packs <- .packages( all = T ) )
>>> user system elapsed
>>> 0.447 0.078 0.525
>>>> Rprof(NULL)
>>>> summaryRprof()
>>> $by.self
>>> self.time self.pct total.time total.pct
>>> "grepl" 0.18 34.6 0.18 34.6
>>> ".readRDS" 0.12 23.1 0.20 38.5
>>> ".packages" 0.08 15.4 0.50 96.2
>>> "close.connection" 0.04 7.7 0.04 7.7
>>> "close" 0.02 3.8 0.06 11.5
>>> "file.exists" 0.02 3.8 0.02 3.8
>>> "gc" 0.02 3.8 0.02 3.8
>>> "gzfile" 0.02 3.8 0.02 3.8
>>> "list" 0.02 3.8 0.02 3.8
>>> "system.time" 0.00 0.0 0.52 100.0
>>> "file.path" 0.00 0.0 0.02 3.8
>>>
>>> $by.total
>>> total.time total.pct self.time self.pct
>>> "system.time" 0.52 100.0 0.00 0.0
>>> ".packages" 0.50 96.2 0.08 15.4
>>> ".readRDS" 0.20 38.5 0.12 23.1
>>> "grepl" 0.18 34.6 0.18 34.6
>>> "close" 0.06 11.5 0.02 3.8
>>> "close.connection" 0.04 7.7 0.04 7.7
>>> "file.exists" 0.02 3.8 0.02 3.8
>>> "gc" 0.02 3.8 0.02 3.8
>>> "gzfile" 0.02 3.8 0.02 3.8
>>> "list" 0.02 3.8 0.02 3.8
>>> "file.path" 0.02 3.8 0.00 0.0
>>>
>>> $sampling.time
>>> [1] 0.52
>>>
>>> there is little tiime unaccounted for, and 0.38 sec is going in
>>> .readRDS and grepl. Whereas
>>>
>>> system.time({
>>> ans <- character(0)
>>> for(i in 1:2096) ans <- c(ans, "foo")
>>> })
>>>
>>> takes 0.024 secs, negligible here (one profiler tick).
>> Here is what happens to me if I restart the computer:
>>
>>> Rprof( )
>>> system.time( packs <- .packages( all = T ) )
>> user system elapsed
>> 0.888 0.342 35.589
>>> Rprof(NULL)
>>> summaryRprof()
>> $by.self
>> self.time self.pct total.time total.pct
>> ".readRDS" 0.34 28.8 0.64 54.2
>> ".packages" 0.14 11.9 1.16 98.3
>> "file.exists" 0.14 11.9 0.14 11.9
>> "gzfile" 0.12 10.2 0.16 13.6
>> "close" 0.10 8.5 0.14 11.9
>> "grepl" 0.08 6.8 0.10 8.5
>> "$" 0.08 6.8 0.08 6.8
>> "file.path" 0.06 5.1 0.06 5.1
>> "close.connection" 0.04 3.4 0.04 3.4
>> "getOption" 0.02 1.7 0.04 3.4
>> "as.character" 0.02 1.7 0.02 1.7
>> "gc" 0.02 1.7 0.02 1.7
>> "options" 0.02 1.7 0.02 1.7
>> "system.time" 0.00 0.0 1.18 100.0
>>
>> $by.total
>> total.time total.pct self.time self.pct
>> "system.time" 1.18 100.0 0.00 0.0
>> ".packages" 1.16 98.3 0.14 11.9
>> ".readRDS" 0.64 54.2 0.34 28.8
>> "gzfile" 0.16 13.6 0.12 10.2
>> "file.exists" 0.14 11.9 0.14 11.9
>> "close" 0.14 11.9 0.10 8.5
>> "grepl" 0.10 8.5 0.08 6.8
>> "$" 0.08 6.8 0.08 6.8
>> "file.path" 0.06 5.1 0.06 5.1
>> "close.connection" 0.04 3.4 0.04 3.4
>> "getOption" 0.04 3.4 0.02 1.7
>> "as.character" 0.02 1.7 0.02 1.7
>> "gc" 0.02 1.7 0.02 1.7
>> "options" 0.02 1.7 0.02 1.7
>>
>> $sampling.time
>> [1] 1.18
>>
>> I'd like to know what is happening in the 35.589 - 1.18 seconds, and
>> the profiler won't tell me.
>>
>> About the time spent by grepl, we could take this down by calling it
>> once instead of many times
>>
>>> system.time( grepl( valid_package_version_regexp, versions ) )
>> user system elapsed
>> 0.003 0.000 0.009
>>> system.time( for(v in versions) grepl( valid_package_version_regexp,
>>> v ) )
>> user system elapsed
>> 0.100 0.000 0.136
>>
>>>> Consider this script below,
>>>
>>> Whether profiling works in other examples is beside the point here.
>> I'll make another thread on that.
>>
>> --
>> Romain Francois
>> Independent R Consultant
>> +33(0) 6 28 91 30 30
>> http://romainfrancois.blog.free.fr
>>
>>
>
--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
More information about the R-devel
mailing list