[Rd] Big speedup in install.packages() by re-using connections
Jeroen Ooms
jeroenooms using gmail.com
Thu Apr 25 14:45:04 CEST 2024
I'd like to raise this again now that 4.4 is out.
Below is a more complete patch, which includes a function to properly
clean up libcurl when R quits. Implementing this is a little tricky
because libcurl is a separate "module" in R. Perhaps there is a better
way, but this works:
view: https://github.com/r-devel/r-svn/pull/166/files
patch: https://github.com/r-devel/r-svn/pull/166.diff
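For readers who want the general shape of "keep one handle alive and clean
it up at exit" without reading the diff, here is a minimal Python sketch of
that lifecycle. It is not the code from the patch: PooledHandle,
get_shared_handle() and cleanup_shared_handle() are hypothetical names, and
the class is only a stand-in for a libcurl multi-handle.

```python
import atexit


class PooledHandle:
    """Hypothetical stand-in for a libcurl multi-handle (not a real API)."""

    def __init__(self):
        self.open = True
        self.downloads = 0

    def download(self, url):
        # A real handle would perform the transfer; here we only count calls.
        if not self.open:
            raise RuntimeError("handle already cleaned up")
        self.downloads += 1
        return self.downloads

    def cleanup(self):
        self.open = False


_shared = None  # one handle for the whole session


def get_shared_handle():
    """Create the shared handle on first use and register its cleanup."""
    global _shared
    if _shared is None:
        _shared = PooledHandle()
        atexit.register(cleanup_shared_handle)
    return _shared


def cleanup_shared_handle():
    """Release the shared handle; safe to call more than once."""
    global _shared
    if _shared is not None:
        _shared.cleanup()
        _shared = None
```

Every caller of get_shared_handle() gets the same object, so any state it
holds (in libcurl's case, the connection pool) survives between downloads
and is released once, when the interpreter exits.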
The old patch is still there as well. It is meant as a minimal proof
of concept to test the performance gains from reusing the connection:
view: https://github.com/r-devel/r-svn/pull/155/files
patch: https://github.com/r-devel/r-svn/pull/155.diff
Performance gains are greatest on high-bandwidth servers when
downloading many files from the same server (e.g. packages from a CRAN
mirror). In such cases, currently over 90% of the total time is spent
on initiating and tearing down a separate SSL connection for each file
download.
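To make the per-file connection cost concrete, here is a small
self-contained Python sketch (stdlib only, not R's implementation) that
counts how many TCP connections a local test server accepts when each file
gets its own connection versus when one connection is kept alive. The
server, the file names and count_connections() are assumptions made for
illustration. It uses plain HTTP, so it shows only the number of
connections; with TLS, each of those connections would also pay for a
handshake.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingServer(ThreadingHTTPServer):
    """Local stand-in for a mirror; counts accepted TCP connections."""

    accepted = 0

    def get_request(self):
        # Called once per accepted connection, not once per HTTP request.
        self.accepted += 1
        return super().get_request()


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows keep-alive

    def do_GET(self):
        body = b"fake package contents"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the output quiet


def count_connections(n_files, reuse):
    """Fetch n_files and return how many connections the server accepted."""
    server = CountingServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        conn = None
        for i in range(n_files):
            if conn is None or not reuse:
                conn = http.client.HTTPConnection(host, port)
            conn.request("GET", f"/pkg_{i}.tar.gz")
            conn.getresponse().read()  # drain the body before the next request
            if not reuse:
                conn.close()
        if reuse and conn is not None:
            conn.close()
        return server.accepted
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("new connection per file:", count_connections(20, reuse=False))
    print("one reused connection:  ", count_connections(20, reuse=True))
```

The first call opens one connection per file, the second opens a single
connection for all of them.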
Thoughts?
On Sat, Mar 2, 2024 at 3:07 PM Jeroen Ooms <jeroenooms using gmail.com> wrote:
>
> Currently download.file() creates and terminates a new TLS connection
> for each download. This creates a lot of overhead which is expensive
> for both client and server (in particular the TLS handshake). Modern
> internet clients (including browsers) re-use connections for many HTTP
> requests.
>
> We can do this in R by creating a persistent libcurl "multi-handle".
> The R libcurl implementation already uses a multi-handle, however it
> destroys it after each download, which defeats the purpose. The
> purpose of the multi-handle is to keep it alive and let libcurl
> maintain a persistent connection pool. This is particularly relevant
> for install.packages() which needs to download many files from one and
> the same server.
>
> Here is a bare-minimum proof-of-concept patch that re-uses one and the
> same multi-handle for all requests in R:
> https://github.com/r-devel/r-svn/pull/155/files
>
> Some quick benchmarking shows that this can lead to big speedups for
> download.packages() on high-bandwidth servers (such as CI). This quick
> test to download 100 packages from CRAN showed a more than 10x speedup
> for me: https://github.com/r-devel/r-svn/pull/155
>
> Moreover, I think this may make install.packages() more robust. In CI
> build logs that download many packages, I often see one or two
> downloads randomly failing with a TLS-connect error. I am hopeful this
> problem will disappear when using a single connection to the CRAN
> server to download all the packages.