[Rd] Big speedup in install.packages() by re-using connections
Jeroen Ooms
jeroenoom@ @end|ng |rom gm@||@com
Sat Mar 2 15:07:08 CET 2024
Currently download.file() creates and terminates a new TLS connection
for each download. This adds a lot of overhead, in particular the TLS
handshake, which is expensive for both client and server. Modern
internet clients (including browsers) re-use connections for many HTTP
requests.
We can do this in R by creating a persistent libcurl "multi-handle".
The R libcurl implementation already uses a multi-handle, but it
destroys it after each download, which defeats the purpose. A
multi-handle is meant to be kept alive so that libcurl can maintain a
persistent connection pool across requests. This is particularly
relevant for install.packages(), which needs to download many files
from one and the same server.
Here is a bare-minimum proof-of-concept patch that re-uses one and the
same multi-handle for all requests in R:
https://github.com/r-devel/r-svn/pull/155/files
Some quick benchmarking shows that this can lead to big speedups for
download.packages() on high-bandwidth servers (such as CI). A quick
test that downloads 100 packages from CRAN showed a more than 10x
speedup for me: https://github.com/r-devel/r-svn/pull/155
Moreover, I think this may make install.packages() more robust. In CI
build logs that download many packages, I often see one or two
downloads randomly failing with a TLS-connect error. I am hopeful this
problem will disappear when using a single connection to the CRAN
server to download all the packages.