[Rd] Big speedup in install.packages() by re-using connections
Jeroen Ooms
jeroenooms using gmail.com
Mon Sep 9 18:19:12 CEST 2024
On Mon, Sep 9, 2024 at 11:11 AM Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>
>
> On 9/8/24 23:14, Jeroen Ooms wrote:
> > On Mon, Sep 2, 2024 at 10:05 AM Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
> >>
> >> On 4/25/24 17:01, Ivan Krylov via R-devel wrote:
> >>> On Thu, 25 Apr 2024 14:45:04 +0200
> >>> Jeroen Ooms <jeroenooms using gmail.com> wrote:
> >>>
> >>>> Thoughts?
> >>> How verboten would it be to create an empty external pointer object,
> >>> add it to the preserved list, and set an on-exit finalizer to clean up
> >>> the curl multi-handle? As far as I can tell, the internet module is not
> >>> supposed to be unloaded, so this would not introduce an opportunity to
> >>> jump to an unmapped address. This makes it possible to avoid adding a
> >>> CurlCleanup() function to the internet module:
> >> Cleaning up this way in principle would probably be fine, but R already
> >> has support for re-using connections. Even more, R can download files in
> >> parallel (in a single thread), which particularly helps with bigger
> >> latencies (e.g. typically users connecting from home, etc). See
> >> ?download.file(), look for "simultaneous".
> > Thank you for looking at this. A few ideas wrt parallel downloading:
> >
> > Additional improvement on Windows can be achieved by enabling the
> > nghttp2 driver in libcurl in rtools, such that it takes advantage of
> > http2 multiplexing for parallel downloads
> > (https://bugs.r-project.org/show_bug.cgi?id=18664).
>
> Anyone who wants to cooperate and help is more than welcome to
> contribute patches to upstream MXE.
>
> In the case of nghttp2, thanks go to Andrew Johnson, who contributed
> nghttp2 support to upstream MXE. It will be part of the next Rtools
> (probably Rtools45).
>
> > Moreover, one concern is that install.packages() may fail more
> > frequently on low bandwidth connections due to reaching the "download
> > timeout" when downloading files in parallel:
> >
> > R has an unusual definition of the HTTP timeout, which by default
> > aborts in-progress downloads after 60 seconds for no obvious reason.
> > (By contrast, browsers enforce a timeout on unresponsive/stalled
> > downloads only, which can be achieved in libcurl by setting
> > CURLOPT_CONNECTTIMEOUT or CURLOPT_LOW_SPEED_TIME.)
> >
> > The above is already a problem on slow networks, where large packages
> > can fail to install with a timeout error in the download stage. Users
> > may assume there must be a problem with the network, as it is not
> > obvious that machines on slower internet connections need to work
> > around R's defaults and modify options(timeout) before
> > install.packages(). This problem could become more prevalent when
> > using parallel downloads while still enforcing the same total timeout.
> >
> > For example: the macOS binary for package "sf" is close to 90 MB, hence
> > currently, under the default R settings of options(timeout=60),
> > install.packages() will error with a download timeout on clients with
> > less than 1.5 MB/s bandwidth. But with the parallel implementation,
> > install.packages() will share the bandwidth on 6 parallel downloads,
> > so if "sf" is downloaded with all its dependencies, we need at least
> > 9 MB/s (about 72 Mbit/s, i.e. in practice a 100 Mbit connection) for
> > the default settings to not cause a timeout.
> >
> > Hopefully this can be revised to enforce the timeout on stalled
> > downloads only, as is common practice.
>
> Yes, this is work in progress, I am aware that the timeout could use
> some thought re simultaneous downloads.
OK that is good to hear.
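To make the numbers quoted above concrete, here is a minimal sketch in
R. min_bandwidth() is a hypothetical helper, not part of R, and it
assumes the link is shared evenly between the parallel downloads; the
timeout value of 600 is only an illustration of the workaround.

```r
# Hypothetical helper (not part of R): minimum sustained bandwidth in
# MB/s needed to finish a download of `size_mb` within a total timeout
# of `timeout_s` seconds, assuming `n_parallel` downloads share the
# link evenly.
min_bandwidth <- function(size_mb, timeout_s = 60, n_parallel = 1) {
  n_parallel * size_mb / timeout_s
}

single   <- min_bandwidth(90)                  # one ~90 MB download
parallel <- min_bandwidth(90, n_parallel = 6)  # six downloads at once
parallel_mbit <- parallel * 8                  # MB/s to Mbit/s

# Workaround on slow links: raise the total timeout before installing.
# 600 is an arbitrary illustrative value, not a recommendation.
old <- options(timeout = max(600, getOption("timeout")))
# install.packages("sf")
options(old)  # restore the previous setting
```

With the default 60 second timeout this gives 1.5 MB/s for a single
download and 9 MB/s (72 Mbit/s) when six downloads share the link.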
> If anyone wants to help with testing the current implementation of
> simultaneous download and report any bugs found, that would be nice.
R-universe has run this a few thousand times to recheck packages on
r-devel on both Linux and Windows, and it works well. It reduces the
CI process by a few seconds, and there are fewer random connection
failures. If you want to inspect some recent logs for yourself, click
the rightmost column on https://r-universe.dev/builds and then on the
GitHub Actions page, look under the "Build R-devel for Windows /
Linux" runs to see the log files.
I was also able to confirm an edge case that install.packages() does
not abort if any of the dependencies fails to download with an HTTP 404,
which I think is desired behavior. If there is anything else
specifically that you would like to see tested I can look at that.
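
For anyone who wants to try simultaneous downloads outside of
install.packages(), a rough sketch follows. It relies on the documented
behaviour that download.file() with method = "libcurl" accepts
character vectors for url and destfile. The helpers dest_for() and
fetch_all() and the example.org URLs are hypothetical, and the sketch
makes no claim about what status is reported when one URL fails.

```r
# Hypothetical helper: map each URL to a file name inside `destdir`.
dest_for <- function(urls, destdir = tempdir()) {
  file.path(destdir, basename(urls))
}

# Hypothetical wrapper around a vectorised download.file() call.
# With method = "libcurl", `url` and `destfile` may be character
# vectors of the same length, and the files are fetched simultaneously.
fetch_all <- function(urls, destdir = tempdir()) {
  dest <- dest_for(urls, destdir)
  status <- tryCatch(
    download.file(urls, dest, method = "libcurl", quiet = TRUE),
    error = function(e) conditionMessage(e)
  )
  # Report the raw status and which files actually arrived.
  list(status = status, downloaded = file.exists(dest))
}

# Usage (placeholder URLs, the second one is meant to be missing):
# fetch_all(c("https://example.org/pkgA_1.0.tar.gz",
#             "https://example.org/does-not-exist.tar.gz"))
```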