[Rd] Faster downloads: avoid them if possible
Lluís Revilla
||u|@@rev|||@ @end|ng |rom gm@||@com
Tue Dec 10 22:21:35 CET 2024
Dear Tomas and list,
On Tue, 10 Dec 2024 at 11:33, Tomas Kalibera <tomas.kalibera using gmail.com> wrote:
>
> On 12/10/24 00:35, Lluís Revilla wrote:
> > Dear R-devel,
> >
> > I read with interest the recent blog post on how R will have parallel
> > downloads, on blog.r-project.org
> > (https://blog.r-project.org/2024/12/02/faster-downloads/index.html).
> > Thanks Tomas!
> >
> > The blog mentions that one of the areas where this will be observed is
> > while installing packages (which I did!). However, I noticed they might
> > be downloaded multiple times: for example, if one interrupts
> > install.packages (via Ctrl+C), if it fails because a system dependency
> > is missing and I fix that in a different terminal session, or if the
> > internet connection is cut and I try again.
>
> Yes, and this has been the case before - it's not new for simultaneous
> downloads.
Indeed, this behavior predates the recent change; the post just reminded
me to look into it.
The change described in the post will help when the internet connection is
good and downloading is the bottleneck.
My proposal could help those without a good internet connection or with
other issues.
> > One possible way to make installations/downloads faster and also
> > reduce the bandwidth of repositories (and its mirrors) would be to
> > check if they need to be downloaded (again).
> > The PACKAGES file in <repo>/src/contrib includes the MD5sum field, which
> > could be used to check packages in the local folder (but it might be
> > faster to first check if any file exists there for the same package).
> >
> > In short, I propose:
> > 1) Checking, before downloading packages, whether they already exist in
> > the destdir directory used by install.packages.
> > 2) I suppose the most common scenario is to use install.packages with
> > the default destdir parameter (NULL). If 1) is implemented it might be
> > useful to keep the temporary directory common for a single R session.
>
> When destdir is NULL (the default), non-local packages are downloaded to
> a subdirectory of the temporary session directory (see
> ?install.packages), so the downloaded files would be readily available
> to further installation attempts done by the same R session.
Perhaps the following test, which reinstalls the same package, is more
illustrative, as we can see that the package is downloaded again:
# R Under development (unstable) (2024-12-07 r87428)
td <- tempdir()
install.packages("BaseSet", destdir = td, lib = tempdir())
# trying URL 'https://ftp.cixug.es/CRAN/src/contrib/BaseSet_0.9.0.tar.gz'
# Content type 'application/octet-stream' length 784108 bytes (765 KB)
# ==================================================
# downloaded 765 KB
#....
list.files(td)
# [1] "BaseSet" "BaseSet_0.9.0.tar.gz"
file.info(file.path(td, "BaseSet"))
#                         size isdir mode               mtime               ctime
# /tmp/RtmpO6DpoV/BaseSet 4096  TRUE  755 2024-12-10 17:32:50 2024-12-10 17:32:52
#                                       atime  uid  gid uname grname
# /tmp/RtmpO6DpoV/BaseSet 2024-12-10 17:32:52 1000 1000 lluis  lluis
install.packages("BaseSet", destdir = td, lib = tempdir())
# trying URL 'https://ftp.cixug.es/CRAN/src/contrib/BaseSet_0.9.0.tar.gz'
# Content type 'application/octet-stream' length 784108 bytes (765 KB)
# ==================================================
# downloaded 765 KB
#....
list.files(td)
# [1] "BaseSet" "BaseSet_0.9.0.tar.gz"
file.info(file.path(td, "BaseSet"))
#                         size isdir mode               mtime               ctime
# /tmp/RtmpO6DpoV/BaseSet 4096  TRUE  755 2024-12-10 17:41:18 2024-12-10 17:41:20
#                                       atime  uid  gid uname grname
# /tmp/RtmpO6DpoV/BaseSet 2024-12-10 17:41:20 1000 1000 lluis  lluis
Note the progress bar for downloading the package even though the tarball
is already present in destdir, and the changed mtime of the folder showing
the updated time.
By default install.packages uses a temporary folder that is set internally
and changes with each call, which results in the same behaviour: packages
are downloaded again even when it is not needed (there was no new BaseSet
release between these two calls).
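
As a rough sketch of the existence check proposed in 1), assuming source
tarballs are named <Package>_<Version>.tar.gz as in the output above. The
helper name tarball_present() is hypothetical and not part of R:

```r
# Hypothetical helper (not part of R): report which source tarballs are
# already present in destdir, assuming the <Package>_<Version>.tar.gz naming.
tarball_present <- function(pkgs, versions, destdir) {
  files <- file.path(destdir, paste0(pkgs, "_", versions, ".tar.gz"))
  present <- file.exists(files)
  names(present) <- pkgs
  present
}

# Usage with a throw-away directory and an empty placeholder file.
dd <- tempfile("downloads")
dir.create(dd)
file.create(file.path(dd, "BaseSet_0.9.0.tar.gz"))
res <- tarball_present(c("BaseSet", "notthere"), c("0.9.0", "1.0"), dd)
```

A check on existence alone cannot tell a complete tarball from a truncated
one, so it would only be a first, cheap filter.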
>
> I think we could at some point extend download.file() to support re-use
> of already downloaded files, so that it can continue an interrupted
> download of a single file or re-use the whole file.
>
> This shouldn't be
> the default because the files in general may change between downloads,
> and may be even from different URLs, but it could be used by
> install.packages(), where this shouldn't happen, at least when destdir
> is NULL.
This would be great! I am sure it will have many uses beyond install.packages.
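
To make the idea concrete, here is a minimal sketch of the "re-use the whole
file" case only. The wrapper download_if_missing() and its `download`
argument are hypothetical, not an existing or planned R API; continuing an
interrupted download is not covered:

```r
# Hypothetical wrapper (not an existing R API): skip the download when
# destfile already exists and is non-empty. The downloader is passed as an
# argument so the logic can be exercised without network access.
download_if_missing <- function(url, destfile,
                                download = utils::download.file, ...) {
  if (file.exists(destfile) && file.size(destfile) > 0) {
    # Mirror the success code (0) that download.file() returns invisibly.
    return(invisible(0L))
  }
  invisible(download(url, destfile, ...))
}
```

Such a wrapper trusts whatever is on disk, so it only makes sense where the
file behind a given URL is not expected to change.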
>
> I think an extra round of checking checksums shouldn't be
> needed in install.packages().
As you mentioned, files might change on the websites, and downloading a
file again ensures users get the latest version.
But if no new download occurs, users could install an old version of a package.
That's why I suggested checking the downloaded files in destdir, treating
it as a "cache", and downloading again those that are stale.
If this is already solved in a different way (I couldn't find it on
install.packages source code) it would be great.
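
A minimal sketch of the staleness check I have in mind, assuming the
repository index is available as a character matrix with "Package",
"Version" and "MD5sum" columns (as I understand available.packages() to
return). The helper stale_tarballs() is hypothetical:

```r
# Hypothetical helper (not part of R): flag local tarballs that are missing
# or whose MD5 sum differs from the one listed in the repository index.
stale_tarballs <- function(available, destdir) {
  pkgs <- unname(available[, "Package"])
  expected <- unname(available[, "MD5sum"])
  files <- file.path(
    destdir,
    paste0(pkgs, "_", unname(available[, "Version"]), ".tar.gz")
  )
  # tools::md5sum() gives NA for files that cannot be read (e.g. missing).
  local_md5 <- unname(tools::md5sum(files))
  # A missing file, a missing checksum or a mismatch all count as stale.
  stale <- is.na(local_md5) | is.na(expected) | local_md5 != expected
  stats::setNames(stale, pkgs)
}
```

Only the packages flagged TRUE would need to be downloaded again; the rest
could be installed from destdir.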
Many thanks for your comments,
Lluís
>
>
> Best
> Tomas
>
> > I would appreciate feedback on these ideas.
> >
> > Best,
> >
> > Lluís Revilla
> >
> > PS: New users encountering download & installation issues often keep
> > seeing the progress bar (and in the future "trying URL 'https://...")
> > of the same packages. There are some ways to prevent/avoid repeated
> > downloads, such as, using the system library dependency resolver, or
> > having local mirrors. But they are not easy/available for new useRs,
> > and sometimes they are difficult to avoid (like having a reliable
> > internet connection).
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel