[Rd] Suggestion: Install packages on non-appendable file systems (e.g. databricks volumes)
Tomas Kalibera
tomas.kalibera at gmail.com
Wed Mar 26 21:34:42 CET 2025
On 3/26/25 17:47, Sergio Oller wrote:
> Hello,
>
> I would like to submit a patch to R. Following 5 Submitting Feature
> Requests – R Development Guide
> <https://contributor.r-project.org/rdevguide/chapters/submitting_feature_requests.html>,
> I would like to ask for feedback before proceeding with a ¿formal?
> submission on bugzilla. It's my first attempt contributing to R and I do
> not currently have a bugzilla account.
>
> I am working at a company, and we use R with databricks. We want to install
> some packages on a distributed filesystem that is not fully POSIX
> compliant, as it does not support opening files in append mode. In C terms,
> `fopen(filename, "a")` gives an error. I guess other distributed file
> systems beyond the ones in databricks may have issues with append mode as
> well.
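
A quick way to check whether a given directory accepts append-mode writes is to probe it from R. The following is only a minimal sketch; `supports_append()` is a hypothetical helper name, not an existing R function, and it treats any warning or error during the probe as "append not supported".

```r
# Hypothetical helper: returns TRUE if a file in `dir` can be written
# and then appended to, FALSE otherwise.
supports_append <- function(dir) {
  probe <- tempfile("append-probe-", tmpdir = dir)
  on.exit(unlink(probe), add = TRUE)            # clean up the probe file
  tryCatch({
    cat("first\n", file = probe)                  # plain write
    cat("second\n", file = probe, append = TRUE)  # append-mode write
    identical(readLines(probe), c("first", "second"))
  },
  warning = function(w) FALSE,
  error = function(e) FALSE)
}

supports_append(tempdir())
```

Pointing it at the library directory on the distributed filesystem would show whether that location is affected.
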
>
> Our current workaround is to install all packages on a local folder, and
> then copy/move the folder to the distributed file system.

This is something we try to keep working in R where possible, so that
users can relocate installed packages by moving their installation
directories. If this practice works for you, it is probably fine.

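The workaround described above (install into a local library, then copy the installed package directories to the target library) could be wrapped roughly as follows. This is a sketch under assumptions: `install_via_local_lib()` and its `install_fun` argument are hypothetical names introduced here so that the copy step can be exercised without network access; only `install.packages()`, `list.dirs()` and `file.copy()` are actual R functions.

```r
# Hypothetical wrapper around the "install locally, then copy" workaround.
install_via_local_lib <- function(pkgs, target_lib,
                                  local_lib = tempfile("local-lib-"),
                                  install_fun = utils::install.packages) {
  dir.create(local_lib, recursive = TRUE, showWarnings = FALSE)
  dir.create(target_lib, recursive = TRUE, showWarnings = FALSE)

  # Locking, staging and any append-mode writes happen on the local filesystem.
  install_fun(pkgs, lib = local_lib)

  # Copy every installed package directory into the target library.
  installed <- list.dirs(local_lib, full.names = TRUE, recursive = FALSE)
  ok <- file.copy(installed, target_lib, recursive = TRUE)
  stats::setNames(ok, basename(installed))
}
```

A call such as `install_via_local_lib("somepkg", "/path/to/target/library")` (both arguments are placeholders) returns a named logical vector saying which package directories were copied. The copy itself is not atomic, so other R sessions should not load packages from the target library while it is running.
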
Currently, installing a binary package just means unpacking it into the
target directory. You could probably achieve the same via binary packages:
build them on a local filesystem, and then install them to the non-POSIX
filesystem (provided the unpacking/installation works on such a
filesystem). If the installation of a binary package doesn't work there
but could be made to work (possibly as an option), that might be of
interest.

> If I understand package installation correctly, when a package is
> installed, the installation happens inside a 00LOCK directory, and then the
> outcome is moved to the final destination.
>
> The contribution I would like to submit allows users/sysadmins to set an
> environment variable named PKG_LOCKDIR_PREFIX, that defines the location
> where the "00LOCK-" directories are created. The patch is backwards
> compatible and it consists of +28,-10 lines, hopefully easy enough to
> review.
>
> https://github.com/r-devel/r-svn/pull/196.diff
>
> When I use this patch, I can successfully install packages on a distributed
> file system by setting PKG_LOCKDIR_PREFIX to a directory in my local
> filesystem (R does all the file append stuff in the local file system, and
> finally copies all the package files to the distributed file system)

I am not excited about the idea of combining this with the locking
mechanism and staged installation in the described way. The current
implementation takes advantage of the fact that, on a single filesystem,
a move operation is either atomic (POSIX) or at least very fast (Windows).
Copying an installed package to a different filesystem is neither. There
is a risk that some other R session could see a partial installation of a
package. Moreover, if the library were on a distributed filesystem
accessed from different machines, there could even be corruption due to
concurrent installation from multiple machines. In principle, this could
happen even on a single machine (checking for the existence of a directory
on one filesystem and creating it on another wouldn't be atomic).

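To illustrate the point about moves, here is a small sketch of staging a directory inside the library and then renaming it into place. It is not how `R CMD INSTALL` is implemented; `move_into_place()` is a hypothetical helper. It relies on `file.rename()`, which is subject to the limitations of the operating system's rename call and on most platforms will not move files from one filesystem to another.

```r
# Illustrative only; not the actual implementation in R.
move_into_place <- function(staged, final) {
  # file.rename() returns FALSE (with a warning) when the rename fails,
  # for example when `staged` and `final` are on different filesystems.
  ok <- suppressWarnings(file.rename(staged, final))
  if (!ok) {
    stop("could not rename '", staged, "' to '", final,
         "'; are both paths on the same filesystem?")
  }
  invisible(final)
}

# Usage: stage inside the library itself, so both paths share a filesystem.
lib    <- tempfile("lib-")
staged <- file.path(lib, "00LOCK-demo", "demo")
dir.create(staged, recursive = TRUE)
writeLines("Package: demo", file.path(staged, "DESCRIPTION"))
move_into_place(staged, file.path(lib, "demo"))
```

When the staging directory is on a different filesystem, the rename is not available and the only option is a file-by-file copy, during which other sessions may observe a partially copied package.
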
Perhaps the staging/locking could be implemented in some special way on
the target filesystem, with some second-level staging and installation,
but it is questionable whether that would be worth the effort and
maintenance in base R. Also keep in mind that this could hardly be tested
regularly, as such filesystems are rare.

Best
Tomas

P.S. About staged installation:
https://developer.r-project.org/Blog/public/2019/02/14/staged-install/index.html

>
> This setting makes package installation transparent for all data
> scientists, since they may not even know that PKG_LOCKDIR_PREFIX has been
> set. Package installation just works as expected.
>
> I feel the patch has some added value over our workaround: Even if we
> implement the workaround with a simple wrapper over install.packages(), any
> third party package that depends on install.packages() (such as renv or
> others) won't use our workaround. Besides, with this patch merged any other
> R user benefits from being able to install packages in those filesystems.
>
> Any feedback is very much appreciated.
>
> Thanks for your time,
>
> Sergio