[Rd] Proposal to limit Internet access during package load
Simon Urbanek
simon.urbanek using r-project.org
Mon Sep 26 21:49:43 CEST 2022
> On Sep 27, 2022, at 8:25 AM, Iñaki Ucar <iucar using fedoraproject.org> wrote:
>
> On Sat, 24 Sept 2022 at 01:55, Simon Urbanek
> <simon.urbanek using r-project.org> wrote:
>>
>> Iñaki,
>>
>> I fully agree, this is a very common issue, since the vast majority of server deployments I have encountered don't allow Internet access. In practice this means that such packages are effectively banned.
>>
>> I would argue that not even (1) or (2) are really an issue, because in fact the CRAN policy doesn't impose any absolute limits on size; it only states that the package should be "of minimum necessary size", which means it shouldn't waste space. If there is no way to reduce the size without impacting functionality, it's perfectly fine.
>
> "Packages should be of the minimum necessary size" is subject to
> interpretation. And in practice, there is an issue with e.g. packages
> that "bundle" big third-party libraries. There are also packages that
> require downloading precompiled code, JARs... at installation time.
>
JARs are part of the package, so that's a valid use, no question there; that's how Java packages do this already.
Downloading pre-compiled binaries is something that shouldn't be done and is a whole can of worms (since those are not sources and they *are* specific to the platform, OS, etc.). That is an entirely separate issue, worth a separate discussion. So I still don't see any use cases for actual sources. I do see a need for better specification of external dependencies which are not part of the package, such that those can be satisfied automatically, but that's not the problem you asked about.
>> That said, there are exceptions such as very large datasets (e.g., as distributed by Bioconductor) which are orders of magnitude larger than what is sustainable. I agree that it would be nice to have a mechanism for specifying such sources. So yes, I like the idea, but I'd like to see more real use cases to justify the effort.
>
> "More real use cases" like in "more use cases" or like in "the
> previous ones are not real ones"? :)
>
>> The issue with any online downloads, though, is that there is no guarantee of availability, which is a real issue for reproducibility. So one could argue that if such external sources are required then they should be on a well-defined, independent, permanent storage such as Zenodo. This could be a matter of policy, as opposed to the technical side above, which would be adding such support to R CMD INSTALL.
>
> Not necessarily. If the package declares the additional sources in the
> DESCRIPTION (probably with hashes), that's a big improvement over the
> current state of things, in which basically we don't know what the
> package tries to download, then it may fail, and finally there's no
> guarantee that it's what the author intended in the first place.
>
> But on top of this, R could add a CMD to download those, and then some
> lookaside storage could be used on CRAN. This is e.g. how RPM
> packaging works: the spec declares all the sources, they are
> downloaded once, hashed and stored in a lookaside cache. Then package
> building doesn't need general Internet connectivity, just access to
> the cache.
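The lookaside idea described above can be sketched as follows, in Python for illustration only. This is not how RPM or CRAN implement anything; the directory layout (`<cache>/<sha256>/<name>`) and the function names are assumptions. The point is that a source is accepted only if its hash matches the one the package declared:

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_in_cache(cache_dir, name, data):
    """Store bytes under <cache_dir>/<sha256>/<name> and return the hash.

    Hypothetical layout: keying on the hash lets two versions of a file
    with the same name coexist in the cache.
    """
    digest = hashlib.sha256(data).hexdigest()
    target = Path(cache_dir) / digest / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return digest


def lookup_in_cache(cache_dir, name, expected_sha256):
    """Return the cached path if present and its hash matches, else None."""
    candidate = Path(cache_dir) / expected_sha256 / name
    if candidate.is_file() and sha256_of(candidate) == expected_sha256:
        return candidate
    return None
```

A builder without general Internet access would call `lookup_in_cache` with the hash declared by the package and fail early if it returns `None`.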
>
Sure, I fully agree that it would be a good first step, but I'm still waiting for examples ;).
Cheers,
Simon
> Iñaki
>
>>
>> Cheers,
>> Simon
>>
>>
>>> On Sep 24, 2022, at 3:22 AM, Iñaki Ucar <iucar using fedoraproject.org> wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to open this debate here, because IMO this is a big issue.
>>> Many packages do this for various reasons, some more legitimate than
>>> others, but I think that this shouldn't be allowed, because it
>>> basically means that installation fails on a machine without Internet
>>> access (which happens e.g. in Linux distro builders for security
>>> reasons).
>>>
>>> Now, what if connections are suppressed during package load? There are
>>> basically three use cases out there:
>>>
>>> (1) The package requires additional files for the installation (e.g.
>>> the source code of an external library) that cannot be bundled into
>>> the package due to CRAN restrictions (size).
>>> (2) The package requires additional files for using it (e.g.,
>>> datasets, a JAR...) that cannot be bundled into the package due to
>>> CRAN restrictions (size).
>>> (3) Other spurious reasons (e.g. the maintainer decided that package
>>> load was a good place to check an online service availability, etc.).
>>>
>>> Again IMO, (3) shouldn't be allowed in any case; (2) should be a
>>> separate function that the user actively calls to download the files,
>>> and those files should be placed into the user dir; and (1) is the
>>> only legitimate use, but then another mechanism should be provided to
>>> avoid connections during package load.
>>>
>>> My proposal to support (1) would be to add a new field in the
>>> DESCRIPTION, "Additional_sources", which would be a comma-separated
>>> list of additional resources to download during R CMD INSTALL. Those
>>> sources would be downloaded by R CMD INSTALL if not provided via an
>>> option (to support offline installations), and would be placed in a
>>> predefined place for the package to find and configure them (via an
>>> environment variable or in a predefined subdirectory).
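A minimal sketch of the resolution logic described above, written in Python purely for illustration. R CMD INSTALL has no such feature; "Additional_sources" is the field proposed here, while the function names, the `offline_dir` option, and the example.org URLs are hypothetical. The sketch parses the field from a DESCRIPTION-style text and decides, for each source, whether a locally provided copy can be used or a download would be needed:

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical DESCRIPTION content; the URLs are placeholders.
EXAMPLE_DESCRIPTION = """
Package: foo
Version: 0.1.0
Additional_sources: https://example.org/libfoo-1.2.tar.gz,
    https://example.org/data.zip
"""


def parse_additional_sources(description_text):
    """Return the entries of the proposed 'Additional_sources' field.

    DESCRIPTION files follow the 'Field: value' layout, with
    continuation lines starting with whitespace.
    """
    fields = {}
    current = None
    for line in description_text.splitlines():
        if line[:1] in (" ", "\t") and current is not None:
            # Continuation of the previous field
            fields[current] += " " + line.strip()
        elif ":" in line:
            current, value = line.split(":", 1)
            current = current.strip()
            fields[current] = value.strip()
    raw = fields.get("Additional_sources", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


def plan_sources(urls, offline_dir=None):
    """For each source, use a local copy if provided, else plan a download."""
    plan = []
    for url in urls:
        name = Path(urlparse(url).path).name
        local = Path(offline_dir) / name if offline_dir else None
        if local is not None and local.is_file():
            plan.append((name, "local", str(local)))
        else:
            plan.append((name, "download", url))
    return plan
```

When `offline_dir` points at a directory that already contains every listed file, the plan contains no download entries, which corresponds to the offline installation case.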
>>>
>>> This proposal has several advantages. Apart from the obvious one
>>> (Internet access during package load can be limited without losing
>>> current functionalities), it gives more visibility to the resources
>>> that packages are using during the installation phase, and thus makes
>>> those installations more reproducible and more secure.
>>>
>>> Best,
>>> --
>>> Iñaki Úcar
>>>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>
>
> --
> Iñaki Úcar
>
>