[Rd] [RFC] A case for freezing CRAN

Duncan Murdoch murdoch.duncan at gmail.com
Wed Mar 19 13:52:36 CET 2014


I don't see why CRAN needs to be involved in this effort at all.  A 
third party could take snapshots of CRAN at R release dates, and make 
those available to package users in a separate repository.  It is not 
hard to set a different repository than CRAN as the default location 
from which to obtain packages.
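
A minimal sketch of that on the user's side, assuming a snapshot exists
somewhere (the URL below is a made-up placeholder, not a real service,
and use_snapshot_repo is only an illustrative helper name):

```r
# Hypothetical snapshot location; this URL is a placeholder, not a real service.
snapshot_url <- "https://cran-snapshot.example.org/R-3.0.3"

# Illustrative helper: replace the "CRAN" entry of the repos option,
# leaving any other configured repositories untouched.
use_snapshot_repo <- function(url) {
  repos <- getOption("repos")
  repos["CRAN"] <- url
  options(repos = repos)
  invisible(repos)
}

use_snapshot_repo(snapshot_url)

# install.packages() and update.packages() now look here by default.
getOption("repos")[["CRAN"]]
```

The same options(repos = ...) call could be placed in a site-wide
Rprofile.site or a user's .Rprofile so that every session picks it up.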

The only objection I can see to this is that it requires extra work by 
the third party, rather than extra work by the CRAN team. I don't think 
the total amount of work required is much different.  I'm very 
unsympathetic to proposals to dump work on others.

Duncan Murdoch

On 18/03/2014 4:24 PM, Jeroen Ooms wrote:
> This came up again recently with an irreproducible paper. Below is an
> attempt to make a case for extending the r-devel/r-release cycle to
> CRAN packages. These suggestions are not in any way intended as
> criticism of anyone or the status quo.
>
> The proposal described in [1] is to freeze a snapshot of CRAN along
> with every release of R. In this design, updates for contributed
> packages are treated the same as updates for base packages, in the
> sense that they are only published to the r-devel branch of CRAN and do
> not affect users of "released" versions of R. As a result, all users,
> stacks and applications using a particular version of R will by default
> be using the identical version of each CRAN package. The Bioconductor
> project uses similar policies.
>
> This system has several important advantages:
>
> ## Reproducibility
>
> Currently R/Sweave/knitr scripts are unstable because of ambiguity
> introduced by constantly changing CRAN packages. Scripts break or
> change behavior when upstream packages are updated, which makes
> reproducing old results extremely difficult.
>
> A common counter-argument is that script authors should document
> package versions used in the script using sessionInfo(). However, even
> if authors did this manually, reconstructing the author's environment
> from this information is cumbersome and often nearly impossible,
> because binary packages might no longer be available, dependency
> conflicts may arise, etc. See [1] for a worked example. In practice,
> the current system causes many results or documents generated with R
> not to be reproducible, sometimes after only a few months.
>
> In a system where contributed packages inherit the r-base release
> cycle, scripts will behave the same across users/systems/time within a
> given version of R. This greatly reduces ambiguity in R behavior, and
> has the potential to make reproducibility a natural part of the
> language, rather than a tedious exercise.
>
> ## Repository Management
>
> Just like scripts suffer from upstream changes, so do packages
> depending on other packages. A particular package that has been
> developed and tested against the current version of a particular
> dependency is not guaranteed to work against *any future version* of
> that dependency. Therefore, packages inevitably break over time as
> their dependencies are updated.
>
> One recent example is the Rcpp 0.11 release, which required all
> reverse dependencies to be rebuilt/modified. This update caused some
> serious disruption on our production servers. Initially we refrained
> from updating Rcpp on these servers to keep currently installed
> packages that depend on Rcpp from breaking. However, soon after the
> Rcpp 0.11 release, many other CRAN packages started to require Rcpp >=
> 0.11, and our users started complaining about not being able to
> install those packages. This resulted in the impossible situation
> where currently installed packages would not work with the new Rcpp,
> but newly installed packages would not work with the old Rcpp.
>
> Current CRAN policies blame this problem on package authors. However,
> as explained in [1], this policy does not solve anything, is
> unsustainable with growing repository size, and sets completely the
> wrong incentives for contributing code. Progress comes with breaking
> changes, and the system should be able to accommodate this. Much of
> the trouble could have been prevented by a system that does not push
> bleeding-edge updates straight to end users, but has a devel branch
> where conflicts are resolved before the updates are published in the
> next r-release.
>
> ## Reliability
>
> Another example, this time on a very small scale. We recently
> discovered that R code plotting medal counts from the Sochi Olympics
> generated different results for users on OS X than it did on
> Linux/Windows. After some debugging, we narrowed it down to the XML
> package. The application used the following code to scrape results
> from the Sochi website:
>
> XML::readHTMLTable("http://www.sochi2014.com/en/speed-skating", which=2, skip=1)
>
> This code was developed and tested on a Mac, but results in a
> different winner on Windows/Linux. This happens because the current
> version of the XML package on CRAN is 3.98, but the latest Mac binary
> is 3.95. Apparently the newer version of XML introduces a tiny change
> that causes HTML table headers to become colnames, rather than a row
> in the matrix, resulting in different medal counts.
>
> This example illustrates that we should never assume package versions
> to be interchangeable. Any small bugfix release can have side effects
> altering results. It is impossible to protect code against such
> upstream changes using CMD check or unit testing. All R scripts and
> packages are really only developed and tested for a single version of
> their dependencies. Assuming anything else makes results
> untrustworthy, and code unreliable.
>
> ## Summary
>
> Extending the r-release cycle to CRAN seems like a solution that would
> be easy to implement. Package updates would simply be pushed only to
> the r-devel branch of CRAN, rather than to r-release and
> r-release-old. This separates development from production/use in a way
> that is common sense in most open source communities. Benefits for R
> include:
>
> - Regular R users (statisticians, researchers, students, teachers) can
> share their homemade scripts/documents/packages and rely on them to
> work and produce the same results within a given version of R, without
> manual efforts to manage package versions.
>
> - Package authors can publish breaking changes to the devel branch
> without causing major disruption or affecting users and/or
> maintainers. Authors of dependent packages have a timeframe to sync
> their package with upstream changes before the next release.
>
> - CRAN maintainers can focus quality control and testing efforts on
> the devel branch around the time of the code freeze. No need for
> crisis management when a package update introduces some severe
> breaking changes. Users of released versions are unaffected.
>
>
> [1] http://journal.r-project.org/archive/2013-1/ooms.pdf
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


