[Rd] [RFC] A case for freezing CRAN

Mon Mar 24 11:28:54 CET 2014

>>>>> Hervé Pagès <hpages at fhcrc.org>
>>>>>     on Thu, 20 Mar 2014 15:23:57 -0700 writes:

    > On 03/20/2014 01:28 PM, Ted Byers wrote:
    >> On Thu, Mar 20, 2014 at 3:14 PM, Hervé Pagès
    >> <hpages at fhcrc.org> wrote:
    >> 
    >> On 03/20/2014 03:52 AM, Duncan Murdoch wrote:
    >> 
    >> On 14-03-20 2:15 AM, Dan Tenenbaum wrote:
    >> 
    >> 
    >> 
    >> ----- Original Message -----
    >> 
    >> From: "David Winsemius" <dwinsemius at comcast.net>
    >> To: "Jeroen Ooms" <jeroen.ooms at stat.ucla.edu>
    >> Cc: "r-devel" <r-devel at r-project.org>
    >> Sent: Wednesday, March 19, 2014 11:03:32 PM
    >> Subject: Re: [Rd] [RFC] A case for freezing CRAN
    >> 
    >> 
    >> On Mar 19, 2014, at 7:45 PM, Jeroen Ooms wrote:
    >> 
    >> On Wed, Mar 19, 2014 at 6:55 PM, Michael Weylandt
    >> <michael.weylandt at gmail.com> wrote:
    >> 
    >> Reading this thread again, is it a fair summary of your
    >> position to say "reproducibility by default is more
    >> important than giving users access to the newest bug
    >> fixes and features by default?"  It's certainly arguable,
    >> but I'm not sure I'm convinced: I'd imagine that the
    >> ratio of new work being done vs reproductions is rather
    >> high and the current setup optimizes for that already.
    >> 
    >> 
    >> I think that separating development from released
    >> branches can give us both reliability/reproducibility
    >> (stable branch) as well as new features (unstable
    >> branch). The user gets to pick (and you can pick
    >> both!). The same is true for r-base: when using a
    >> 'released' version you get 'stable' base packages that
    >> are up to 12 months old. If you want to have the latest
    >> stuff you download a nightly build of r-devel.  For
    >> regular users and reproducible research it is recommended
    >> to use the stable branch. However if you are a developer
    >> (e.g. package author) you might want to
    >> develop/test/check your work with the latest r-devel.
    >> 
    >> I think that extending the R release cycle to CRAN would
    >> result both in more stable released versions of R, as
    >> well as more freedom for package authors to implement
    >> rigorous change in the unstable branch.  When writing a
    >> script that is part of a production pipeline, or a Sweave
    >> paper that should be reproducible 10 years from now, or a
    >> book on using R, you use a stable version of R, which is
    >> guaranteed to behave the same over time. However when
    >> developing packages that should be compatible with the
    >> upcoming release of R, you use r-devel which has the
    >> latest versions of other CRAN and base packages.
    >> 
    >> 
    >> 
    >> As I remember ... The example demonstrating the need for
    >> this was an XML package that caused an extract from a
    >> website where the headers were misinterpreted as data in
    >> one version of pkg:XML and not in another. That seems
    >> fairly unconvincing. Data cleaning and validation is a
    >> basic task of data analysis. It also seems excessive to
    >> assert that it is the responsibility of CRAN to maintain
    >> a synced binary archive that will be available in ten
    >> years.
    >> 
    >> 
    >> 
    >> CRAN already does this, the bin/windows/contrib directory
    >> has subdirectories going back to 1.7, with packages dated
    >> October 2004. I don't see why it is burdensome to
    >> continue to archive these.  It would be nice if source
    >> versions had a similar archive.
    >> 
    >> 
    >> The bin/windows/contrib directories are updated every day
    >> for active R versions.  It's only when Uwe decides that a
    >> version is no longer worth active support that he stops
    >> doing updates, and it "freezes".  A consequence of this
    >> is that the snapshots preserved in those older
    >> directories are unlikely to match what someone who keeps
    >> up to date with R releases is using.  Their purpose is to
    >> make sure that those older versions aren't completely
    >> useless, but they aren't what Jeroen was asking for.
    >> 
    >> 
    >> But it is almost completely useless from a
    >> reproducibility point of view to get random package
    >> versions. For example if some people try to use R-2.13.2
    >> today to reproduce an analysis that was published 2 years
    >> ago, they'll get Matrix 1.0-4 on Windows, Matrix 1.0-3 on
    >> Mac, and Matrix 1.1-2-2 on Unix. And none of them of
    >> course is what was used by the authors of the paper (they
    >> used Matrix 1.0-1, which is what was current when they
    >> ran their analysis).
    >> 
    >> Initially this discussion brought back nightmares of DLL
    >> hell on Windows.  Those as ancient as I will remember
    >> that well.  But now, the focus seems to be on
    >> reproducibility, but with what strikes me as a seriously
    >> flawed notion of what reproducibility means.
    >> 
    >> Herve Pages mentions the risk of irreproducibility across
    >> three minor revisions of version 1.0 of Matrix.

    > If you use R-2.13.2, you get Matrix 1.1-2-2 on
    > Linux. 

No way!  Matrix 1.1-2-2 has  Depends: R (>= 2.15.2)

    > AFAIK this is the most recent version of Matrix,
    > aimed to be compatible with the most current version of R
    > (i.e. R 3.0.3). However, it has never been tested with R-2.13.2.

Exactly. And for this reason, I have adopted the practice of keeping
	 Depends: R (>= ...)
in Matrix and, in part, in other packages I maintain.
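
To illustrate what such a Depends: line buys, here is a minimal sketch in R.
It only compares version numbers quoted in this thread; meets_depends() is a
hypothetical helper made up for this example, not a function from base R or
from any package, and the real check is done by install.packages() itself.

```r
# Minimum R version declared by Matrix 1.1-2-2, i.e. "Depends: R (>= 2.15.2)".
min_r_for_matrix <- package_version("2.15.2")

# The two R versions discussed in this thread.
old_r     <- package_version("2.13.2")
current_r <- package_version("3.0.3")

# Hypothetical helper: does a given R version satisfy the declared minimum?
meets_depends <- function(r_version, min_required) {
  r_version >= min_required
}

print(meets_depends(old_r, min_r_for_matrix))      # R 2.13.2 vs. >= 2.15.2
print(meets_depends(current_r, min_r_for_matrix))  # R 3.0.3  vs. >= 2.15.2
```

Under this comparison R 2.13.2 fails the requirement, which is why it should
not be offered Matrix 1.1-2-2.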

Doing so does prevent users of old versions of R from getting new
features and, even more importantly, from getting the latest (few, of
course ! ;-) bug-fixes for Matrix.

But apart from this short note,
I'm very sympathetic to optionally providing easier (not
"easy") ways of setting up old versions of R and packages,
where users can pretty quickly use the printed (unfortunately,
for now) output of sessionInfo() to reinstall
1) the version of R, and
2) the corresponding packages (in their correct versions), via
   an install.packages() call which tries (!) to get them from
   CRAN (including ./Archive/ !),

similarly to what Duncan Murdoch has agreed to.
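
A rough sketch of step 2), under stated assumptions: the package tokens are
typed in by hand in the "<package>_<version>" form that sessionInfo() prints,
and older source tarballs are assumed to sit under src/contrib/Archive/ on a
CRAN mirror. parse_pkg() and archive_url() are hypothetical helpers written
for this example only. The current version of a package lives in src/contrib/
rather than in Archive/, so a real tool would have to try both places.

```r
# Tokens in the form printed by sessionInfo(); these two versions are the
# ones mentioned in this thread.
printed <- c("Matrix_1.0-1", "ape_2.7-3")

# Hypothetical helper: split "<package>_<version>" at the first underscore
# (package names themselves cannot contain an underscore).
parse_pkg <- function(token) {
  data.frame(
    package = sub("_.*$", "", token),
    version = sub("^[^_]*_", "", token),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: where an archived source tarball is assumed to live.
archive_url <- function(package, version,
                        cran = "https://cran.r-project.org") {
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz",
          cran, package, package, version)
}

pkgs <- parse_pkg(printed)
urls <- archive_url(pkgs$package, pkgs$version)
print(urls)

# Not run here (needs network access and build tools); in recent versions
# of R each URL could be tried with
#   install.packages(url, repos = NULL, type = "source")
```

This only builds the URLs; it does not resolve dependencies or check that the
requested version is compatible with the running version of R.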

    > I'm not saying that it should, that would be a
    > big waste of resources of course. All I'm saying is that
    > it doesn't make sense to serve by default a version that
    > is known to be incompatible with the version of R being
    > used. It's very likely to not even install properly.

    [..............]

    > Also note that back in October 2011, people using R-2.13.2
    > would get e.g. ape 2.7-3 on Linux, Windows and
    > Mac. Wouldn't it make sense that people using R-2.13.2
    > today get the same? Why would anybody use R-2.13.2 today
    > if it's not to run again some code that was written and
    > used two years ago to obtain some important results?

I also tend to agree that it would be great if someone (Karl
Millar -> Google ?) would set up a good time-stamping system for
CRAN {and Bioconductor and Omegahat and ..?} packages.
Ideally that system would work by *using* the CRAN (and ..)
infrastructure.
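
As a sketch of what such a time-stamping lookup might do: given a table of
release dates, pick the version that was current on a given day. Everything
below is made up for illustration. The package "pkgA", its versions and its
dates are invented, and version_as_of() is a hypothetical helper, not an
existing CRAN or R facility.

```r
# Invented release history for a hypothetical package "pkgA".
releases <- data.frame(
  package   = c("pkgA", "pkgA", "pkgA"),
  version   = c("1.0-1", "1.0-4", "1.1-2"),
  published = as.Date(c("2011-09-01", "2012-03-15", "2013-11-20")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the latest version published on or before 'date',
# or NA if the package had not been released yet.
version_as_of <- function(releases, pkg, date) {
  date <- as.Date(date)
  hits <- releases[releases$package == pkg & releases$published <= date, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$version[which.max(as.numeric(hits$published))]
}

print(version_as_of(releases, "pkgA", "2011-10-15"))
print(version_as_of(releases, "pkgA", "2014-03-24"))
```

With real data, the table could presumably be derived from the publication
dates of the tarballs already kept on CRAN and in its Archive directories.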

    > Cheers, H.

I'm still unsure if I should agree with you (Hervé) that some
freezing / "data base of package timestamps" should
happen on-CRAN in addition.

Martin