[Rd] [RFC] A case for freezing CRAN

Rainer M Krug Rainer at krugs.de
Thu Mar 20 09:49:12 CET 2014


Michael Weylandt <michael.weylandt at gmail.com> writes:

> On Mar 19, 2014, at 22:17, Gavin Simpson <ucfagls at gmail.com> wrote:
>
>> Michael,
>> 
>> I think the issue is that Jeroen wants to take that responsibility out
>> of the hands of the person trying to reproduce a work. If it used R
>> 3.0.x and packages A, B and C then it would be trivial to install
>> that version of R and then pull down the stable versions of A B and C
>> for that version of R. At the moment, one might note the packages used
>> and even their versions, but what about the versions of the packages
>> that the used packages rely upon & so on? What if developers don't
>> state known working versions of dependencies?
>
> Doesn't sessionInfo() give all of this?
>
> If you want to be very worried about every last bit, I suppose it
> should also include options(), compiler flags, compiler version, BLAS
> details, etc.  (Good talk on the dregs of a floating point number and
> how hard it is to reproduce them across processors
> http://www.youtube.com/watch?v=GIlp4rubv8U)

In principle yes - but this calls for a package which extracts the info
and stores it in a human readable format, which can then be used to
re-install (automatically) all the recorded versions for (hopefully)
reproducibility. I say hopefully because once external libraries are
involved, you HAVE problems.

>
>> 
>> The problem is how the heck do you know which versions of packages are
>> needed if developers don't record these dependencies in sufficient
>> detail? The suggested solution is to freeze CRAN at intervals
>> alongside R releases. Then you'd know what the stable versions were.
>
> Only if you knew which R release was used. 

Well - that would be easier to specify in a paper than the version info
of all packages needed - and which of the installed ones are actually
needed? OK - the ones specified in library() calls. But wait - there
are dependencies, imports, ... That is a lot of digging - I would not
know how to do this off the top of my head, except by digging through
the DESCRIPTION files of the packages...
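
For what it is worth, the digging could probably be automated. A minimal
sketch, assuming tools::package_dependencies() is available and that the
locally installed library (installed.packages()) is an acceptable stand-in
for the repository state; all_deps() is a hypothetical helper name, not an
existing function:

```r
# Hypothetical helper: collect the given packages plus everything they
# depend on, import or link to, recursively.
# Assumption: the dependency database is taken from the local library via
# installed.packages(); passing the result of available.packages() as `db`
# would query a repository instead.
all_deps <- function(pkgs, db = installed.packages()) {
  deps <- tools::package_dependencies(
    pkgs,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  # Flatten the per-package list into one sorted, de-duplicated vector.
  sort(unique(c(pkgs, unlist(deps, use.names = FALSE))))
}

# Example: everything needed by the packages named in library() calls.
needed <- all_deps(c("stats", "utils"))
print(needed)
```

This only lists names; the versions still have to be recorded separately.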

>
>> 
>> Or we could just get package developers to be more thorough in
>> documenting dependencies. Or R CMD check could refuse to pass if a
>> package is listed as a dependency but with no version qualifiers. Or
>> have R CMD build add an upper bound (from the current, at build-time
>> version of dependencies on CRAN) if the package developer didn't
>> include an upper bound. Or... The first is unlikely to happen
>> consistently, and no-one wants *more* checks and hoops to jump through
>> :-)
>> 
>> To my mind it is incumbent upon those wanting reproducibility to build
>> the tools to enable users to reproduce works.
>
> But the tools already allow it with minimal effort. If the author
> can't even include session info, how can we be sure the version of R
> is known? If we can't know which version of R, can we ever change R at
> all? Etc to absurdity.
>
> My (serious) point is that the tools are in place, but ramming them
> down folks' throats by intentionally keeping them on older versions by
> default is too much.
>
>> When you write a paper
>> or release a tool, you will have tested it with a specific set of
>> packages. It is relatively easy to work out what those versions are
>> (there are tools in R for this). What is required is an automated way
>> to record that info in an agreed upon way in an approved
>> file/location, and have a tool that facilitates setting up a package
>> library sufficient with which to reproduce a work. That approval
>> doesn't need to come from CRAN or R Core - we can store anything in
>> ./inst.
>
> I think the package version and published paper cases are different. 
>
> For the latter, the recipe is simple: if you want the same results,
> use the same software (as noted by sessionInfoPlus() or equiv)

Dependencies, imports, package versions, ... not that straightforward I
would say.

>
> For the former, I think you start straying into this NP complete problem: http://people.debian.org/~dburrows/model.pdf 
>
> Yes, a good config can (and should be recorded) but isn't that exactly what sessionInfo() gives?
>
>> 
>> Reproducibility is a very important part of doing "science", but not
>> everyone using CRAN is doing that. Why force everyone to march to the
>> reproducibility drum? I would place the onus elsewhere to make this
>> work.
>> 
>
> Agreed: reproducibility is the onus of the author, not the reader

Exactly - but it is also the onus of the authors of software aimed at
reproducible work: the tools should be there to make it easy!

My points are:

1) I think snapshotting CRAN is a good idea which should be pursued.
2) The snapshots should be hosted at CRAN, as I assume that CRAN will be
there longer than any third party repository.
3) The default for the user should *not* change, i.e. normal users will
always get the newest packages, as they do now.
4) If this can not / will not be done because of workload, storage space,
... then commands should be incorporated in a package (preferably one
which becomes part of the core packages) to store a snapshot of the
installed packages and the R version as a human readable text file,
which a second command can parse to re-create this setup.
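
To make point 4 concrete, here is a rough sketch of such a pair of
commands. The function names (write_snapshot, read_snapshot,
check_snapshot) and the "name version" line format are my assumptions for
illustration, not an existing API, and the sketch only reports mismatches;
it does not re-install anything:

```r
# Hypothetical command 1: store R version and package versions as plain text,
# one "name version" pair per line, with R itself on the first line.
write_snapshot <- function(file, pkgs) {
  versions <- vapply(pkgs, function(p) as.character(packageVersion(p)),
                     character(1))
  writeLines(c(paste("R", as.character(getRversion())),
               paste(pkgs, versions)), file)
  invisible(file)
}

# Hypothetical command 2: parse the text file back into a data frame.
read_snapshot <- function(file) {
  fields <- strsplit(readLines(file), " ", fixed = TRUE)
  data.frame(name    = vapply(fields, `[`, character(1), 1),
             version = vapply(fields, `[`, character(1), 2),
             stringsAsFactors = FALSE)
}

# Report the entries whose recorded version differs from what is installed
# now (or which are not installed at all); these would need re-installing.
check_snapshot <- function(snap) {
  current <- vapply(snap$name, function(p) {
    if (p == "R") {
      as.character(getRversion())
    } else {
      tryCatch(as.character(packageVersion(p)),
               error = function(e) NA_character_)
    }
  }, character(1))
  mismatch <- unname(is.na(current) | current != snap$version)
  snap[mismatch, , drop = FALSE]
}

f <- tempfile(fileext = ".txt")
write_snapshot(f, c("stats", "utils"))
snap <- read_snapshot(f)
print(snap)
print(check_snapshot(snap))  # no rows when nothing has changed
```

A real tool would feed the mismatching rows to an installer that can fetch
specific versions, which is exactly where the snapshots would help.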

Cheers, and thanks for this important discussion (could have been a GSoC
project?),

Rainer


>
>
>> Gavin
>> A scientist, very much interested in reproducibility of my work and others.
>
> Michael
> In finance, where we call it "Auditability" and care very much as well :-)
>
>
>

-- 
Rainer M. Krug
email: Rainer<at>krugs<dot>de
PGP: 0x0F52F982