[Rd] enabling reproducible research & R package management & install.package.version & BiocLite

Tue Mar 5 00:04:25 CET 2013

Just my 2 cents: it may not be a good idea to restrict software
versions to gain reproducibility. To me, this kind of reproducibility
is "dead" reproducibility (what if the old software has a fatal bug?
do we want to reproduce the same **wrong** results?). Software
packages are continuously evolving, and our research should be adapted
as well. How to achieve this? I think this paper by Robert Gentleman
and Duncan Temple Lang has given a nice answer:
http://biostats.bepress.com/bioconductor/paper2/

With R 3.0.0 coming, it will be easy to achieve what they have
outlined because R 3.0 allows custom vignette builders. Basically,
your research paper can be built with 'R CMD build' and checked with
'R CMD check' if you provide an appropriate builder. An R package has
great potential to become the ideal tool for reproducible
research due to its wonderful infrastructure: functions, datasets,
examples, unit tests, vignettes, dependency structure, and so on. With
the help of version control, you can easily spot the changes after you
upgrade the packages. With an R package, you can automate a lot of
things, e.g. install.packages() will take care of dependencies and R
CMD build can rebuild your paper.
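
To make the "paper as a package" idea concrete, here is a minimal sketch of such a skeleton. The package name "mypaper", the maintainer, and the choice of knitr as the vignette builder are all assumptions for illustration; any engine registered for R >= 3.0.0 could take its place, and a real compendium would also carry data, functions, and tests.

```r
# Sketch only: lay out a research paper as a minimal R package skeleton.
# "mypaper", the maintainer, and knitr as builder are illustrative assumptions.
pkg <- file.path(tempdir(), "mypaper")
dir.create(file.path(pkg, "vignettes"), recursive = TRUE, showWarnings = FALSE)

desc <- list(
  Package = "mypaper",
  Version = "0.1",
  Title = "Research Paper as an R Package",
  Description = "Data, functions, and the paper itself as a vignette.",
  Author = "A. Researcher",
  Maintainer = "A. Researcher <a.researcher@example.org>",
  License = "GPL-2",
  Suggests = "knitr",
  VignetteBuilder = "knitr"   # tells R CMD build which package builds vignettes
)
write.dcf(desc, file = file.path(pkg, "DESCRIPTION"))
file.create(file.path(pkg, "NAMESPACE"))  # empty namespace: nothing exported

# The vignette (i.e. the paper) declares its engine in a comment line.
writeLines(c(
  "%\\VignetteEngine{knitr::knitr}",
  "%\\VignetteIndexEntry{The paper}",
  "\\documentclass{article}",
  "\\begin{document}",
  "<<result>>=",
  "mean(c(1, 2, 3))",
  "@",
  "\\end{document}"
), file.path(pkg, "vignettes", "paper.Rnw"))

# Read the DESCRIPTION back to confirm what was written.
built <- read.dcf(file.path(pkg, "DESCRIPTION"))
```

From a shell, 'R CMD build mypaper' followed by 'R CMD check' on the resulting tarball would then rebuild and check the paper, provided knitr and a LaTeX installation are available.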

Just like Bioc has a devel version, you can continuously check your
results in a devel version, so that you know what is going to break if
you upgrade to new versions of other packages. In the context of
computing, is developing a research paper all that different from
developing a software package? Probably not.
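
One way to "continuously check your results" is to store a reference result and compare against it after every upgrade, much like a unit test. The sketch below uses only base R; analysis() and check_result() are hypothetical names standing in for a real analysis step and its check.

```r
# Hypothetical analysis step; stands in for whatever the paper computes.
analysis <- function(x) c(mean = mean(x), sd = sd(x))

x <- c(2.1, 3.4, 1.9, 4.0, 2.8)
ref_file <- file.path(tempdir(), "reference.rds")

# Record the result obtained with the current package versions.
saveRDS(analysis(x), ref_file)

# Later, e.g. after upgrading packages: recompute and compare.
# Returns TRUE if the results agree, otherwise a description of the differences.
check_result <- function(new, ref_file, tolerance = 1e-8) {
  old <- readRDS(ref_file)
  cmp <- all.equal(old, new, tolerance = tolerance)
  if (isTRUE(cmp)) TRUE else cmp
}

ok <- check_result(analysis(x), ref_file)
```

Placed in a package's tests directory, a check like this would run under 'R CMD check', so a change in results after an upgrade shows up as a failure rather than going unnoticed.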

Long live reproducible research!

Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA

On Mon, Mar 4, 2013 at 3:13 PM, Cook, Malcolm <MEC at stowers.org> wrote:
> Hi,
>
> In support of reproducible research at my Institute, I seek an approach to re-creating the R environments in which an analysis has been conducted.
>
> By which I mean, the exact version of R and the exact version of all packages used in a particular R session.
>
> I am seeking comments/criticism of this as a goal, and of the following outline of an approach:
>
> === When all the steps of a workflow have been finalized ===
> * re-run the workflow from beginning to end
> * save the results of sessionInfo() into an RDS file named after the current date and time.
>
> === Later, when desirous of exactly recreating this analysis ===
> * read the (old) sessionInfo() into an R session
> * exit with failure if the running version of R doesn't match
> * compare the old sessionInfo to the currently available installed libraries (i.e. using packageVersion)
> * where there are discrepancies, install the required version of the package (without dependencies) into new library (named after the old sessionInfo RDS file)
>
> Then the analyst should be able to put the new library at the front of .libPaths and run the analysis, confident that the same versions of the packages are in use.
>
> I have in the past used install-package-version.R to revert to previous versions of R packages successfully (https://gist.github.com/1503736).  And there is a similar tool in Hadley Wickham's devtools.
>
> But I don't know if I need something special for Bioconductor packages that have been installed using biocLite, and I seek advice here.
>
> I do understand that the R environment is not sufficient to guarantee reproducibility.   Some of my colleagues have suggested saving a virtual machine with all your software/library/data installed. So, I am also in general interested in what other people are doing to this end.  But I am most interested in:
>
> * is this a good idea
> * is there a worked out solution
> * does biocLite introduce special cases
> * where do the dragons lurk
>
> ... and the like
>
> Any tips?
>
> Thanks,
>
> ~ Malcolm Cook
> Stowers Institute / Computation Biology / Shilatifard Lab
>
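
For reference, the record-and-compare part of the workflow outlined in the quoted message could be sketched roughly as below. This is an illustration under assumptions, not a worked-out solution: snapshot_versions() and find_discrepancies() are hypothetical helper names, the snapshot stores a reduced form of sessionInfo() rather than the object itself, and installing the old versions (and any biocLite special cases) is left out.

```r
# Collect versions of attached and loaded-only (non-base) packages.
snapshot_versions <- function() {
  si <- sessionInfo()
  pkgs <- c(si$otherPkgs, si$loadedOnly)
  vapply(pkgs, function(p) p$Version, character(1))
}

# Save the R version and package versions in a file named after date and time.
snap <- list(r_version = as.character(getRversion()),
             packages  = snapshot_versions())
snap_file <- file.path(tempdir(),
                       format(Sys.time(), "sessionInfo-%Y%m%d-%H%M%S.rds"))
saveRDS(snap, snap_file)

# Later: fail on an R version mismatch, then return the recorded versions of
# packages that are missing or installed at a different version.
find_discrepancies <- function(snap_file) {
  old <- readRDS(snap_file)
  if (old$r_version != as.character(getRversion()))
    stop("R version mismatch: recorded ", old$r_version,
         ", running ", as.character(getRversion()))
  installed <- installed.packages()[, "Version"]
  current <- installed[names(old$packages)]   # NA where a package is missing
  differs <- is.na(current) | current != old$packages
  old$packages[differs]
}

todo <- find_discrepancies(snap_file)
```

The names in 'todo' are the packages that would need the recorded version installed into a separate library before it is put at the front of .libPaths().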