[Bioc-devel] RFC: Bioc repository for single-version packages

Wolfgang Huber wolfgang.huber using embl.de
Tue Nov 13 11:08:30 CET 2018


To summarize some further discussions we had on this, with the bottom 
line that it needs more thought:

The proposal amounts to establishing a third generation of media used 
for academic publishing:
1. Printed paper (since 1665)
2. Portable document format (PDF) files (since 1990s)
3. Executable documents that contain data, code and text

While there are obvious small-scale solutions for 3., incl. those 
sketched by Martin and me, doing this well carries the same requirements 
and aspirations for scalability, scope, durability, and time-unlimited 
support that we take for granted for 1. and 2. There are millions of 
papers published per year across many disciplines of science.

Besides the technical challenges there are economic and organizational 
ones. The publishing industry should also have a role to play, although 
of course this is a fluid area.

There are relevant existing efforts, incl. this incomplete list:

Binders (documents with containers):
https://mybinder.readthedocs.io/en/latest/examples.html

Jupyter/RStudio interfaces to published datasets and results:
https://wholetale.org/index.html

The Pachyderm framework for running pipelines on archived data, with
provenance tracking:
http://www.pachyderm.io/

Code Ocean
https://codeocean.com/


---
Thanks to Martin Morgan and Michael Lawrence for input.


5.11.18 23:17, Martin Morgan scripsit:
> This is a continuation of the discussion at
> 
> https://support.bioconductor.org/p/114814/#114824
> 
> Where Wolfgang asks about "creating a corner in the Bioconductor package ecosystem for packages that are only ever supposed to build and check with a single release"
> 
> I think this would be quite challenging to implement correctly, for instance ensuring that the user of an appropriate version of R can easily install the intended dependencies, and what exactly it means for a package to be restricted to a single release, e.g., CRAN packages are updated without versioned releases [I mean, a user of Bioc 3.7 will get the current version of the CRAN package, not the version that was available at the (beginning or end) of the 3.7 release], so presumably the idea is that there is a snapshot of package versions that one requires. This part sounds as much like a job for packrat / switchr etc. Maybe 'our' job is to ensure that the appropriate information is discoverable?
> 
> I took as an example the defunct package BioMedR. Our friend Google ("Bioconductor BioMedR") took me to the last-known-good landing page (initially by way of a mirror in Japan...). The DOI on the (bioconductor.org version) of that page took me to the 'Removed packages' ( https://bioconductor.org/about/removed-packages/ ) page, which again points to the last-known-good page. Likewise https://bioconductor.org/packages/BioMedR . The 'In bioc since' tag on the 'last-known-good' page allowed me to find the version of Bioconductor where the package was introduced. With some work I can find the AMI (https://bioconductor.org/help/bioconductor-cloud-ami/ ) and docker images (https://hub.docker.com/r/bioconductor/release_base2/tags/ ) for that release of Bioconductor; neither of these would be sufficient for reproducibility (I could get relevant Bioconductor package versions simply by installing the package from our archive via BiocInstaller / BiocManager, but R packages would be more challenging). The package has an (impressively extensive!) vignette, but the vignette does not include sessionInfo() so one has to do considerable extra work to find the relevant packages. Again maybe packrat / switchr help with this...
> 
> I think 'incoming' versions of such packages would go through the usual review process, in an attempt to hew to some sort of overall Bioconductor standard of quality; the return on this investment would be limited by the short intended shelf-life of the package. These packages often have unique considerations, too, e.g., 'large' data and long build times, maintainer concerns about when the package is released relative to publication, etc. Also of interest would be commitment to the actual data storage and transfer costs and to the management costs of this type of package, coupled with appropriate consideration on scope of the repository (not just the Bioconductor cognoscenti, presumably) and advertising of availability e.g., via https://www.nature.com/sdata/policies/repositories .
> 
> Contemplating this type of package repository suggests a number of small items that provide 'cosmetic' improvements to the current situation (e.g., the removed-packages page could be organized in a tabular fashion to include from / to versions); a more meaningful attempt would probably require efforts to embrace packrat / switchr to avoid reinventing the reproducibility wheel, as well as commitment to reviewing and managing these packages for their long-term contribution. These are certainly noble goals and align with Bioconductor's emphasis on reproducibility; is this something that rises to the level of securing separate funding?
> 
> Martin
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 

-- 
With thanks in advance-

Wolfgang

-------
Wolfgang Huber
Principal Investigator, EMBL Senior Scientist
European Molecular Biology Laboratory (EMBL)
Heidelberg, Germany

wolfgang.huber using embl.de
http://www.huber.embl.de

My book with Susan Holmes: http://www.huber.embl.de/msmb







