[Bioc-devel] commit changes to data packages

Vincent Carey stvjc at channing.harvard.edu
Wed Jul 11 12:42:14 CEST 2012


README.txt should be accessible in experiment/pkgs folder; attached if not
(and if not scrubbed...) ... use  a file called external_data_store.txt in
package top folder to specify which pieces come out of separate data_store
svn folder.

On Wed, Jul 11, 2012 at 12:39 PM, Vincent Carey
<stvjc at channing.harvard.edu> wrote:

> alternative way.  i will find the doc.  basically there is a data_store
> folder parallel to pkgs and the add_data.py utility works with that
> so that software and doc can be maintained separately from voluminous data.
>
>
> On Wed, Jul 11, 2012 at 12:30 PM, Laurent Gatto <laurent.gatto at gmail.com> wrote:
>
>> Dear all,
>>
>> I have checked a complete working copy of a data package out, as
>> described in [1]. After updating infrastructure files (DESCRIPTION and
>> NEWS) and the actual data, svn status tells me the following
>> ?       data
>> M       DESCRIPTION
>> M       NEWS
>> ?       inst/extdata
>>
>> Should I 'svn add data inst/extdata' and then commit, as usual, or is
>> there an alternative way to explicitly commit  data files to the
>> external data_store separately?
>>
>> Thank you in advance,
>>
>> Best wishes,
>>
>> Laurent
>>
>>
>> [1] http://www.bioconductor.org/developers/source-control/
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
-------------- next part --------------
========================================
 BioC Experiment Data Package SVN Repos
========================================

:Date: 2006-08-10
:Author: S. Falcon
:svn URL: https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs


Background
==========

This svn directory contains BioC experiment data packages.  Data
packages contain potentially large binary files that do not change
often.  Most updates to these packages involve the package
infrastructure files.

Obtaining a working copy of a data package over a slow connection can
be frustrating, especially when all that is needed is the
infrastructure files and not the actual data.  We have implemented a
scheme that allows separate checkout of infrastructure files and data
files.  This document describes the scheme and provides instructions
for checkout and update of existing data packages as well as for
adding new packages.


How to Create an Infrastructure-Only Working Copy
=================================================

You can obtain a checkout of all experiment data package
infrastructure files as follows::

    svn checkout \
      https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs

To obtain the files for a particular package, say ``ALL``::

    svn checkout \
      https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/ALL

If you want to preview what is available, you might try the
following::

    # get the top-level scripts, but don't recurse into subdirs        
    svn checkout -N \
      https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs

    # see what is there
    svn ls https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs

    # get a particular package's infrastructure files
    cd pkgs
    svn up ALL
    # see next section for getting complete working copy w/ data


How to Create a Complete Working Copy
=====================================

First create a working copy of the infrastructure files as described
above.

Next use the helper script ``add_data.py`` (you will need Python).  It
is located here:

    https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/add_data.py

Here's a complete example for ``ALL``::

    svn checkout \
      https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/ALL

    python add_data.py ALL

This will add the big data directories (usually data/, but sometimes
also dirs under inst/) to your working copy.  Usually, the svn:ignore
property has been set so that you won't accidentally add these dirs
when working with the package, but please take care anyway.


A note about committing changes to the data
-------------------------------------------

If you want to modify the actual data, cd into the appropriate dir
after having run add_data.py and do your commit from there.  The
script adds a full working copy inside the infrastructure working
copy.


How to Add a New Data Package
=============================

1. Add the infrastructure files under ``pkgs``.

2. Add any large data directories to ../data_store/PKGNAME/.  For
   example, if there is large data in PKGNAME/data and
   PKGNAME/inst/extdata, you would add PKGNAME/data and
   PKGNAME/inst/extdata to ../data_store.

3. Create a file 'external_data_store.txt' listing each dir that is
   stored externally (each on a separate line).  Continuing the example
   above, the file would contain::

        data
        inst/extdata

   This should go in the top-level of the package dir.

4. Add svn ignore properties.  Continuing the example::

       cd PKGNAME
       svn propset svn:ignore '*' ./data/  ## sets svn:ignore on the data folder
       svn propset svn:ignore '*' ./inst/

       ## or (this might not work anymore)
       svn propedit svn:ignore .     ## add 'data' here
       svn propedit svn:ignore inst  ## add 'extdata' here

5. Commit.
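
The ``propedit`` variant in step 4 sets the ignore entry on the parent
of each externally stored dir.  As a rough sketch of that mapping, the
hypothetical helper ``ignore_entries`` below (not part of the
repository) groups the dirs listed in ``external_data_store.txt`` by
the parent directory whose ``svn:ignore`` property would name them:

```python
import posixpath


def ignore_entries(external_dirs):
    """Group externally stored dirs by the parent that should ignore them.

    Hypothetical helper for illustration; it only computes the mapping
    and does not run any svn command.
    """
    grouped = {}
    for path in external_dirs:
        parent, name = posixpath.split(path.rstrip("/"))
        # A top-level dir such as 'data' is ignored on the package root '.'.
        grouped.setdefault(parent or ".", []).append(name)
    return grouped


for parent, names in ignore_entries(["data", "inst/extdata"]).items():
    print(parent, names)
```

For the running example this pairs ``.`` with ``data`` and ``inst``
with ``extdata``, matching the two ``propedit`` lines in step 4.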


Details of Storage Scheme
=========================

Experiment data package infrastructure files live in
``experiment/pkgs``.  Package subdirectories that contain large files
are stored under ``experiment/data_store``.  There is no mechanism to
support separate storage of individual files.

Here is an example of how data for the ``davidTiling`` package is
stored::

    experiment/pkgs/davidTiling/
                                 DESCRIPTION
                                 NAMESPACE
                                 R/
                                 external_data_store.txt
                                 inst/
                                 man/

    experiment/data_store/davidTiling/
                                 data/
                                 inst/
                                      celfiles/
                                      website/

The ``davidTiling`` package contains large data in the ``data/``,
``inst/celfiles/``, and ``inst/website/`` subdirectories.  As you can
see, each of these is stored separately from the package
infrastructure files.  The file ``external_data_store.txt`` lists the
location of the externally stored data.  Here are the contents for
``davidTiling``::

    data
    inst/website
    inst/celfiles
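
A minimal sketch of reading such a file, assuming one directory per
line as described above.  The function name and the handling of blank
lines and trailing slashes are assumptions made for illustration; the
real ``add_data.py`` may behave differently:

```python
def parse_external_data_store(text):
    """Return the externally stored directories listed in the file text.

    Assumption: blank lines and surrounding whitespace are ignored and
    trailing slashes are dropped.
    """
    dirs = []
    for line in text.splitlines():
        entry = line.strip().rstrip("/")
        if entry:
            dirs.append(entry)
    return dirs


example = "data\ninst/website\ninst/celfiles\n"
print(parse_external_data_store(example))
# prints ['data', 'inst/website', 'inst/celfiles']
```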

To create a complete directory containing both infrastructure and
data, one first does a checkout of the infrastructure and then does a
checkout of each individual externally stored subdir.  This can be
done inside the infrastructure working copy.  There is a helper script
to automate the required svn commands.  One option that might be worth
adding to the script is to do an export instead of checkout.
Additionally, the ``svn:ignore`` property has been set in the
infrastructure dir to help prevent folks from accidentally adding the
external data to the infrastructure dir itself.
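
The checkout sequence described above could be sketched roughly as
follows.  This is not the actual ``add_data.py``; the function name is
hypothetical, and the ``data_store`` URL is an assumption derived from
the ``pkgs`` URL at the top of this document and the statement that
``data_store`` sits parallel to ``pkgs``.  The sketch only builds the
``svn checkout URL PATH`` commands and does not run them:

```python
import posixpath

# Assumption: data_store lives next to pkgs under the same trunk URL.
DATA_STORE_URL = "https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/data_store"


def checkout_commands(pkg, external_dirs, base_url=DATA_STORE_URL):
    """Build one `svn checkout URL PATH` command per external directory.

    Each target path is inside the infrastructure working copy of
    `pkg`, so every checkout becomes a nested working copy.
    """
    commands = []
    for subdir in external_dirs:
        url = posixpath.join(base_url, pkg, subdir)
        target = posixpath.join(pkg, subdir)
        commands.append(["svn", "checkout", url, target])
    return commands


for cmd in checkout_commands("davidTiling", ["data", "inst/website", "inst/celfiles"]):
    print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps the sketch
easy to inspect and lets a caller hand each command to
``subprocess.run`` unchanged.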

