[Rd] how to unbreak a circular package dependence (S4 class data)

Fri Jan 31 09:17:43 CET 2014

Kasper,

here's how I deal with a largish data set (although I keep data + code
in one package, precisely because of that kind of circular dependency):

The data set is stored PCA-compressed (only the first few principal
components) in matrices plus some meta information (vector, list,
data.frame). 

I then have an internal function that reconstructs my example data:

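.make.chondro <- function (){
  ## spectra matrix = scores %*% t (loadings), plus the center added back
  ## to every row (rep (..., each = nrow) lines up with the columns)
  new ("hyperSpec",
       spc =  (tcrossprod (.chondro.scores, .chondro.loadings) +
               rep (.chondro.center, each = nrow (.chondro.scores))),
       wavelength = .chondro.wl,
       data = .chondro.extra, labels = .chondro.labels)
}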
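
To illustrate the compression step that precedes this reconstruction,
here is a minimal sketch on toy data using base R's prcomp (). The
matrix x, the number of components ncomp and all variable names are
assumptions for the example only; this is not the actual hyperSpec
code.

```r
set.seed (42)

## toy "spectra": 20 rows that are mixtures of 2 underlying components
components <- matrix (rnorm (2 * 50), nrow = 2)
x <- matrix (runif (20 * 2), ncol = 2) %*% components + 5

## compress: keep only the first ncomp principal components
pca   <- prcomp (x, center = TRUE, scale. = FALSE)
ncomp <- 2
scores   <- pca$x        [, 1 : ncomp, drop = FALSE]
loadings <- pca$rotation [, 1 : ncomp, drop = FALSE]
center   <- pca$center

## reconstruct, using the same expression as in .make.chondro ()
x.rec <- tcrossprod (scores, loadings) +
         rep (center, each = nrow (scores))

## largest absolute reconstruction error
max (abs (x - x.rec))
```

Because the toy data has only two components by construction, the
error should be at the level of floating point noise. With real data
the number of retained components trades package size against
reconstruction accuracy.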

The result of that function is assigned when the data is first used:

delayedAssign ("chondro", .make.chondro ())
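
The lazy behaviour can be checked with a small stand-alone sketch. The
names .make.big, big and evaluated are made up for this example.

```r
evaluated <- FALSE

## stand-in for an expensive reconstruction function
.make.big <- function () {
  evaluated <<- TRUE     # record that the function body actually ran
  seq_len (10)
}

## creates a promise; .make.big () is not called yet
delayedAssign ("big", .make.big ())
evaluated.before <- evaluated   # still FALSE at this point

## first use forces the promise and runs .make.big () exactly once
n <- length (big)
evaluated.after <- evaluated    # now TRUE
```

The cost of building the object is thus only paid by users who
actually touch the data.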

With that it should be possible to have the data package Suggests: the
main package, while the main package Depends: on the data (though I
have not yet found the time to separate the two).
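
A rough sketch of how the data package side might guard against a
missing main package follows. It assumes the main package is only
listed under Suggests:, and "Foo", .make.example and example.data are
hypothetical names. The check uses requireNamespace (), so the error
only appears when the data is actually used.

```r
## Hypothetical constructor inside the data package. "Foo" is a stand-in
## name for the main package, assumed to be listed under Suggests: only.
.make.example <- function (pkg = "Foo") {
  if (! requireNamespace (pkg, quietly = TRUE))
    stop ("package '", pkg, "' is needed to build this data set")

  ## the real code would construct the S4 object here, e.g. with new ();
  ## this placeholder only reports that the package could be loaded
  TRUE
}

## creating the promise does not touch the main package at all
delayedAssign ("example.data", .make.example ())
```

Building and checking the data package then does not need the class
definitions unless an example or test forces example.data.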

Side note: the original raw data file (compressed ASCII) is available
together with a variety of other raw data files from the project
home page - interested users can find download links in the
vignettes and help pages.

Best,

Claudia

On Tue, 28 Jan 2014 21:21:20 -0500
Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:

> This is a great comment if the primary use of the data is to make the
> data available.
> 
> It is clear that a change in the internals of the class structure
> requires changing the data package, and that is a clear drawback to my
> recommendation.  I have had to do this on several occasions.
> 
> One issue with Herve's recommendation is when the same data structure
> is used in several examples.  In that case, the conversion / parsing
> overhead multiplies by the number of examples.  As an example, in
> minfiData I have data on 6 samples on a somewhat large array.
> Parsing the raw data files for 3 of the 6 files takes 16 secs (you
> get this timing, because this is what I have in
> example(read.450k.exp)).  Loading all 6 arrays as an R data structure
> takes 1.1 sec.
> 
> I would generally recommend that a data package either includes a
> more raw form of the data or has a script which makes the data easily
> retrievable.
> 
> Best,
> Kasper
> 
> 
> On Tue, Jan 28, 2014 at 8:01 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> 
> > Hi Daniel,
> >
> >
> > On 01/28/2014 03:49 PM, Daniel Kelley wrote:
> >
> >> I have an issue with a circular package dependence that prevents
> >> building/checking, and I seek advice on breaking the circle so the
> >> packages can pass the build-check tests that are required for CRAN
> >> submission.
> >>
> >> The package pair I'm working with is slow to build, but my tests
> >> suggest the issue may be general, and so I will explain it in
> >> general terms.
> >>
> >> Suppose there are two packages:
> >>
> >> 1. Foo, a package that defines some data types with S4 classes.
> >>
> >> 2. Foodata, a package that provides such datasets, for use by Foo.
> >>
> >> With this setup, it seems reasonable that Foo "depends" on
> >> Foodata, so the data can be used in Foo and its documentation.
> >>
> >> Since the data within Foodata are S4 classes as defined in Foo, an
> >> attempt to build-check Foodata will produce an error unless Foo is
> >> present. But Foo cannot be built unless Foodata exists, since it
> >> depends on it. Thus neither Foo nor Foodata can be built and
> >> checked.
> >>
> >
> > I've learned by experience that it's generally better (although not
> > always possible) to avoid putting serialized S4 objects in a data
> > package. They will break if you need to modify the internals of the
> > class even slightly (and chances are high that you will at some
> > point). Better to store the data in a format that is more or less
> > guaranteed to remain the same for years (SQLite, XML, hdf5, plain
> > text, serialized data frame, SAM/BAM, etc...) and try to come up
> > with a fast way to load and turn the data into an S4 object on
> > demand.
> >
> > Not always possible if the data is huge... but for the purpose of
> > using it in Foo's examples and vignette do you really need huge
> > data?
> >
> > Another advantage of this approach is that the data can then be
> > more easily shared because it can be accessed with tools other
> > than yours, e.g. tools that don't know about S4 and even non-R
> > tools.
> >
> > Cheers,
> > H.
> >
> >
> >> One solution would be to wrap the Foo documentation examples (and
> >> relevant Foo code) in require() blocks, and to make Foo "suggest"
> >> Foodata, not "depend" upon it.  My question is whether this is the
> >> recommended practice, or the common practice.
> >>
> >> Thanks in advance to anyone who wishes to offer hints.
> >>
> >> PS. The problem arose from an attempt to reduce CRAN load by
> >> extracting the datasets that had been contained within a previous
> >> version of Foo.
> >>
> >> PPS. my (slow-building) packages are on github and I can supply
> >> details if needed.
> >>
> >> Dan E. Kelley
> >> Professor, Oceanography Department
> >> Dalhousie University, Canada
> >> Dan.Kelley at Dal.CA
> >>
> >>
> >>
> >>
> >>
> >>
> >> ______________________________________________
> >> R-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >>
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fhcrc.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
> >
> >
> >
> 
> 

-- 
Claudia Beleites, Chemist
Spectroscopy/Imaging
Institute of Photonic Technology 
Albert-Einstein-Str. 9
07745 Jena
Germany

email: claudia.beleites at ipht-jena.de
phone: +49 3641 206-133
fax:   +49 2641 206-399