[Bioc-devel] package size

Martin Morgan mtmorgan at fhcrc.org
Tue Jul 20 06:31:57 CEST 2010


Hi Peter,

On 07/19/2010 09:10 PM, Bazeley, Peter wrote:
> Dear List,
> 
> I am creating a package, the purpose of which is to combine data from
> different microarray platforms. I have found an NCBI GEO data series
> with 3 different platforms (1 Affymetrix and 2 Illumina) that works
> well for illustrating my package functions. It would be nice to keep
> this data series as a data object for use in the function examples
> (currently, 4 of 5 functions use this data object in their example
> code) in the documentation, but the xz compressed .rda file
> (consisting of 3 data frames, one for each data set) is about 5MB

Hmm, but if these are expression data, then an ExpressionSet would
represent them more fully than three data frames, because it keeps the
expression values together with the sample and feature annotation. See
library(GEOquery); ?getGEO with the GSEMatrix argument set to TRUE, and

  http://bioconductor.org/packages/2.6/bioc/html/Biobase.html

and the 'An Introduction to Biobase and ExpressionSets' vignette.
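
As a rough sketch of the getGEO() route (not tested against your series):
fetch_series() is a hypothetical wrapper name, and "GSExxxxx" is a
placeholder accession. It assumes GEOquery and Biobase are installed and
that a network connection is available.

```r
# Sketch only: needs the GEOquery package and a network connection.
fetch_series <- function(gse_id) {
  if (!requireNamespace("GEOquery", quietly = TRUE)) {
    stop("the GEOquery package is required")
  }
  # With GSEMatrix = TRUE, getGEO() returns a list of ExpressionSet
  # objects, one per platform in the series.
  GEOquery::getGEO(gse_id, GSEMatrix = TRUE)
}

# Usage (not run; "GSExxxxx" is a placeholder accession):
# esets <- fetch_series("GSExxxxx")
# length(esets)                      # number of platforms
# dim(Biobase::exprs(esets[[1]]))    # features x samples, first platform
# head(Biobase::pData(esets[[1]]))   # sample annotation, first platform
```

For a three-platform series this should give a list of three
ExpressionSets in place of the three data frames.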

> (total package size is 6MB). Is this too big?
> 
> There are 2 alternatives:
> 
> 1) The package includes a function to download datasets using the
> GEOquery package, which could be used to easily re-create the data
> frames included in my .rda file. The only downside is that it takes
> several minutes to download all the data, so it may be inconvenient,
> since this data object is used in example code for the 4 functions.
> 
> 1a) I could have each function example contain code to either a)
> download the data and save it in an .RData image file, or b) load the
> image file saved in a). This way the investigator would only have to
> endure the download once, unless they chose not to save the data.
> 
> 2) I could take, say, the first 1000 genes from each platform. I did
> this, and the combined data only has 19 probes/probesets (they are
> mapped by Accession/UniGene IDs, and the common transcripts are
> extracted). It would be nice to have a larger example, although not
> necessary. Alternatively, I could find a better set of 1000 (or
> however many), so that more than 19 are present.
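
For 1a) and 2), something along these lines might work. It is only a
sketch in base R: load_cached() and common_ids() are hypothetical helper
names, 'fetch' stands for whatever function does the slow download, and
the IDs are assumed to be already mapped to a common type (e.g. UniGene).
It uses saveRDS()/readRDS() rather than a full .RData image, so that
only the one object is stored.

```r
# Option 1a: download once, then reuse the local copy.
# `fetch` is a placeholder for the function that does the slow download.
load_cached <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))   # fast path: reuse the saved copy
  }
  obj <- fetch()            # slow path: download the data
  saveRDS(obj, path)        # keep a copy for next time
  obj
}

# Option 2: pick the subset from IDs shared by every platform,
# rather than from the first rows of each table.
# `id_list` is a list of character vectors, one per platform.
common_ids <- function(id_list, n = 1000) {
  shared <- Reduce(intersect, id_list)
  head(shared, n)
}

# Toy usage with made-up IDs:
ids <- list(affy  = c("A", "B", "C", "D"),
            ilmn1 = c("B", "C", "E"),
            ilmn2 = c("C", "B", "F"))
keep <- common_ids(ids)
```

Subsetting each platform to 'keep' before combining guarantees that
every retained transcript is present on all platforms, instead of
hoping that the first 1000 rows happen to overlap.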

A third alternative is to create an experiment data package like those at

  http://bioconductor.org/packages/release/ExperimentData.html

that contains the entire data set. This way you get a rich and
reproducible example to illustrate your tools. These are really just
packages with man pages describing the data and the data objects
themselves, stored in inst/extdata/ (for CEL and other non-R formats)
or data/ (for R data objects).
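
The examples in your software package could then load the data from the
data package. A minimal sketch, where "myExperimentData" and
"gse_platforms" are hypothetical package and object names, and
load_example_data() is a hypothetical helper:

```r
# "myExperimentData" and "gse_platforms" are made-up names for illustration.
load_example_data <- function(pkg = "myExperimentData",
                              name = "gse_platforms") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("install the '", pkg, "' data package to run this example")
  }
  # Load the object from the package's data/ directory into a private
  # environment, so the caller's workspace is not modified.
  env <- new.env()
  utils::data(list = name, package = pkg, envir = env)
  get(name, envir = env)
}

# Files under inst/extdata/ are found with system.file(), which returns
# "" when the package or file is not installed:
# system.file("extdata", package = "myExperimentData")
```

The same helper works with any installed package that ships data, e.g.
load_example_data("datasets", "iris").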

Perhaps there is already an experiment data package that meets your needs?

Martin

> 
> 
> Thank you for any assistance, Peter Bazeley 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


