[Bioc-devel] package size

Bazeley, Peter Peter.Bazeley at rockets.utoledo.edu
Wed Jul 21 07:57:43 CEST 2010

Going with Martin's first suggestion: is 37 seconds to download the data too long or inconvenient for an example in the function documentation? This is for the package's main function, and it is the second of 2 examples; the first uses a smaller, faster-loading dataset. The remaining code in this 2nd example takes under 8 seconds, including the code to access the data in the GEOquery object.

Of course, the times will vary. My computer has a 2.8 GHz Intel Core 2 Duo, 4GB of RAM, and Windows 7.
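
A minimal sketch of how such an example could avoid repeating the download, along the lines of option 1a in my original message quoted below: fetch once, save the object to a local file, and read that file on later runs. The helper name cached_fetch, the cache file name, and the fetch argument are made up for illustration, and the sketch uses saveRDS()/readRDS() rather than a full .RData image. In the real example, fetch would wrap the getGEO() call; here a stand-in function is used so the code runs without network access.

```r
# Hypothetical helper: run `fetch()` only when no cached copy exists.
cached_fetch <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    # Later runs: read the saved object instead of downloading again.
    return(readRDS(cache_file))
  }
  obj <- fetch()              # first run: the slow step (e.g. a GEO download)
  saveRDS(obj, cache_file)    # keep a local copy for next time
  obj
}

# Stand-in for the real download; in the package example this would be
# something like function() getGEO(<series accession>, GSEMatrix = TRUE).
n_calls <- 0
slow_fetch <- function() {
  n_calls <<- n_calls + 1     # count how often the "download" really runs
  data.frame(probe = c("p1", "p2"), value = c(1.5, 2.5))
}

cache_file <- file.path(tempdir(), "example_geo_cache.rds")
unlink(cache_file)  # start from a clean state for this demonstration

first  <- cached_fetch(cache_file, slow_fetch)   # calls slow_fetch()
second <- cached_fetch(cache_file, slow_fetch)   # reads the cache file
```

In a man-page example the cache would more likely live in a directory chosen by the user than in tempdir(), which is removed at the end of the R session.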

> sessionInfo()
R version 2.11.1 (2010-05-31) 

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] hgu95av2cdf_2.6.0   affydata_1.11.10    affy_1.26.1         QuantCombine_0.99.0 GEOquery_2.12.0    
[6] RCurl_1.4-2         bitops_1.0-4.1      Biobase_2.8.0      

loaded via a namespace (and not attached):
[1] affyio_1.16.0         preprocessCore_1.10.0 tools_2.11.1         

From: henrik.bengtsson at gmail.com [henrik.bengtsson at gmail.com] on behalf of Henrik Bengtsson [hb at stat.berkeley.edu]
Sent: Tuesday, July 20, 2010 2:02 AM
To: Martin Morgan
Cc: Bazeley, Peter; bioc-devel at stat.math.ethz.ch
Subject: Re: [Bioc-devel] package size

Consider also package updates; even if you just do a tiny bug fix,
then one has to download all that data again.

Martin's suggestion to keep a separate experimental data package is a
good option.  It will also make the data available to others to use
in their examples (without having to install your main package's
dependencies), e.g. "competing" methods.
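
As a rough sketch of what that looks like from the example code's side: the examples would load the object from the data package by name. The helper load_example_data below is hypothetical, and the built-in 'datasets' package stands in for the experiment data package (which does not exist yet), so the call can be run as-is.

```r
# Hypothetical helper: load one data object from a separately installed
# data package and return it, without attaching that package.
load_example_data <- function(name, package) {
  if (!requireNamespace(package, quietly = TRUE)) {
    stop("Please install the data package '", package, "' to run this example.")
  }
  env <- new.env()
  # data() places the object in `env` rather than the user's workspace.
  utils::data(list = name, package = package, envir = env)
  get(name, envir = env)
}

# 'datasets' stands in for the (not yet existing) experiment data package.
example_data <- load_example_data("iris", "datasets")
dim(example_data)
```

Anyone comparing methods could make the same call without installing the main package at all.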


On Tue, Jul 20, 2010 at 6:31 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Hi Peter,
> On 07/19/2010 09:10 PM, Bazeley, Peter wrote:
>> Dear List,
>> I am creating a package, the purpose of which is to combine data from
>> different microarray platforms. I have found an NCBI GEO data series
>> with 3 different platforms (1 Affymetrix and 2 Illumina) that works
>> well for illustrating my package functions. It would be nice to keep
>> this data series as a data object for use in the function examples
>> (currently, 4 of 5 functions use this data object in their example
>> code) in the documentation, but the xz compressed .rda file
>> (consisting of 3 data frames, one for each data set) is about 5MB
> Hmm, but if they are expression data, then an ExpressionSet would more
> fully represent the data? See library(GEOquery); ?getGEO with the
> GSEMatrix option set to TRUE, and
>  http://bioconductor.org/packages/2.6/bioc/html/Biobase.html
> and the 'An Introduction to Biobase and ExpressionSets' vignette.
>> (total package size is 6MB). Is this too big?
>> There are 2 alternatives:
>> 1) The package includes a function to download datasets using the
>> GEOquery package, which could be used to easily re-create the data
>> frames included in my .rda file. The only downside is that it takes
>> several minutes to download all the data, so it may be inconvenient,
>> since this data object is used in example code for the 4 functions.
>> 1a) I could have each function example contain code to either a)
>> download the data and save it in an .RData image file, or b) load the
>> image file saved in a). This way the investigator would only have to
>> endure the download once, unless they chose not to save the data.
>> 2) I could take, say, the first 1000 genes from each platform. I did
>> this, and the combined data only has 19 probes/probesets (they are
>> mapped by Accession/UniGene IDs, and the common transcripts are
>> extracted). It would be nice to have a larger example, although not
>> necessary. Alternatively, I could find a better set of 1000 (or
>> however many), so that more than 19 are present.
> A third is to create an experiment data package like those at
>  http://bioconductor.org/packages/release/ExperimentData.html
> that contains the entire data. This way you get a rich and reproducible
> example to illustrate your tools. These are really just packages with
> data objects in the inst/extdata/ (for CEL and other non-R formats) or
> data/ (for R data objects) directories, and man pages describing the data.
> Perhaps there is already an experiment data package that meets your needs?
> Martin
>> Thank you for any assistance, Peter Bazeley
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
