[Bioc-devel] package size

Wed Jul 21 07:57:43 CEST 2010

Going with Martin's first suggestion, is 37 seconds too long to download the data for an example in the function documentation? This is the second of two examples for the package's main function; the first uses a smaller dataset that loads faster. The remaining code in the second example takes under 8 seconds, including the code that accesses the data in the GEOquery object.

Of course, the times will vary. My computer has a 2.8 GHz Intel Core 2 Duo and 4 GB of RAM, and runs Windows 7.
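
For reference, here is a minimal sketch of the download-once idea (option 1a in the quoted message below): fetch the data the first time, save it to an image file, and load that file on later runs. The helper name load_cached, the stand-in slow_fetch function, and the "GSExxxx" accession in the comment are all hypothetical placeholders, not part of the package or of GEOquery; in a real example the fetch step would presumably be a getGEO() call.

```r
# Hypothetical helper: call fetch() once, save the result to cache_file,
# and load the saved copy on every later call.
load_cached <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    env <- new.env()
    load(cache_file, envir = env)      # restores the object named "obj"
    return(get("obj", envir = env))
  }
  obj <- fetch()                       # the slow step (e.g. a download)
  save(obj, file = cache_file)
  obj
}

# Stand-in for the slow download so the sketch runs offline. In a package
# example this might instead be something like
#   function() GEOquery::getGEO("GSExxxx", GSEMatrix = TRUE)
# where "GSExxxx" is a placeholder, not a real accession.
n_downloads <- 0
slow_fetch <- function() {
  n_downloads <<- n_downloads + 1      # count how often we "download"
  data.frame(probe = c("p1", "p2"), value = c(1.5, 2.5))
}

cache_file <- tempfile(fileext = ".RData")
first  <- load_cached(cache_file, slow_fetch)  # fetches and saves
second <- load_cached(cache_file, slow_fetch)  # loads the saved file
```

With this pattern only the first run pays the download cost; deleting the cache file forces a fresh download.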

> sessionInfo()
R version 2.11.1 (2010-05-31) 
i386-pc-mingw32 

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] hgu95av2cdf_2.6.0   affydata_1.11.10    affy_1.26.1         QuantCombine_0.99.0 GEOquery_2.12.0    
[6] RCurl_1.4-2         bitops_1.0-4.1      Biobase_2.8.0      

loaded via a namespace (and not attached):
[1] affyio_1.16.0         preprocessCore_1.10.0 tools_2.11.1         
> 

________________________________________
From: henrik.bengtsson at gmail.com [henrik.bengtsson at gmail.com] on behalf of Henrik Bengtsson [hb at stat.berkeley.edu]
Sent: Tuesday, July 20, 2010 2:02 AM
To: Martin Morgan
Cc: Bazeley, Peter; bioc-devel at stat.math.ethz.ch
Subject: Re: [Bioc-devel] package size

Consider package updates as well: even if you just make a tiny bug fix,
users have to download all that data again.

Martin's suggestion to keep a separate experiment data package is a
good option.  It also makes the data available for others to use
in their examples (without having to install your main package's
dependencies), e.g. for "competing" methods.

/Henrik

On Tue, Jul 20, 2010 at 6:31 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Hi Peter,
>
> On 07/19/2010 09:10 PM, Bazeley, Peter wrote:
>> Dear List,
>>
>> I am creating a package, the purpose of which is to combine data from
>> different microarray platforms. I have found an NCBI GEO data series
>> with 3 different platforms (1 Affymetrix and 2 Illumina) that works
>> well for illustrating my package functions. It would be nice to keep
>> this data series as a data object for use in the function examples
>> (currently, 4 of 5 functions use this data object in their example
>> code) in the documentation, but the xz compressed .rda file
>> (consisting of 3 data frames, one for each data set) is about 5MB
>
> Hmm, but if they are expression data, then an ExpressionSet would more
> fully represent the data? See library(GEOquery); ?getGEO with the
> GSEMatrix option set to TRUE, and
>
>  http://bioconductor.org/packages/2.6/bioc/html/Biobase.html
>
> and the 'An Introduction to Biobase and ExpressionSets' vignette.
>
>> (total package size is 6MB). Is this too big?
>>
>> There are 2 alternatives:
>>
>> 1) The package includes a function to download datasets using the
>> GEOquery package, which could be used to easily re-create the data
>> frames included in my .rda file. The only downside is that it takes
>> several minutes to download all the data, so it may be inconvenient,
>> since this data object is used in example code for the 4 functions.
>>
>> 1a) I could have each function example contain code to either a)
>> download the data and save it in an .RData image file, or b) load the
>> image file saved in a). This way the investigator would only have to
>> endure the download once, unless they chose not to save the data.
>>
>> 2) I could take, say, the first 1000 genes from each platform. I did
>> this, and the combined data only has 19 probes/probesets (they are
>> mapped by Accession/UniGene IDs, and the common transcripts are
>> extracted). It would be nice to have a larger example, although not
>> necessary. Alternatively, I could find a better set of 1000 (or
>> however many), so that more than 19 are present.
>
> A third is to create an experiment data package like those at
>
>  http://bioconductor.org/packages/release/ExperimentData.html
>
> that contains the entire data set. This way you get a rich and reproducible
> example to illustrate your tools. These are really just packages with
> data objects in the inst/extdata/ (for CEL and other non-R formats) or
> data/ (for R data objects) directories, and man pages describing the data.
>
> Perhaps there is already an experiment data package that meets your needs?
>
> Martin
>
>>
>>
>> Thank you for any assistance, Peter Bazeley
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793