[Bioc-devel] package size

Henrik Bengtsson hb at stat.berkeley.edu
Tue Jul 20 09:02:24 CEST 2010


Consider package updates as well: even after a tiny bug fix, users
have to download all that data again.

Martin's suggestion to keep a separate experiment data package is a
good option.  It also makes the data available to others for use
in their examples (without having to install your main package's
dependencies), e.g. for "competing" methods.
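
A rough sketch of what that looks like from the other side, i.e. someone
pulling a data set out of a data package without touching the methods
package.  The package and data set names are stand-ins: 'datasets' and
'iris' ship with R and are used here only so the snippet runs as is; a
real experiment data package would be named in their place.

```r
# Stand-in names (assumptions): replace with the experiment data
# package and the data object it provides.
pkg  <- "datasets"
name <- "iris"

# Load the object into a private environment rather than the workspace,
# so nothing in the caller's session is overwritten.
env <- new.env()
data(list = name, package = pkg, envir = env)
example_data <- get(name, envir = env)

dim(example_data)
```

Only the data package has to be installed for this to work, which is the
point for anyone comparing against your method.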

/Henrik

On Tue, Jul 20, 2010 at 6:31 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Hi Peter,
>
> On 07/19/2010 09:10 PM, Bazeley, Peter wrote:
>> Dear List,
>>
>> I am creating a package, the purpose of which is to combine data from
>> different microarray platforms. I have found an NCBI GEO data series
>> with 3 different platforms (1 Affymetrix and 2 Illumina) that works
>> well for illustrating my package functions. It would be nice to keep
>> this data series as a data object for use in the function examples
>> (currently, 4 of 5 functions use this data object in their example
>> code) in the documentation, but the xz compressed .rda file
>> (consisting of 3 data frames, one for each data set) is about 5MB
>
> Hmm, but if they are expression data, then an ExpressionSet would more
> fully represent the data? See library(GEOquery); ?getGEO with the
> GSEMatrix option set to TRUE, and
>
>  http://bioconductor.org/packages/2.6/bioc/html/Biobase.html
>
> and the 'An Introduction to Biobase and ExpressionSets' vignette.
>
>> (total package size is 6MB). Is this too big?
>>
>> There are 2 alternatives:
>>
>> 1) The package includes a function to download datasets using the
>> GEOquery package, which could be used to easily re-create the data
>> frames included in my .rda file. The only downside is that it takes
>> several minutes to download all the data, so it may be inconvenient,
>> since this data object is used in example code for the 4 functions.
>>
>> 1a) I could have each function example contain code to either a)
>> download the data and save it in an .RData image file, or b) load the
>> image file saved in a). This way the investigator would only have to
>> endure the download once, unless they chose not to save the data.
>>
>> 2) I could take, say, the first 1000 genes from each platform. I did
>> this, and the combined data only has 19 probes/probesets (they are
>> mapped by Accession/UniGene IDs, and the common transcripts are
>> extracted). It would be nice to have a larger example, although not
>> necessary. Alternatively, I could find a better set of 1000 (or
>> however many), so that more than 19 are present.
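
For what it is worth, the download-once idea in 1a could be sketched
roughly as below.  This is only an illustration under assumptions:
'cached' and 'slow_fetch' are made-up names, 'slow_fetch' is a stand-in
for the package's GEOquery-based download (no network access happens
here), and saveRDS()/readRDS() are swapped in for the .RData image so
the object comes back under a name the caller chooses.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise build it with fetch() and save it for next time.
cached <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  obj <- fetch()
  saveRDS(obj, cache_file)
  obj
}

# Stand-in for the slow download; counts how often it is called.
n_calls <- 0
slow_fetch <- function() {
  n_calls <<- n_calls + 1
  data.frame(probe = c("p1", "p2"), value = c(1.5, 2.5))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached(cache_file, slow_fetch)  # calls slow_fetch and saves
second <- cached(cache_file, slow_fetch)  # reads the saved file instead
```

Each example would then call the helper with the same cache file, so the
download is paid for once per machine rather than once per example.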
>
> A third is to create an experiment data package like those at
>
>  http://bioconductor.org/packages/release/ExperimentData.html
>
> that contains the entire data. This way you get a rich and reproducible
> example to illustrate your tools. These are really just packages with
> data objects in the inst/extdata/ (for CEL and other non-R formats) or
> data/ (for R data objects) directories, and man pages describing the data.
>
> Perhaps there is already an experiment data package that meets your needs?
>
> Martin
>
>>
>>
>> Thank you for any assistance, Peter Bazeley
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>


