[Bioc-devel] package size

Wed Jul 21 20:16:26 CEST 2010

Hi Peter,

On 07/20/2010 10:57 PM, Bazeley, Peter wrote:
> Going with Martin's first suggestion, is 37 seconds to download the data too long/inconvenient for an example in the function documentation? This is for the package's main function, and the second of 2 examples, with the first using a smaller/faster to load dataset. The remaining code in this 2nd example takes under 8 seconds, including the code to access the data in the GEOquery object.
>
> Of course, the times will vary. My computer has an Intel Core 2 Duo 2.8 GHz, 4GB of RAM, Windows 7.

Download times depend more on the quality of your network connection
than on anything else. So for people with slow internet access, those
times could be multiplied by 5, or 10, or more...
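
If the download stays in the example, caching it along the lines of the
quoted option 1a would at least limit the wait to the first run. Below is a
minimal sketch of that idea, not a tested recipe for this package. The
helper name cached_fetch, the stand-in function slow_download and the
cache file name are all hypothetical; in a real example the fetch function
would presumably wrap a GEOquery call such as getGEO(). It also uses
saveRDS()/readRDS() rather than an .RData image, which is a choice made
here for simplicity.

```r
# Hypothetical helper: run `fetch` once, then reuse the saved copy.
cached_fetch <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    # Cache hit: skip the (slow) download entirely.
    return(readRDS(cache_file))
  }
  obj <- fetch()
  saveRDS(obj, cache_file)
  obj
}

# Stand-in for the real download so the sketch runs offline.
# In practice this might be something like
#   function() GEOquery::getGEO(<accession>, GSEMatrix = TRUE)
# where <accession> is a placeholder for the actual GEO series ID.
n_calls <- 0
slow_download <- function() {
  n_calls <<- n_calls + 1
  data.frame(probe = c("a", "b"), value = c(1.5, 2.5))
}

cache_file <- file.path(tempdir(), "geo_example_cache.rds")
unlink(cache_file)  # start from a clean state for the demonstration

first  <- cached_fetch(cache_file, slow_download)  # downloads and saves
second <- cached_fetch(cache_file, slow_download)  # served from the cache
```

The cache only helps on the second and later runs, so the first run on a
slow connection is still as slow as the download itself.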

Cheers,
H.

>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] hgu95av2cdf_2.6.0   affydata_1.11.10    affy_1.26.1         QuantCombine_0.99.0 GEOquery_2.12.0
> [6] RCurl_1.4-2         bitops_1.0-4.1      Biobase_2.8.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.16.0         preprocessCore_1.10.0 tools_2.11.1
>>
>
>
>
> ________________________________________
> From: henrik.bengtsson at gmail.com [henrik.bengtsson at gmail.com] on behalf of Henrik Bengtsson [hb at stat.berkeley.edu]
> Sent: Tuesday, July 20, 2010 2:02 AM
> To: Martin Morgan
> Cc: Bazeley, Peter; bioc-devel at stat.math.ethz.ch
> Subject: Re: [Bioc-devel] package size
>
> Consider also package updates; even if you just do a tiny bug fix,
> then one has to download all that data again.
>
> Martin's suggestion to keep a separate experiment data package is a
> good option.  It will also make the data available to others to use
> in their examples (without having to install your main package's
> dependencies), e.g. "competing" methods.
>
> /Henrik
>
> On Tue, Jul 20, 2010 at 6:31 AM, Martin Morgan<mtmorgan at fhcrc.org>  wrote:
>> Hi Peter,
>>
>> On 07/19/2010 09:10 PM, Bazeley, Peter wrote:
>>> Dear List,
>>>
>>> I am creating a package, the purpose of which is to combine data from
>>> different microarray platforms. I have found an NCBI GEO data series
>>> with 3 different platforms (1 Affymetrix and 2 Illumina) that works
>>> well for illustrating my package functions. It would be nice to keep
>>> this data series as a data object for use in the function examples
>>> (currently, 4 of 5 functions use this data object in their example
>>> code) in the documentation, but the xz compressed .rda file
>>> (consisting of 3 data frames, one for each data set) is about 5MB
>>
>> Hmm, but if they are expression data, then an ExpressionSet would more
>> fully represent the data? See library(GEOquery); ?getGEO with the
>> GSEMatrix option set to TRUE, and
>>
>>   http://bioconductor.org/packages/2.6/bioc/html/Biobase.html
>>
>> and the 'An Introduction to Biobase and ExpressionSets' vignette.
>>
>>> (total package size is 6MB). Is this too big?
>>>
>>> There are 2 alternatives:
>>>
>>> 1) The package includes a function to download datasets using the
>>> GEOquery package, which could be used to easily re-create the data
>>> frames included in my .rda file. The only downside is that it takes
>>> several minutes to download all the data, so it may be inconvenient,
>>> since this data object is used in example code for the 4 functions.
>>>
>>> 1a) I could have each function example contain code to either a)
>>> download the data and save it in an .RData image file, or b) load the
>>> image file saved in a). This way the investigator would only have to
>>> endure the download once, unless they chose not to save the data.
>>>
>>> 2) I could take, say, the first 1000 genes from each platform. I did
>>> this, and the combined data only has 19 probes/probesets (they are
>>> mapped by Accession/UniGene IDs, and the common transcripts are
>>> extracted). It would be nice to have a larger example, although not
>>> necessary. Alternatively, I could find a better set of 1000 (or
>>> however many), so that more than 19 are present.
>>
>> A third is to create an experiment data package like those at
>>
>>   http://bioconductor.org/packages/release/ExperimentData.html
>>
>> that contains the entire data. This way you get a rich and reproducible
>> example to illustrate your tools. These are really just packages with
>> data objects in the inst/extdata/ (for CEL and other non-R formats) or
>> data/ (for R data objects) directories, and man pages describing the data.
>>
>> Perhaps there is already an experiment data package that meets your needs?
>>
>> Martin
>>
>>>
>>>
>>> Thank you for any assistance, Peter Bazeley
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319