[Bioc-devel] package size

Bazeley, Peter Peter.Bazeley at rockets.utoledo.edu
Thu Jul 22 16:23:15 CEST 2010

Sorry, I should have mentioned that the download time I quoted was for a rewritten version of the function's example, which uses GEOquery to download the 3 GSE SuperSeries and extract the expression data from them. I think I will move this example into a vignette and use the Dilution data in the function's documentation examples instead. This is what I should have done in the first place to better integrate with existing packages.
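
For the vignette, the download-once idea from option 1a in the quoted thread below could look roughly like this. It is only a sketch: load_cached() and fake_download() are hypothetical names, the stand-in downloader takes the place of a real GEOquery call, and saveRDS()/readRDS() are used here instead of the .RData image file mentioned in the thread.

```r
# Hypothetical helper: return the cached object if the cache file exists,
# otherwise call `download`, save its result, and return it.
load_cached <- function(cache_file, download) {
  if (file.exists(cache_file)) {
    readRDS(cache_file)
  } else {
    obj <- download()
    saveRDS(obj, cache_file)
    obj
  }
}

# Stand-in for a slow download (assumption: a real version might wrap
# a GEOquery call). It counts how often it is invoked.
calls <- 0
fake_download <- function() {
  calls <<- calls + 1
  data.frame(id = 1:3, value = c(0.5, 1.2, 2.3))
}

cache_file <- tempfile(fileext = ".rds")
first  <- load_cached(cache_file, fake_download)  # runs the download
second <- load_cached(cache_file, fake_download)  # reads the cache
```

With this pattern the reader only waits for the download on the first run, unless the cache file is deleted.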

Thanks for all your input,
From: Martin Morgan [mtmorgan at fhcrc.org]
Sent: Wednesday, July 21, 2010 9:20 PM
To: Hervé Pagès
Cc: Bazeley, Peter; Henrik Bengtsson; bioc-devel at stat.math.ethz.ch
Subject: Re: [Bioc-devel] package size

On 07/21/2010 11:16 AM, Hervé Pagès wrote:
> Hi Peter,
> On 07/20/2010 10:57 PM, Bazeley, Peter wrote:
>> Going with Martin's first suggestion, is 37 seconds to download the

My first suggestion was more along the lines of 'use ExpressionSet
rather than a data.frame or matrix'; sorry to have clouded the water.
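
As a rough illustration of that suggestion, a plain matrix can be wrapped in an ExpressionSet with the Biobase constructor. The values, probe names, and sample annotations below are made up, and the block is guarded so that it only builds the object when Biobase is installed.

```r
# Toy expression matrix: rows are features, columns are samples.
# All values and names are invented for illustration.
set.seed(1)
expr_mat <- matrix(
  rnorm(12), nrow = 4,
  dimnames = list(paste0("probe", 1:4), paste0("sample", 1:3))
)

# Sample annotation; row names must match the matrix column names.
pheno <- data.frame(
  platform = c("affy", "illumina", "illumina"),
  row.names = colnames(expr_mat)
)

if (requireNamespace("Biobase", quietly = TRUE)) {
  eset <- Biobase::ExpressionSet(
    assayData = expr_mat,
    phenoData = Biobase::AnnotatedDataFrame(pheno)
  )
  # The expression values come back out with exprs().
  print(dim(Biobase::exprs(eset)))
}
```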

>> data too long/inconvenient for an example in the function
>> documentation? This is for the package's main function, and the second
>> of 2 examples, with the first using a smaller/faster to load dataset.
>> The remaining code in this 2nd example takes under 8 seconds,
>> including the code to access the data in the GEOquery object.
>> Of course, the times will vary. My computer has an Intel Core 2 Duo
>> 2.8 GHz, 4GB of RAM, Windows 7.
> Download times depend more on the quality of your network connection
> than anything else. So for people with a slow internet access, those
> times could be multiplied by 5, or 10, or more...

I agree; there are lots of issues that show up on the mailing list that
trace to inability to reliably connect to sites, due to errors on the
server end, poor connectivity, local firewalls, ... And in our build
system reports, packages regularly show transient internet-related
failures. This makes it difficult for the maintainer (and us!) to know
whether there's a 'real' problem or not.
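
One way example code could cope with such transient failures is to catch the error and fall back to data shipped locally. This is a minimal sketch, not something the build system does: fetch_or_fallback(), flaky_fetch() and local_data() are hypothetical names, and the failing fetch is simulated rather than a real network call.

```r
# Hypothetical helper: try `fetch`; if it signals an error, report it
# and use `fallback` instead.
fetch_or_fallback <- function(fetch, fallback) {
  tryCatch(
    fetch(),
    error = function(e) {
      message("Download failed (", conditionMessage(e),
              "); using local data instead.")
      fallback()
    }
  )
}

# Stand-ins for illustration only. A real fetch might wrap a GEOquery
# download; a real fallback might load a small packaged dataset.
flaky_fetch <- function() stop("cannot open URL")
local_data  <- function() matrix(1:6, nrow = 2)

result <- fetch_or_fallback(flaky_fetch, local_data)
```

The message keeps the failure visible, so a maintainer can still tell a network problem from a bug in the package.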

An experiment data package is additional work, but in exchange you get
reproducible, reliable, and documented research.
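
For a sense of scale, the data side of such a package is small. The sketch below creates the data/ and inst/extdata/ directories described in the quoted thread and saves one xz-compressed .rda file. The package name myExperimentData and the object gseExample are hypothetical, and a real package would also need a DESCRIPTION file and man pages.

```r
# Hypothetical package skeleton, created under a temporary directory.
pkg_dir <- file.path(tempdir(), "myExperimentData")
dir.create(file.path(pkg_dir, "data"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_dir, "inst", "extdata"),
           recursive = TRUE, showWarnings = FALSE)

# Invented stand-in for the real expression data.
gseExample <- data.frame(
  probe = paste0("probe", 1:3),
  value = c(7.1, 8.4, 6.9)
)

# R data objects go in data/, here with xz compression.
rda_path <- file.path(pkg_dir, "data", "gseExample.rda")
save(gseExample, file = rda_path, compress = "xz")

# Check that the file loads back and contains the expected object name.
env <- new.env()
loaded <- load(rda_path, envir = env)
```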


> Cheers,
> H.
>> > sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> i386-pc-mingw32
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>> States.1252
>> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>> [5] LC_TIME=English_United States.1252
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> other attached packages:
>> [1] hgu95av2cdf_2.6.0   affydata_1.11.10    affy_1.26.1
>> QuantCombine_0.99.0 GEOquery_2.12.0
>> [6] RCurl_1.4-2         bitops_1.0-4.1      Biobase_2.8.0
>> loaded via a namespace (and not attached):
>> [1] affyio_1.16.0         preprocessCore_1.10.0 tools_2.11.1
>> ________________________________________
>> From: henrik.bengtsson at gmail.com [henrik.bengtsson at gmail.com] on
>> behalf of Henrik Bengtsson [hb at stat.berkeley.edu]
>> Sent: Tuesday, July 20, 2010 2:02 AM
>> To: Martin Morgan
>> Cc: Bazeley, Peter; bioc-devel at stat.math.ethz.ch
>> Subject: Re: [Bioc-devel] package size
>> Consider also package updates; even if you just do a tiny bug fix,
>> then users have to download all that data again.
>> Martin's suggestion to keep a separate experimental data package is a
>> good option.  It will also make the data available to others to use
>> in their examples (without having to install your main package
>> dependencies), e.g. "competing" methods.
>> /Henrik
>> On Tue, Jul 20, 2010 at 6:31 AM, Martin Morgan <mtmorgan at fhcrc.org>
>> wrote:
>>> Hi Peter,
>>> On 07/19/2010 09:10 PM, Bazeley, Peter wrote:
>>>> Dear List,
>>>> I am creating a package, the purpose of which is to combine data from
>>>> different microarray platforms. I have found an NCBI GEO data series
>>>> with 3 different platforms (1 Affymetrix and 2 Illumina) that works
>>>> well for illustrating my package functions. It would be nice to keep
>>>> this data series as a data object for use in the function examples
>>>> (currently, 4 of 5 functions use this data object in their example
>>>> code) in the documentation, but the xz compressed .rda file
>>>> (consisting of 3 data frames, one for each data set) is about 5MB
>>> Hmm, but if they are expression data, then an ExpressionSet would more
>>> fully represent the data? See library(GEOquery); ?getGEO with the
>>> GSEMatrix option set to TRUE, and
>>>   http://bioconductor.org/packages/2.6/bioc/html/Biobase.html
>>> and the 'An Introduction to Biobase and ExpressionSets' vignette.
>>>> (total package size is 6MB). Is this too big?
>>>> There are 2 alternatives:
>>>> 1) The package includes a function to download datasets using the
>>>> GEOquery package, which could be used to easily re-create the data
>>>> frames included in my .rda file. The only downside is that it takes
>>>> several minutes to download all the data, so it may be inconvenient,
>>>> since this data object is used in example code for the 4 functions.
>>>> 1a) I could have each function example contain code to either a)
>>>> download the data and save it in an .RData image file, or b) load the
>>>> image file saved in a). This way the investigator would only have to
>>>> endure the download once, unless they chose not to save the data.
>>>> 2) I could take, say, the first 1000 genes from each platform. I did
>>>> this, and the combined data only has 19 probes/probesets (they are
>>>> mapped by Accession/UniGene IDs, and the common transcripts are
>>>> extracted). It would be nice to have a larger example, although not
>>>> necessary. Alternatively, I could find a better set of 1000 (or
>>>> however many), so that more than 19 are present.
>>> A third is to create an experiment data package like those at
>>>   http://bioconductor.org/packages/release/ExperimentData.html
>>> that contains the entire data. This way you get a rich and reproducible
>>> example to illustrate your tools. These are really just packages with
>>> data objects in the inst/extdata/ (for CEL and other non-R formats) or
>>> data/ (for R data objects) directories, and man pages describing the
>>> data.
>>> Perhaps there is already an experiment data package that meets your
>>> needs?
>>> Martin
>>>> Thank you for any assistance, Peter Bazeley
>>>> _______________________________________________
>>>> Bioc-devel at stat.math.ethz.ch mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> --
>>> Martin Morgan
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793

Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
