[Bioc-devel] Package size limitation

Henrik Bengtsson hb at stat.berkeley.edu
Fri Aug 1 19:07:43 CEST 2008


Avoid putting large data sets in otherwise small packages (how big is
your package without data?).  Put large example data in a separate
experiment data package which is optional to load.  Try to minimize
the amount of download and the number of dependent packages that other
users/developers need in order to actually use your new method.  That
increases the chances that your method is used elsewhere as well.
Updates will also be faster to install.

If you put together an experiment data package, please consider
distributing the CEL files rather than AffyBatch objects.  The
AffyBatch structure might be obsolete one day, and your experiment
package with it.  This is less likely to happen if you use CEL files -
the lowest common denominator for all data structures/classes.

Now to a trick: If you do want to distribute an AffyBatch object, have
a look at your intensities.  If your chip type is a 3x3 pixel per
probe array, and the Affymetrix image analysis (typically) took the
75% quantile (the 7th ordered pixel), you will actually see only
integer probe signals.  Note that this is not a rounding error; it
just happens "by chance".  If this is the case with your data, you
could create an object holding the signals as integers and not doubles
without losing anything.  That object would be roughly half the size.
I don't think compression algorithms can pick this up.
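
A rough sketch of the idea, written in Python with only the standard
library rather than in R.  The function names and the sample
intensities are made up for illustration and are not part of any
Bioconductor API.  It checks that every signal is a whole number that
fits in 32 bits, and if so stores it as a 4-byte integer instead of an
8-byte double:

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def fits_int32(signals):
    """True if every signal is a whole number that fits in a signed 32-bit int."""
    return all(
        float(x).is_integer() and INT32_MIN <= x <= INT32_MAX
        for x in signals
    )


def pack_signals(signals):
    """Pack as 4-byte ints when that is lossless, else as 8-byte doubles.

    Returns (type_code, payload_bytes); "<" forces the standard sizes.
    """
    n = len(signals)
    if fits_int32(signals):
        return "i", struct.pack("<%di" % n, *(int(x) for x in signals))
    return "d", struct.pack("<%dd" % n, *signals)


def unpack_signals(code, payload):
    """Recover the signals as floats from a (type_code, payload) pair."""
    n = len(payload) // struct.calcsize("<" + code)
    return [float(x) for x in struct.unpack("<%d%s" % (n, code), payload)]


# Hypothetical probe intensities that happen to be whole numbers.
signals = [112.0, 9876.0, 46000.0, 305.0]
code, payload = pack_signals(signals)
as_doubles = struct.pack("<%dd" % len(signals), *signals)

print(code, len(payload), len(as_doubles))  # i 16 32
```

If even one signal has a fractional part, pack_signals() falls back to
doubles, so nothing is lost either way.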



On Fri, Aug 1, 2008 at 9:27 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> Tobias,
> To elaborate on Kasper's well-stated points, the Bioconductor project has
> separate repositories for software, experiment data, and annotation
> metadata.
> BioC software:  http://bioconductor.org/packages/release/bioc/
> BioC experiment data:
>  http://bioconductor.org/packages/release/data/experiment/
> BioC annotation metadata:
>  http://bioconductor.org/packages/release/data/annotation/
> The main criterion for an experiment data package is that it should be novel
> in some way to make it useful for other software developers to utilize it
> when illustrating concepts in their software package. More often than not,
> including a subset of your data in your software package will suffice. One
> common misconception by new package developers is that examples in the man
> pages and the vignettes need to be 100% "real". The main goal of vignettes
> and man pages is to illustrate concepts rather than reveal scientific
> findings. You are encouraged to provide references to scientific papers that
> demonstrate the latter within your software's documentation, but typically
> end-users want your package to have a small storage footprint on their
> machine and have the examples run in a short time frame.
> If you are not sure how to handle your particular situation, when the
> Bioconductor team previews and reviews your package, we will help you
> through any tricky decisions. Good luck with your package submission and
> thanks for your interest in the Bioconductor project!
> Patrick
> Kasper Daniel Hansen wrote:
>> I believe the right way to do this is the following
>> 1) Make an AffyBatch containing only a few probesets to use for your
>> examples. Include this in the package
>> 2) Submit a package containing your experiment
>> 3) Use the package above in a vignette.
>> But I am sure someone else will chime in.
>> Kasper
>> On Jul 31, 2008, at 8:22 PM, Tobias Guennel wrote:
>>> Dear all,
>>> I'm writing my first R package that I want to submit to Bioconductor
>>> implementing an algorithm for detecting differentially expressed genes
>>> that is not available in R yet.
>>> The example data set is an AffyBatch object that contains 6 read in *.CEL
>>> files, which were used in the paper introducing the algorithm and are
>>> most suitable to show the application of the package. Unfortunately,
>>> including the data example increases the size of the compressed package
>>> to 6MB.
>>> Is there a way to drastically reduce the size of an AffyBatch object or
>>> are there other options that can reduce the file size?
>>> Thanks for your help,
>>> ------------------------------
>>> Tobias Guennel
>>> Research Assistant
>>> Department of Biostatistics
>>> Virginia Commonwealth University
>>> Theater Row 3035F
>>> 804-828-2527
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
