[Bioc-devel] BiocCheck - warning: files are over 5MB

Pariksheet Nanda pariksheet.nanda at uconn.edu
Sat Mar 10 15:03:27 CET 2018


Hi Claris,

On Sat, Mar 10, 2018 at 2:49 AM, Claris Baby via Bioc-devel <
bioc-devel at r-project.org> wrote:
>
> [1] "The following files are over 5MB in size:
> 'dataset/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa'....."
> This as well as other data like .gff files, that are being used
> for the reference based assembly are all much more than 5mb.
> But the total package size is less than 500mb.

Assuming that's not a typo, 500 MB is very large and inappropriate for a
package.  It's generally good practice to separate code and data where
possible, not least because large data files bloat the version control
history.  If your package size really is close to 500 MB, you should think
about hosting the data outside the package and accessing it with something
like AnnotationHub or BiocFileCache.  I've not yet had to deal with this
particular problem myself, so if you confirm that the package is indeed that
big, others on the mailing list might have better and more specific
suggestions.


> Is it essential that each file within the package is less than
> 5mb. If so, it would be very kind if anyone could suggest how
> to reduce the size of the genomic data files.

Can you gzip compress those data files?  Text-based files usually compress
quite well, and many functions like import() from rtracklayer will
automagically decompress them, so you might not even need to change much in
your code.
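
As a rough sketch of what I mean, using only base R and a made-up toy
sequence rather than your real chromosome file (the file names and contents
below are placeholders, not anything from your package):

```r
# Toy stand-in for a FASTA file: one header plus 200 lines of 60 bases.
fasta <- tempfile(fileext = ".fa")
writeLines(c(">chrI_toy", rep(strrep("ACGT", 15), 200)), fasta)

# Compress by writing the same lines through a gzfile() connection.
gz_path <- paste0(fasta, ".gz")
con <- gzfile(gz_path, open = "wt")
writeLines(readLines(fasta), con)
close(con)

# readLines() opens the path with file(), which detects the compression
# from the file's magic number, so there is no explicit decompression step.
roundtrip <- readLines(gz_path)

identical(roundtrip, readLines(fasta))   # content survives the round trip
file.size(gz_path) < file.size(fasta)    # and the file on disk is smaller
```

The reading side of your code only sees a different file name, which is the
point: the functions that consume the file do the decompression for you.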

.gz isn't the most disk-efficient compression format out there; .bz2
usually compresses better, and .xz typically yields even better lossless
compression.  R's save() uses gzip by default and accepts bzip2 or xz
through its compress argument.  But for the cross-platform compatibility
that Bioconductor strives for, .gz might be best to try first.
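
If you want to see how the formats compare on your own files, something
along these lines should work with base R connections.  This is only a
sketch: the random sequence is a placeholder for real data, write_with() is a
helper I made up for the illustration, and the actual ratios will depend on
the content of your files.

```r
set.seed(1)
# Toy stand-in for a genomic text file: 2,000 lines of 60 random bases.
bases <- c("A", "C", "G", "T")
lines <- replicate(2000, paste(sample(bases, 60, replace = TRUE), collapse = ""))

# Hypothetical helper: write `lines` through the given connection constructor
# and return the path of the file that was written.
write_with <- function(open_con, ext) {
  path <- tempfile(fileext = ext)
  con <- open_con(path, open = "wt")
  on.exit(close(con))
  writeLines(lines, con)
  path
}

paths <- c(
  none = write_with(file,   ".fa"),
  gz   = write_with(gzfile, ".fa.gz"),
  bz2  = write_with(bzfile, ".fa.bz2"),
  xz   = write_with(xzfile, ".fa.xz")
)

# Size on disk for each format, in bytes.
sizes <- setNames(file.size(paths), names(paths))
print(sizes)

# Every variant reads back identically without an explicit decompression step.
ok <- vapply(paths, function(p) identical(readLines(p), lines), logical(1))
print(ok)
```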


> Claris Baby

Pariksheet
