[Bioc-devel] BiocCheck - warning: files are over 5MB

Martin Morgan martin.morgan at roswellpark.org
Sat Mar 10 22:09:58 CET 2018



On 03/10/2018 09:03 AM, Pariksheet Nanda wrote:
> Hi Claris,
> 
> On Sat, Mar 10, 2018 at 2:49 AM, Claris Baby via Bioc-devel <
> bioc-devel at r-project.org> wrote:
>>
>> [1] "The following files are over 5MB in size:
>> 'dataset/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa'....."
>> This file, as well as other data such as the .gff files used for
>> the reference-based assembly, is much larger than 5 MB.
>> But the total package size is less than 500 MB.
> 
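To see which files are involved before deciding what to move or compress, a small script can list them. This is only a sketch in Python, not what BiocCheck itself runs: the function name files_over_limit is made up for illustration, and the cutoff of 5 * 10**6 bytes is an assumption about how "5MB" is counted.

```python
import os

# Assumption: "5MB" is read as 5 * 10**6 bytes; BiocCheck's exact cutoff may differ.
LIMIT_BYTES = 5 * 10**6


def files_over_limit(root, limit=LIMIT_BYTES):
    """Return (relative_path, size_in_bytes) for files under root larger than limit."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                hits.append((os.path.relpath(path, root), size))
    # Largest files first, since those are the ones worth dealing with first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Calling files_over_limit("path/to/package") on a package source directory (the path here is a placeholder) returns the offending files with their sizes in bytes.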
> Assuming that's not a typo, 500 MB is very large and inappropriate for a
> package.  It's generally good practice to separate code and data where
> possible, not least because large data files bloat the version control
> history.  If your package size is close to 500 MB, you should think about
> stashing the data and accessing it with something like AnnotationHub or
> BiocFileCache

Yes, large files should be made available by a package that uses 
AnnotationHub or ExperimentHub for the resources. Also, it's often 
possible to re-use existing resources and, in a vignette, to 
_illustrate_ package functionality rather than redo a complete 'real' 
analysis.

See

http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/CreateAnAnnotationPackage.html

Martin

> (some others on the mailing list might have better and more specific
> suggestions as I've not yet had to deal with this particular problem, if
> you confirm that the package is indeed that big).
> 
> 
>> Is it essential that each file within the package is less than
>> 5 MB?  If so, it would be very kind if anyone could suggest how
>> to reduce the size of the genomic data files.
> 
> Can you gzip compress those data files?  Text-based files usually compress
> quite well, and many functions, such as import() from rtracklayer, will
> automagically decompress them, so you might not even need to change much in
> your code.
> 
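The gzip suggestion above can be tried outside R as well. The following is a minimal sketch in Python using only the standard library; the helper names gzip_file and open_maybe_gzip are hypothetical, and the reader simply checks for the two gzip magic bytes, which is one common way to get the "transparent decompression" behaviour described above.

```python
import gzip
import shutil


def gzip_file(src, dst=None):
    """Write a gzip-compressed copy of src and return the path of the copy."""
    dst = dst or src + ".gz"
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def open_maybe_gzip(path):
    """Open a text file for reading, decompressing it if it is gzip-compressed."""
    # gzip streams start with the magic bytes 0x1f 0x8b.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")
```

Code that reads through open_maybe_gzip() gets the same text whether it is handed the original file or the .gz copy, so the uncompressed files can then be dropped from the package.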
> .gz isn't the most disk-efficient compression format out there; .bz2
> often compresses better, and .xz typically yields even better lossless
> compression (R's save() can write either via its compress argument,
> although its default is gzip).  But for the cross-platform compatibility
> that Bioconductor strives for, .gz might be the best one to try first.
> 
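How much the formats differ depends on the data, so it may be worth measuring on the actual files. A rough sketch in Python with the standard gzip, bz2 and lzma modules (lzma.compress writes the .xz container by default); the function name compressed_sizes is made up for illustration, and the whole file is held in memory, which is fine for a single chromosome but is an assumption about file size.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data):
    """Return the size in bytes of data as-is and after gzip, bzip2 and xz compression."""
    return {
        "raw": len(data),
        "gz": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
        "xz": len(lzma.compress(data)),
    }
```

For one file, this would be called as compressed_sizes(open(path, "rb").read()), with path pointing at the FASTA or GFF file in question.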
> 
>> Claris Baby
> 
> Pariksheet

