[Bioc-devel] BiocCheck - warning: files are over 5MB
Martin Morgan
martin.morgan at roswellpark.org
Sat Mar 10 22:09:58 CET 2018
On 03/10/2018 09:03 AM, Pariksheet Nanda wrote:
> Hi Claris,
>
> On Sat, Mar 10, 2018 at 2:49 AM, Claris Baby via Bioc-devel <
> bioc-devel at r-project.org> wrote:
>>
>> [1] "The following files are over 5MB in size:
>> 'dataset/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa'....."
>> This file, as well as other data such as the .gff files used for the
>> reference-based assembly, is much larger than 5 MB.
>> However, the total package size is less than 500 MB.
>
> Assuming that's not a typo, 500 MB is very large and inappropriate for a
> package. It's generally good practice to separate code and data where
> possible, not least because large data files bloat version control. If your
> package size is close to 500 MB, you should think about stashing the data
> and accessing it using something like AnnotationHub or BiocFileCache
Yes, large files should be made available by a package that uses
AnnotationHub or ExperimentHub for the resources. Also, it's often
possible to re-use existing resources and, in a vignette, to
_illustrate_ package functionality rather than redo a complete 'real'
analysis.
See
http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/CreateAnAnnotationPackage.html
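The idea behind keeping data out of the package is to fetch it on first use and keep a local copy. AnnotationHub, ExperimentHub and BiocFileCache do this robustly; the following is only a base-R sketch of the pattern, not the API of any of those packages. The helper name `fetch_cached` and the cache name "mypkg" are hypothetical.

```r
# Hypothetical helper (not part of any Bioconductor package): download a
# file into a per-user cache directory on first use, then reuse that copy.
# "mypkg" is a placeholder; tools::R_user_dir() requires R >= 4.0.
fetch_cached <- function(url,
                         cache_dir = tools::R_user_dir("mypkg", which = "cache")) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    dest <- file.path(cache_dir, basename(url))
    if (!file.exists(dest)) {
        # mode = "wb" keeps binary files (e.g. .gz) intact on Windows
        utils::download.file(url, dest, mode = "wb", quiet = TRUE)
    }
    dest
}
```

A vignette could then call `fetch_cached()` with the URL of the hosted file and pass the returned path to whatever reader it already uses, so the large file never has to ship inside the package.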
Martin
> (some others on the mailing list might have better and more specific
> suggestions as I've not yet had to deal with this particular problem, if
> you confirm that the package is indeed that big).
>
>
>> Is it essential that each file within the package is less than
>> 5 MB? If so, it would be very kind if anyone could suggest how
>> to reduce the size of the genomic data files.
>
> Can you gzip compress those data files? Text-based files usually compress
> quite well and many functions like import() from rtracklayer will
> automagically decompress them so you might not even need to change much in
> your code.
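A small base-R sketch of the gzip suggestion above, assuming a toy FASTA-like file as a stand-in for the real genome data. It uses only base R connections; rtracklayer is not needed for the illustration.

```r
# Write a small FASTA-like text file (a stand-in for a real genome file).
set.seed(1)
fa <- tempfile(fileext = ".fa")
seq_lines <- replicate(
    2000,
    paste(sample(c("A", "C", "G", "T"), 60, replace = TRUE), collapse = "")
)
writeLines(c(">chrI", seq_lines), fa)

# Compress it by writing the same lines through a gzfile() connection.
gz <- paste0(fa, ".gz")
con <- gzfile(gz, "w")
writeLines(readLines(fa), con)
close(con)

# Reading needs no explicit decompression step: file connections opened
# for reading detect gzip-compressed input on their own.
roundtrip <- readLines(gz)

# Sizes in bytes: original first, compressed second.
sizes <- file.size(c(fa, gz))
print(sizes)
```

The compressed copy reads back unchanged, so code that reads the file by path may only need the new file name.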
>
> .gz isn't the most disk-efficient compression format out there; .bz2
> often compresses better, and .xz typically yields even better lossless
> compression (R's save() supports all three through its compress argument,
> with gzip as the default). For the cross-platform compatibility that
> Bioconductor strives for, though, .gz might be best to try first.
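To see how the formats compare on a given data set, one option is to measure rather than assume. The sketch below saves the same object with each of the compression types that save() accepts; the random DNA string is made-up illustration data, and the ranking may differ on real files.

```r
# Made-up data: a random 100,000-base DNA string.
set.seed(1)
dna <- paste(sample(c("A", "C", "G", "T"), 1e5, replace = TRUE), collapse = "")

# Save the same object once per compression type supported by save().
files <- character(0)
for (m in c("gzip", "bzip2", "xz")) {
    files[m] <- tempfile(fileext = ".RData")
    save(dna, file = files[m], compress = m)
}

# Sizes in bytes, named by compression type.
sizes <- file.size(files)
names(sizes) <- names(files)
print(sizes)

# load() detects the compression type on its own.
env <- new.env()
load(files[["xz"]], envir = env)
```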
>
>
>> Claris Baby
>
> Pariksheet