[Rd] warning for inefficiently compressed datasets

Hervé Pagès hpages at fhcrc.org
Wed Dec 14 01:09:00 CET 2011


Hi Uwe,

On 11-12-07 12:34 AM, Uwe Ligges wrote:
>
>
> On 06.12.2011 23:28, Hervé Pagès wrote:
>> Hi,
>>
>> Recently added to doc/NEWS.Rd:
>>
>> 'R CMD check' now gives a warning rather than a note if it finds
>> inefficiently compressed datasets. With 'bzip2' and 'xz' compression
>> having been available since R 2.10.0, there is no excuse for not
>> using them.
>>
>> Why isn't a note enough for this?
>>
>> Generally speaking, warnings are for things that are dangerous,
>> or unsafe, or unportable, or for anything that could potentially
>> cause trouble. I don't see how using gzip instead of bzip2 or xz
>> could fall into that category (and BTW gzip is the default for
>> save() and for 'R CMD build' resave-data feature).
>>
>> The problem is that bzip2 and xz compressions are slower and also
>> require more memory than gzip. Bioconductor has big data packages
>> and sometimes it makes sense to use gzip and not bzip2 or xz. For
>> example, when loading Human chromosome 1 from disk, bzip2 and xz
>> are 7 and 3.4 times slower than gzip, respectively:
>>
>> > system.time(load("chr1-gzip.rda"))
>>    user  system elapsed
>>   1.210   0.180   1.384
>>
>> > system.time(load("chr1-bzip2.rda"))
>>    user  system elapsed
>>   9.500   0.160   9.674
>>
>> > system.time(load("chr1-xz.rda"))
>>    user  system elapsed
>>    4.46    0.20    4.69
>>
>> hpages@latitude:~/testing$ ls -lhtr chr1-*.rda
>> -rw-r--r-- 1 hpages hpages 61M 2011-12-06 12:13 chr1-gzip.rda
>> -rw-r--r-- 1 hpages hpages 55M 2011-12-06 12:15 chr1-bzip2.rda
>> -rw-r--r-- 1 hpages hpages 49M 2011-12-06 12:25 chr1-xz.rda
>>
>> This is with R-2.14.0 on a 64-bit Ubuntu laptop with 8GB of RAM.
>>
>> The size on disk doesn't really matter, and neither does the fact
>> that the source tarball for the full Human genome ends up being 20%
>> bigger with gzip than with xz: the 20% extra time it takes to
>> download it (which needs to be done only once) is more than
>> compensated for by the fact that most analyses run faster, e.g. in
>> 40-45 sec. instead of more than 2 minutes (for many short analyses,
>> loading the chromosomes into memory is the bottleneck).
>
>
> Oh, from a European side this 20% extra time may be an hour when
> downloading from the BioC master rather than a mirror.

I guess that's why we have mirrors.
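
For anyone who wants to weigh file size against load time on their
own data, here is a minimal sketch. It is not the code behind the chr1
numbers quoted above: the object 'x' is synthetic stand-in data, and
its size and the temporary file names are assumptions made purely for
illustration. It only uses save(), load(), system.time() and
file.info() from base R.

```r
# Sketch: compare file size and load time for the three compression
# types supported by save(). 'x' is synthetic stand-in data (an
# assumption), not a real chromosome.
set.seed(1)
x <- paste(sample(c("A", "C", "G", "T"), 2e6, replace = TRUE),
           collapse = "")

types <- c("gzip", "bzip2", "xz")
res <- do.call(rbind, lapply(types, function(type) {
    f <- tempfile(fileext = ".rda")
    on.exit(unlink(f))                      # clean up the temp file
    save(x, file = f, compress = type)
    # Load into a scratch environment so 'x' is not overwritten.
    elapsed <- system.time(load(f, envir = new.env()))[["elapsed"]]
    data.frame(compress = type,
               bytes    = file.info(f)$size,
               load_sec = elapsed)
}))
print(res)
```

The timings depend heavily on the machine and on the data, so the
resulting table is only meaningful for the object it was run on.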

> And space and traffic are an issue for CRAN.
>
>
>
>> Is there a way to turn this warning off? If not, could an option be
>> added to 'R CMD check' to turn this warning off? Something along the
>> lines of the --no-resave-data option for 'R CMD build'.
>
>
> The manual tells us:
>
> "The following environment variables can be used to customize the
> operation of check: a convenient place to set these is the file
> ‘~/.R/check.Renviron’.

Ah I see, this is in the "R Internals" manual. Good to know.

>
> [...]
>
> _R_CHECK_COMPACT_DATA2_
>
> If true, check data for ascii and uncompressed saves, and also check if
> using bzip2 or xz compression would be significantly better. Implies
> _R_CHECK_COMPACT_DATA_ is true. Default: true."

Not with current R-devel: _R_CHECK_COMPACT_DATA2_ is gone (it has been
merged into _R_CHECK_COMPACT_DATA_).
I guess we could always set _R_CHECK_COMPACT_DATA_ to false to turn
this off, but that would also turn off the checks for ascii and
uncompressed saves...
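
To make that concrete, a minimal sketch of switching the check off for
a single run started from an R session. It assumes the variable
behaves as described above for current R-devel; the tarball name
'mypkg_1.0.tar.gz' is hypothetical, and the check is only launched if
that file exists.

```r
# Sketch: disable the compact-data check for child processes of this
# R session. As noted above, this also disables the checks for ascii
# and uncompressed saves.
Sys.setenv("_R_CHECK_COMPACT_DATA_" = "FALSE")

# 'mypkg_1.0.tar.gz' is a hypothetical tarball name.
tarball <- "mypkg_1.0.tar.gz"
if (file.exists(tarball)) {
    system2(file.path(R.home("bin"), "R"),
            c("CMD", "check", tarball))
}
```

The persistent alternative would be a line
_R_CHECK_COMPACT_DATA_=FALSE in the '~/.R/check.Renviron' file
mentioned in the manual.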

Thanks,
H.

>
>
> Uwe
>
>
>
>>
>> Thanks,
>> H.
>>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


