[Rd] Package compression benchmarks for zstd vs gzip

Heather Turner ht @end|ng |rom he@therturner@net
Wed Jan 15 10:21:59 CET 2025


With the changes to add zstd support yesterday, the build of R-devel is failing when zstd is not present, even though the docs say that zstd is optional.

The error occurs while building the datasets package; see e.g. https://github.com/r-devel/r-svn/actions/runs/12760693086/job/35566530112

Best wishes,

Heather

On Mon, Jan 13, 2025, at 1:26 AM, Simon Urbanek wrote:
> I think the first step would have to be to add zstd support to R. zstd 
> is a bit controversial (as shown by the community blowback of the 
> changes you mentioned) and their build system (calling it that is being 
> very generous) is a mess, so it would require a bit of testing, but it is 
> doable.
>
> That said, assuming the above is solved, we have been debating the 
> change of compression at CRAN in general for a bit, but the assumptions 
> about the file names are built into today’s tools so there would 
> certainly be some fall-out - not just in R, but also in the ecosystems 
> around it. As you pointed out, possibly the safest place to start is 
> with binaries, since we have tighter control of those and they are used 
> in fewer places.
>
> Personally, I think the higher priority is signing, so as we address 
> that we may just include the compression change with it since it will 
> require some tool changes anyway. I was thinking of using xz as that is 
> more stable, already supported and less controversial, but I don’t 
> think the choice really matters - it just has to be a compression which 
> R supports (zstd and xz have different benefits, so it’s always a 
> trade-off without a clear winner).
>
> Cheers,
> Simon
>
>
>> On 11 Jan 2025, at 12:16, Jeroen Ooms <jeroenooms using gmail.com> wrote:
>> 
>> Many distros and browsers these days use zstd as the preferred
>> compression method. For example if you unpack a .deb or .rpm file on
>> Debian or Fedora there is a zstd archive inside. It is claimed that zstd
>> offers improved compression over gzip, but (unlike lzma) it has
>> comparable decompression speed. Maybe it is interesting to get an
>> estimate of how much R packages would benefit from zstd.
>> 
>> Testing this for source packages and macOS binary packages is easy,
>> as we can gunzip and recompress tar.gz files without having to extract
>> the tarball itself:
>> 
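>>  # Record gzip vs zstd -19 sizes (in bytes) for every *gz file here
>>  OUTPUT="sizes.txt"
>>  echo "FILE GZIP ZSTD" > "$OUTPUT"
>>  for x in *gz; do
>>    FILE=$(basename "$x")
>>    GZIP=$(wc -c "$x" | awk '{print $1}')
>>    # Recompress the tar stream without extracting the tarball
>>    ZSTD=$(gunzip -c "$x" | zstd -19 | wc -c | awk '{print $1}')
>>    echo "$FILE $GZIP $ZSTD" | tee -a "$OUTPUT"
>>  done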
>> 
>> Attached are results of running this script on the 500 most downloaded
>> CRAN packages. It shows about 16% size reduction for sources, and 19%
>> for binaries.
>> 
>> Zstd is BSD licensed C code that can easily be embedded in any project.
>> <sources.txt><binaries.txt>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
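For anyone wanting to reproduce the reduction figures quoted above from a sizes.txt produced by that script, a minimal sketch follows. The function name summarize_sizes is hypothetical, and it assumes the percentage is computed over total bytes across all packages; the original post does not say whether the 16% and 19% are totals or per-package averages.

```shell
# summarize_sizes FILE
# Hypothetical helper: reads a "FILE GZIP ZSTD" table (header on line 1)
# and prints total gzip bytes, total zstd bytes and the overall reduction.
# Assumption: reduction is measured on the summed sizes, not averaged per package.
summarize_sizes() {
  awk 'NR > 1 { gz += $2; zs += $3 }   # skip the header row, accumulate both columns
       END {
         if (gz == 0) { print "no data"; exit 1 }
         printf "gzip=%.0f zstd=%.0f reduction=%.1f%%\n", gz, zs, 100 * (gz - zs) / gz
       }' "$1"
}
```

Called as `summarize_sizes sizes.txt`, it prints a single summary line. File names containing spaces would shift the columns, which should not be an issue for CRAN tarball names.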