[Rd] Package compression benchmarks for zstd vs gzip

Simon Urbanek <simon.urbanek using R-project.org>
Wed Jan 15 13:30:11 CET 2025


Heather,

thanks, this is now fixed. The datasets package was passing a numeric value for compress= instead of the compression name, so it picked zstd instead of gzip. The switch order is now kept the same as before.

Cheers,
Simon


> On Jan 15, 2025, at 10:21 PM, Heather Turner <ht using heatherturner.net> wrote:
> 
> With the changes to add zstd support yesterday, the build of R-devel is failing when zstd is not present, even though the docs say that zstd is optional.
> 
> The error comes in building the datasets package, see e.g. https://github.com/r-devel/r-svn/actions/runs/12760693086/job/35566530112.
> 
> Best wishes,
> 
> Heather
> 
> On Mon, Jan 13, 2025, at 1:26 AM, Simon Urbanek wrote:
>> I think the first step would have to be to add zstd support to R. zstd 
>> is a bit controversial (as shown by the community blowback to the 
>> changes you mentioned) and their build system (calling it that is being 
>> very generous) is a mess, so it would require a bit of testing, but it is 
>> doable.
>> 
>> That said, assuming the above is solved, we have been debating the 
>> change of compression at CRAN in general for a bit, but the assumptions 
>> about the file names are built into today’s tools, so there would 
>> certainly be some fallout - not just in R, but also in the ecosystems around 
>> it. As you pointed out, possibly the safest place to start is 
>> binaries, since we have tighter control of those and they are used in 
>> fewer places.
>> 
>> Personally, I think the higher priority is signing, so as we address 
>> that we may just include the compression change with it since it will 
>> require some tool changes anyway. I was thinking of using xz as that is 
>> more stable, already supported and less controversial, but I don’t 
>> think the choice really matters - it just has to be a compression format that 
>> R supports (zstd and xz have different benefits, so it’s always a 
>> trade-off without a clear winner).
>> 
>> Cheers,
>> Simon
>> 
>> 
>>> On 11 Jan 2025, at 12:16, Jeroen Ooms <jeroenooms using gmail.com> wrote:
>>> 
>>> Many distros and browsers these days use zstd as the preferred
>>> compression method. For example if you unpack a .deb or .rpm file on
>>> Debian or Fedora there is a zstd archive inside. It is claimed that zstd
>>> offers improved compression over gzip, but (unlike lzma) it has
>>> comparable decompression speed. It might be interesting to get an
>>> estimate of how much R packages would benefit from zstd.
>>> 
>>> Testing this for source packages and macOS binary packages is easy,
>>> as we can gunzip and recompress tar.gz files without having to extract
>>> the tarball itself:
>>> 
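>>> OUTPUT="sizes.txt"
>>> echo "FILE GZIP ZSTD" > "$OUTPUT"
>>> for x in *gz; do
>>>   FILE=$(basename "$x")
>>>   # size of the existing gzip archive in bytes
>>>   GZIP=$(wc -c < "$x" | tr -d ' ')
>>>   # size after decompressing and recompressing with zstd level 19
>>>   ZSTD=$(gunzip -c "$x" | zstd -19 | wc -c | tr -d ' ')
>>>   echo "$FILE $GZIP $ZSTD" | tee -a "$OUTPUT"
>>> done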
>>> 
>>> Attached are results of running this script on the 500 most downloaded
>>> CRAN packages. It shows about 16% size reduction for sources, and 19%
>>> for binaries.
>>> 
>>> Zstd is BSD licensed C code that can easily be embedded in any project.
>>> <sources.txt><binaries.txt>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
> 
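
The script in Jeroen's message writes one "FILE GZIP ZSTD" row per package to sizes.txt. As a rough sketch of how an overall reduction figure could be derived from such a file, the helper below sums both columns and reports the relative saving. The function name summarise_sizes is hypothetical, and whether the percentages quoted in the thread were computed over totals or as a per-package average is not stated, so this is only one plausible reading. It also assumes file names contain no spaces.

```shell
# Hypothetical helper: summarise a sizes.txt produced by the loop above.
# Expects a "FILE GZIP ZSTD" header line followed by one row per archive.
summarise_sizes() {
  awk 'NR > 1 { gz += $2; zs += $3 }   # skip the header, accumulate byte totals
       END {
         if (gz > 0)
           printf "gzip=%d zstd=%d reduction=%.1f%%\n", gz, zs, 100 * (gz - zs) / gz
       }' "$1"
}
```

Calling summarise_sizes sizes.txt after the loop has finished prints the two byte totals and the percentage saved by zstd relative to gzip.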
