[Rd] Package compression benchmarks for zstd vs gzip
Simon Urbanek
@|mon@urb@nek @end|ng |rom R-project@org
Wed Jan 15 13:30:11 CET 2025
Heather,
thanks, now fixed (datasets was using a numeric value for compress= instead of the compression name, so it picked zstd instead of gzip; the switch order is now kept the same).
Cheers,
Simon
> On Jan 15, 2025, at 10:21 PM, Heather Turner <ht using heatherturner.net> wrote:
>
> With the changes to add zstd support yesterday, the build of R-devel is failing when zstd is not present, even though the docs say that zstd is optional.
>
> The error comes in building the datasets package, see e.g. https://github.com/r-devel/r-svn/actions/runs/12760693086/job/35566530112.
>
> Best wishes,
>
> Heather
>
> On Mon, Jan 13, 2025, at 1:26 AM, Simon Urbanek wrote:
>> I think the first step would have to be to add zstd support to R. zstd
>> is a bit controversial (as shown by the community blowback of the
>> changes you mentioned) and their build system (calling it that is being
>> very generous) is a mess so it would require a bit of testing, but it is
>> doable.
>>
>> That said, assuming the above is solved, we have been debating the
>> change of compression at CRAN in general for a bit, but the assumptions
>> about the file names are built into today’s tools so there would
>> certainly be some fall-out - not just in R, but also the ecosystems around
>> it. As you pointed out, possibly the safest place to start is
>> binaries, since we have tighter control of those and they are used in
>> fewer places.
>>
>> Personally, I think the higher priority is signing, so as we address
>> that we may just include the compression change with it since it will
>> require some tool changes anyway. I was thinking of using xz as that is
>> more stable, already supported and less controversial, but I don’t
>> think the choice really matters - it just has to be a compression which
>> R supports (zstd and xz have different benefits, so it’s always a
>> trade-off without a clear winner).
>>
>> Cheers,
>> Simon
>>
>>
>>> On 11 Jan 2025, at 12:16, Jeroen Ooms <jeroenooms using gmail.com> wrote:
>>>
>>> Many distros and browsers these days use zstd as the preferred
>>> compression method. For example if you unpack a .deb or .rpm file on
>>> Debian or Fedora there is a zstd archive inside. It is claimed that zstd
>>> offers improved compression over gzip, but (unlike lzma) it has
>>> comparable decompression speed. Maybe it is interesting to get an
>>> estimate of how much R packages would benefit from zstd.
>>>
>>> Testing this for source packages and macOS binary packages is easy
>>> as we can gunzip and recompress tar.gz files without having to extract
>>> the tarball itself:
>>>
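>>> OUTPUT="sizes.txt"
>>> echo "FILE GZIP ZSTD" > "$OUTPUT"
>>> for x in *gz; do
>>>   FILE=$(basename "$x")
>>>   # size in bytes of the existing gzip tarball
>>>   GZIP=$(wc -c < "$x" | tr -d ' ')
>>>   # size in bytes after recompressing the same tar stream with zstd
>>>   ZSTD=$(gunzip -c "$x" | zstd -19 | wc -c | tr -d ' ')
>>>   echo "$FILE $GZIP $ZSTD" | tee -a "$OUTPUT"
>>> done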
>>>
>>> Attached are results of running this script on the 500 most downloaded
>>> CRAN packages. It shows about 16% size reduction for sources, and 19%
>>> for binaries.
>>>
>>> Zstd is BSD licensed C code that can easily be embedded in any project.
>>> <sources.txt><binaries.txt>
>>> ______________________________________________
>>> R-devel using r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
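The size reductions quoted in the thread (about 16% for sources, 19% for binaries) can be recomputed from a table in the "FILE GZIP ZSTD" format that the benchmark loop writes. The following is a minimal sketch, not the method used in the thread: summarize_sizes is a hypothetical helper name, it assumes file names contain no spaces, and it reports the aggregate saving over all files. Whether the figures in the thread are aggregate or per-file averages is not stated.

```shell
# Summarize a "FILE GZIP ZSTD" table as written by the benchmark loop.
# summarize_sizes is a hypothetical helper name, not part of any tool.
summarize_sizes() {
    # Skip the header row, sum both size columns, and report the
    # aggregate saving of zstd relative to gzip as a percentage.
    awk 'NR > 1 && NF >= 3 { gz += $2; zs += $3; n++ }
         END {
             if (n == 0 || gz == 0) { print "no data"; exit 1 }
             printf "%d files: gzip=%.0f zstd=%.0f reduction=%.1f%%\n",
                    n, gz, zs, 100 * (gz - zs) / gz
         }' "$1"
}
```

Calling summarize_sizes sizes.txt after the loop has finished prints a single summary line. An aggregate figure weights large packages more heavily than a mean of per-file ratios would, so the two can differ noticeably.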