[Rd] Package compression benchmarks for zstd vs gzip
Heather Turner
ht @end|ng |rom he@therturner@net
Wed Jan 15 10:21:59 CET 2025
With the changes to add zstd support yesterday, the build of R-devel is failing when zstd is not present, even though the docs say that zstd is optional.
The error comes in building the datasets package, see e.g. https://github.com/r-devel/r-svn/actions/runs/12760693086/job/35566530112.
Best wishes,
Heather
On Mon, Jan 13, 2025, at 1:26 AM, Simon Urbanek wrote:
> I think the first step would have to be to add zstd support to R. zstd
> is a bit controversial (as shown by the community blowback of the
> changes you mentioned) and their build system (calling it that is being
> very generous) is a mess so it would require a bit of testing, but it is
> doable.
>
> That said, assuming the above is solved, we have been debating the
> change of compression at CRAN in general for a bit, but the assumptions
> about the file names are built into today’s tools so there would
> certainly be some fall-out - not just in R, but also the ecosystems around
> it. As you pointed out, possibly the safest place to start is
> binaries, since we have tighter control of those and they are used in
> fewer places.
>
> Personally, I think the higher priority is signing, so as we address
> that we may just include the compression change with it since it will
> require some tool changes anyway. I was thinking of using xz as that is
> more stable, already supported and less controversial, but I don’t
> think the choice really matters - it just has to be a compression which
> R supports (zstd and xz have different benefits, so it’s always a
> trade-off without a clear winner).
>
> Cheers,
> Simon
>
>
>> On 11 Jan 2025, at 12:16, Jeroen Ooms <jeroenooms using gmail.com> wrote:
>>
>> Many distros and browsers these days use zstd as the preferred
>> compression method. For example if you unpack a .deb or .rpm file on
>> Debian or Fedora there is zstd archive inside. It is claimed that zstd
>> offers improved compression over gzip, but (unlike lzma) it has
>> comparable decompression speed. Maybe it is interesting to get an
>> estimate of how much R packages would benefit from zstd.
>>
>> Testing this for source packages and macOS binary packages is easy,
>> as we can gunzip and recompress tar.gz files without having to extract
>> the tarball itself:
>>
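>> OUTPUT="sizes.txt"
>> echo "FILE GZIP ZSTD" > "$OUTPUT"
>> for x in *gz; do
>>   FILE=$(basename "$x")
>>   # size of the existing gzip tarball in bytes
>>   GZIP=$(wc -c < "$x" | tr -d ' ')
>>   # recompress the tar stream with zstd at level 19 and count the bytes
>>   ZSTD=$(gunzip -c "$x" | zstd -19 | wc -c | tr -d ' ')
>>   echo "$FILE $GZIP $ZSTD" | tee -a "$OUTPUT"
>> done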
>>
>> Attached are results of running this script on the 500 most downloaded
>> CRAN packages. It shows about 16% size reduction for sources, and 19%
>> for binaries.
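
The overall reduction can be recomputed from a sizes.txt file of the shape written by the loop above (a header line, then FILE GZIP ZSTD columns). The following is only a minimal sketch: the function name summarise_sizes is hypothetical, and it reports a size-weighted total, which may not be how the 16% and 19% figures quoted here were derived.

```shell
#!/bin/sh
# summarise_sizes FILE
# Hypothetical helper (not part of the original script): sums the GZIP and
# ZSTD byte columns of a "FILE GZIP ZSTD" table, skipping the header line,
# and prints the size-weighted percentage reduction.
summarise_sizes() {
  awk 'NR > 1 && $2 > 0 { gz += $2; zs += $3 }
       END {
         if (gz > 0)
           printf "gzip=%.0f zstd=%.0f reduction=%.1f%%\n", gz, zs, 100 * (gz - zs) / gz
       }' "$1"
}
```

It would be called as `summarise_sizes sizes.txt` after the loop has finished. A size-weighted total lets large packages dominate; a per-package mean would need a different awk expression.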
>>
>> Zstd is BSD licensed C code that can easily be embedded in any project.
>> <sources.txt><binaries.txt>
>> ______________________________________________
>> R-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel