[Rd] inflate zlib compressed data using base R or CRAN package?

Fri Nov 29 16:17:12 CET 2013

On Nov 29, 2013, at 4:37 AM, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:

> On Thu, Nov 28, 2013 at 4:48 PM, Simon Urbanek
> <simon.urbanek at r-project.org> wrote:
>> On Nov 27, 2013, at 8:30 PM, Murray Stokely <murray at stokely.org> wrote:
>> 
>>> I think none of these examples describe a zlib compressed data block inside a binary file that the OP asked about, as all of your examples are e.g. prepending gzip or zip headers.
>>> 
>>> Greg, is memDecompress what you are looking for?
>>> 
>> 
>> I think so.
>> 
>> But this is interesting: I think the documentation of memCompress/memDecompress is not quite correct and the parameters are misleading. Although it does mention the gzip headers, it is incorrect, since the zlib format is not a subset of the gzip format (albeit they use the same compression method), so you cannot extract gzip content using zlib decompression. You'll get "internal error -3 in memDecompress(2)" if you try it, since it expects the zlib header, which is different from the gzip one.
> 
> Interesting.  Just to make sure: are you 100% certain about this?

Yes, see below.

>> From the http://svn.r-project.org/R/trunk/src/main/connections.c:
> 
>    case 2: /* gzip */
>    {
> 	uLong inlen = LENGTH(from), outlen = 3*inlen;
> 	int res;
> 	Bytef *buf, *p = (Bytef *)RAW(from);
> 	/* we check for a file header */
> 	if (p[0] == 0x1f && p[1] == 0x8b) { p += 2; inlen -= 2; }
> 	while(1) {
> 	    buf = (Bytef *) R_alloc(outlen, sizeof(Bytef));
> 	    res = uncompress(buf, &outlen, p, inlen);
> 	    if(res == Z_BUF_ERROR) { outlen *= 2; continue; }
> 	    if(res == Z_OK) break;
> 	    error("internal error %d in memDecompress(%d)", res, type);
> 	}
> 	ans = allocVector(RAWSXP, outlen);
> 	memcpy(RAW(ans), buf, outlen);
> 	break;
>    }
> 
> That code looks for the 0x1F 0x8B magic number, which is the one for
> gzip [http://www.gzip.org/zlib/rfc-gzip.html#header-trailer].  Or are
> you saying that that if statement is incorrect?  (Disclaimer: I don't
> know much about gzip/zlib, but I happen to recognize that gzip magic
> number.)
> 

The above assumes that zlib is a subset of gzip, which is *not* true - that was the point I was making. zlib has *different* headers than gzip, not just fewer bytes. gzip has lots of other things in the header, and the two even use different checksums (CRC-32 for gzip, Adler-32 for zlib).

To illustrate:

> writeBin(charToRaw("1234"), f<-gzfile("test.gz","wb"))
> close(f)
> readBin("test.gz",raw(),100)
 [1] 1f 8b 08 00 00 00 00 00 00 03 33 34 32 36 01
[16] 00 a3 e0 e3 9b 04 00 00 00
> memCompress("1234")
 [1] 78 9c 33 34 32 36 01 00 01 f8 00 cb

As you can see, gzip uses a different header (it starts with 0x1f 0x8b but then has several other fields such as the modification time) - the compressed payload starts at byte 11 and is followed by an 8-byte trailer (a CRC-32 and the 32-bit uncompressed length). In contrast, zlib has no magic number, just a two-byte header followed by the payload (starting at byte 3) and a 32-bit Adler-32 checksum. So the two are incompatible as containers: you cannot decompress the gzip format with a zlib parser or vice versa. The payload is the same, but the headers and trailers are entirely different. That's why Greg was specifically asking about zlib, which does *not* mean gzip.
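
For anyone who has to handle both containers in memory, the following is a minimal sketch rather than a definitive implementation. The helper name inflate_any is hypothetical (it is not part of base R). It assumes that a gzip stream can be recognised by its 0x1f 0x8b magic number and a zlib stream by the RFC 1950 header rule (compression method 8, and the two header bytes read as a big-endian number divisible by 31). gzip input is read through gzcon() on a rawConnection(); zlib input goes to memDecompress(), whose type = "gzip" is the zlib path discussed in this thread.

```r
# Hypothetical helper, not part of base R: guess the container from the
# first two bytes of a raw vector and inflate it accordingly.
inflate_any <- function(x) {
  stopifnot(is.raw(x), length(x) >= 2L)
  b <- as.integer(x[1:2])

  if (b[1] == 0x1f && b[2] == 0x8b) {
    # gzip (RFC 1952): let gzcon() parse the header and the trailer
    con <- gzcon(rawConnection(x))
    on.exit(close(con))
    out <- raw(0)
    repeat {
      chunk <- readBin(con, raw(), 65536L)
      if (length(chunk) == 0L) break
      out <- c(out, chunk)
    }
    return(out)
  }

  # zlib (RFC 1950): low nibble of byte 1 is 8 (deflate), and the two
  # header bytes taken as a big-endian number are a multiple of 31
  if (b[1] %% 16L == 8L && (b[1] * 256L + b[2]) %% 31L == 0L) {
    return(memDecompress(x, type = "gzip"))  # "gzip" here means zlib
  }

  stop("input starts with neither a gzip nor a zlib header")
}

# zlib stream, as produced by memCompress() in the transcript above
z <- memCompress(charToRaw("1234"), type = "gzip")

# gzip file written to a temporary path, then read back as raw bytes
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(charToRaw("1234"), con)
close(con)
g <- readBin(tmp, raw(), file.info(tmp)$size)

rawToChar(inflate_any(z))
rawToChar(inflate_any(g))
```

Dispatching on the header keeps existing callers of memDecompress(type = "gzip") untouched, so it avoids the compatibility problem of redefining what type = "gzip" means.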

Cheers,
Simon

> /Henrik
> 
>> So “gzip” in type is a misnomer - it should say “zlib” since it can neither read nor write the gzip format. Also the documentation should make it clear since it’s pointless to try to use this on gzip contents. The better alternative would be to support both gzip and zlib since R can deal with both — the issue is that it will break code that used type=“gzip” explicitly to mean “zlib” so I’m not sure there is a good way out.
>> 
>> Cheers,
>> Simon
>> 
>> 
>>> 
>>> On Wed, Nov 27, 2013 at 5:22 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
>>> 
>>>> 
>>>> On 27 November 2013 at 18:38, Dirk Eddelbuettel wrote:
>>>> |
>>>> | On 27 November 2013 at 23:49, Dr Gregory Jefferis wrote:
>>>> | | I have a binary file type that includes a zlib compressed data block (ie
>>>> | | not gzip). Is anyone aware of a way using base R or a CRAN package to
>>>> | | decompress this kind of data (from disk or memory). So far I have found
>>>> | | Rcompression::decompress on omegahat, but I would prefer to keep
>>>> | | dependencies on CRAN (or bioconductor). I am also trying to avoid
>>>> | | writing yet another C level interface to part of zlib.
>>>> |
>>>> | Unless I am missing something, this is in base R; see help(connections).
>>>> |
>>>> | Here is a quick demo:
>>>> |
>>>> | R> write.csv(trees, file="/tmp/trees.csv")    # data we all have
>>>> | R> system("gzip -v /tmp/trees.csv")           # as I am lazy here
>>>> | /tmp/trees.csv:        50.5% -- replaced with /tmp/trees.csv.gz
>>>> | R> read.csv(gzfile("/tmp/trees.csv.gz"))      # works out of the box
>>>> 
>>>> Oh, and in case you meant zip file containing a data file, that also works.
>>>> 
>>>> First converting what I did last
>>>> 
>>>> edd at max:/tmp$ gunzip trees.csv.gz
>>>> edd at max:/tmp$ zip trees.zip trees.csv
>>>> adding: trees.csv (deflated 50%)
>>>> edd at max:/tmp$
>>>> 
>>>> Then reading the csv from inside the zip file:
>>>> 
>>>> R> read.csv(unz("/tmp/trees.zip", "trees.csv"))
>>>>   X Girth Height Volume
>>>> 1   1   8.3     70   10.3
>>>> 2   2   8.6     65   10.3
>>>> 3   3   8.8     63   10.2
>>>> 4   4  10.5     72   16.4
>>>> 5   5  10.7     81   18.8
>>>> 6   6  10.8     83   19.7
>>>> 7   7  11.0     66   15.6
>>>> 8   8  11.0     75   18.2
>>>> 9   9  11.1     80   22.6
>>>> 10 10  11.2     75   19.9
>>>> 11 11  11.3     79   24.2
>>>> 12 12  11.4     76   21.0
>>>> 13 13  11.4     76   21.4
>>>> 14 14  11.7     69   21.3
>>>> 15 15  12.0     75   19.1
>>>> 16 16  12.9     74   22.2
>>>> 17 17  12.9     85   33.8
>>>> 18 18  13.3     86   27.4
>>>> 19 19  13.7     71   25.7
>>>> 20 20  13.8     64   24.9
>>>> 21 21  14.0     78   34.5
>>>> 22 22  14.2     80   31.7
>>>> 23 23  14.5     74   36.3
>>>> 24 24  16.0     72   38.3
>>>> 25 25  16.3     77   42.6
>>>> 26 26  17.3     81   55.4
>>>> 27 27  17.5     82   55.7
>>>> 28 28  17.9     80   58.3
>>>> 29 29  18.0     80   51.5
>>>> 30 30  18.0     80   51.0
>>>> 31 31  20.6     87   77.0
>>>> R>
>>>> 
>>>> Regards, Dirk
>>>> 
>>>> --
>>>> Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
>>>> 
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>> 
>>> 
>