[Rd] inflate zlib compressed data using base R or CRAN package?

Henrik Bengtsson hb at biostat.ucsf.edu
Fri Nov 29 10:37:20 CET 2013


On Thu, Nov 28, 2013 at 4:48 PM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
> On Nov 27, 2013, at 8:30 PM, Murray Stokely <murray at stokely.org> wrote:
>
>> I don't think any of these examples describes a zlib compressed data block inside a binary file, which is what the OP asked about; all of your examples involve gzip or zip headers.
>>
>> Greg, is memDecompress what you are looking for?
>>
>
> I think so.
>
> But this is interesting: I think the documentation of memCompress/memDecompress is not quite correct and the parameters are misleading. Although it does mention the gzip headers, it is incorrect, since the zlib format is not a subset of the gzip format (although they use the same compression method). So you cannot extract gzip content using zlib decompression: you’ll get "internal error -3 in memDecompress(2)" if you try it, since it expects the zlib header, which is different from the gzip one.

Interesting.  Just to make sure: are you 100% certain about this?
From http://svn.r-project.org/R/trunk/src/main/connections.c:

    case 2: /* gzip */
    {
	uLong inlen = LENGTH(from), outlen = 3*inlen;
	int res;
	Bytef *buf, *p = (Bytef *)RAW(from);
	/* we check for a file header */
	if (p[0] == 0x1f && p[1] == 0x8b) { p += 2; inlen -= 2; }
	while(1) {
	    buf = (Bytef *) R_alloc(outlen, sizeof(Bytef));
	    res = uncompress(buf, &outlen, p, inlen);
	    if(res == Z_BUF_ERROR) { outlen *= 2; continue; }
	    if(res == Z_OK) break;
	    error("internal error %d in memDecompress(%d)", res, type);
	}
	ans = allocVector(RAWSXP, outlen);
	memcpy(RAW(ans), buf, outlen);
	break;
    }

That code looks for the 0x1F 0x8B magic number, which is the one for
gzip [http://www.gzip.org/zlib/rfc-gzip.html#header-trailer].  Or are
you saying that the if statement is incorrect?  (Disclaimer: I don't
know much about gzip/zlib, but I happen to recognize that gzip magic
number.)
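
For what it's worth, the two headers can be compared directly from R.
The following is only a sketch: is_zlib_header() and is_gzip_header()
are names made up for illustration (they are not base R), and the
header rules are my reading of RFC 1950 and RFC 1952.  It also checks
what is left of a gzip stream once the two magic bytes are skipped,
which is what the C code above hands to uncompress().

```r
# Payload to compress; any raw vector would do.
payload <- charToRaw(paste(rep("zlib versus gzip", 50), collapse = " "))

# (1) What memCompress(type = "gzip") writes.
z <- memCompress(payload, type = "gzip")

# (2) A real gzip stream, written through a gzfile() connection.
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(payload, con)
close(con)
g <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# Hypothetical helper: zlib header test (RFC 1950). The low nibble of the
# first byte is 8 (deflate) and the first two bytes, read as a big-endian
# 16-bit number, are a multiple of 31.
is_zlib_header <- function(x) {
  b <- as.integer(x[1:2])
  (b[1] %% 16L == 8L) && ((b[1] * 256L + b[2]) %% 31L == 0L)
}

# Hypothetical helper: gzip magic number test (RFC 1952).
is_gzip_header <- function(x) identical(x[1:2], as.raw(c(0x1f, 0x8b)))

print(z[1:2])                      # header bytes from memCompress()
print(g[1:2])                      # header bytes from gzfile()
print(is_zlib_header(z))           # does memCompress() output look like zlib?
print(is_zlib_header(g[-(1:2)]))   # gzip stream with the magic bytes skipped
```

If the last check comes out FALSE, as I would expect, then skipping the
two magic bytes is not enough to turn a gzip stream into something
uncompress() accepts, which would be consistent with the error Simon
describes.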

/Henrik

> So “gzip” in type is a misnomer: it should say “zlib”, since it can neither read nor write the gzip format. The documentation should also make this clear, since it’s pointless to try to use this on gzip contents. The better alternative would be to support both gzip and zlib, since R can deal with both. The issue is that this would break code that used type=“gzip” explicitly to mean “zlib”, so I’m not sure there is a good way out.
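
Until that is sorted out, user code could perhaps cope with both
formats by dispatching on the first two bytes.  A minimal sketch,
assuming the input is one complete gzip or zlib stream held in a raw
vector; inflate_any() and its chunk argument are hypothetical names,
not part of base R:

```r
# Hypothetical helper (not base R): inflate a raw vector that holds either
# a gzip (RFC 1952) or a zlib (RFC 1950) stream.
inflate_any <- function(x, chunk = 65536L) {
  stopifnot(is.raw(x), length(x) >= 2L)
  if (identical(x[1:2], as.raw(c(0x1f, 0x8b)))) {
    # gzip magic number found: read through gzcon() wrapped around an
    # in-memory connection, one chunk at a time, until nothing is left.
    con <- gzcon(rawConnection(x))
    on.exit(close(con))
    out <- list()
    repeat {
      buf <- readBin(con, what = "raw", n = chunk)
      if (length(buf) == 0L) break
      out[[length(out) + 1L]] <- buf
    }
    if (length(out) > 0L) do.call(c, out) else raw(0)
  } else {
    # Otherwise assume a zlib stream, which is what type = "gzip" handles.
    memDecompress(x, type = "gzip")
  }
}

# Round trip through the zlib branch.
payload <- charToRaw("some bytes to round-trip")
z <- memCompress(payload, type = "gzip")
stopifnot(identical(inflate_any(z), payload))
```

For a compressed block embedded in a larger binary file, the block
would first have to be cut out of the raw vector returned by readBin(),
using whatever offset and length the file format specifies.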
>
> Cheers,
> Simon
>
>
>>
>> On Wed, Nov 27, 2013 at 5:22 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
>>
>>>
>>> On 27 November 2013 at 18:38, Dirk Eddelbuettel wrote:
>>> |
>>> | On 27 November 2013 at 23:49, Dr Gregory Jefferis wrote:
>>> | | I have a binary file type that includes a zlib compressed data block (i.e.
>>> | | not gzip). Is anyone aware of a way using base R or a CRAN package to
>>> | | decompress this kind of data (from disk or memory). So far I have found
>>> | | Rcompression::decompress on omegahat, but I would prefer to keep
>>> | | dependencies on CRAN (or bioconductor). I am also trying to avoid
>>> | | writing yet another C level interface to part of zlib.
>>> |
>>> | Unless I am missing something, this is in base R; see help(connections).
>>> |
>>> | Here is a quick demo:
>>> |
>>> | R> write.csv(trees, file="/tmp/trees.csv")    # data we all have
>>> | R> system("gzip -v /tmp/trees.csv")           # as I am lazy here
>>> | /tmp/trees.csv:        50.5% -- replaced with /tmp/trees.csv.gz
>>> | R> read.csv(gzfile("/tmp/trees.csv.gz"))      # works out of the box
>>>
>>> Oh, and in case you meant zip file containing a data file, that also works.
>>>
>>> First converting what I did last
>>>
>>> edd at max:/tmp$ gunzip trees.csv.gz
>>> edd at max:/tmp$ zip trees.zip trees.csv
>>>  adding: trees.csv (deflated 50%)
>>> edd at max:/tmp$
>>>
>>> Then reading the csv from inside the zip file:
>>>
>>> R> read.csv(unz("/tmp/trees.zip", "trees.csv"))
>>>    X Girth Height Volume
>>> 1   1   8.3     70   10.3
>>> 2   2   8.6     65   10.3
>>> 3   3   8.8     63   10.2
>>> 4   4  10.5     72   16.4
>>> 5   5  10.7     81   18.8
>>> 6   6  10.8     83   19.7
>>> 7   7  11.0     66   15.6
>>> 8   8  11.0     75   18.2
>>> 9   9  11.1     80   22.6
>>> 10 10  11.2     75   19.9
>>> 11 11  11.3     79   24.2
>>> 12 12  11.4     76   21.0
>>> 13 13  11.4     76   21.4
>>> 14 14  11.7     69   21.3
>>> 15 15  12.0     75   19.1
>>> 16 16  12.9     74   22.2
>>> 17 17  12.9     85   33.8
>>> 18 18  13.3     86   27.4
>>> 19 19  13.7     71   25.7
>>> 20 20  13.8     64   24.9
>>> 21 21  14.0     78   34.5
>>> 22 22  14.2     80   31.7
>>> 23 23  14.5     74   36.3
>>> 24 24  16.0     72   38.3
>>> 25 25  16.3     77   42.6
>>> 26 26  17.3     81   55.4
>>> 27 27  17.5     82   55.7
>>> 28 28  17.9     80   58.3
>>> 29 29  18.0     80   51.5
>>> 30 30  18.0     80   51.0
>>> 31 31  20.6     87   77.0
>>> R>
>>>
>>> Regards, Dirk
>>>
>>> --
>>> Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>


