[R] memDecompress and zlib compressed base64 encoded string

Johannes Graumann johannes_graumann at web.de
Fri Jan 15 11:48:41 CET 2010


Prof Brian Ripley wrote:

>> I have zlib compressed strings (example is attached)
> 
> What is that file? Not gzip compression:
> 
> gannet% file compressed.txt
> compressed.txt: ASCII text, with very long lines
> 
> since gzip uses a magic header that 'file' knows about.  And even if
> the header was stripped, such files are 8-bit and yours is ASCII.
> Try
>> x <- 'Johannes Graumann'
>> xx <- charToRaw(x)
>> xxx <- memCompress(xx, "g")
>> rawToChar(xxx)
> [1] "x\x9c\xf3\xca\xcfH\xcc\xcbK-Vp/J,\xcd\0052\001:\n\006\x90"
> 
> to see what a real gzipped string looks like.
> 
>> and would like to decompress them using memDecompress ...
>>
>> I try this:
>>> connection <- file("compressed.txt","r")
>>> compressed <- readLines(connection)
I am dealing with mass spectrometric data in an XML file format (mzXML). The 
biggest part of the contained data is the actual mass spectra, which are base64 
encoded and optionally compressed with zlib (http://zlib.net), saving quite some 
storage space. When they are compressed, I just get an XML node that looks 
like this
   <peaks>CONTENT OF THE ORIGINAL ATTACHMENT HERE</peaks>
I would like to be able to decompress that string and thought that 
memDecompress was the right tool to do so ...
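
The chain described above (base64 text, then zlib-compressed bytes, then the 
original data) can be sketched in base R as a round trip. This is only an 
illustration under assumptions: raw_to_base64 and base64_to_raw are 
hypothetical helpers written for this sketch (a package such as base64enc 
would normally do that job), the payload is a made-up string rather than a 
real mzXML spectrum, and the compressed stream is assumed to be of the kind 
that memCompress(type = "gzip") produces.

```r
# Hypothetical helpers (not part of base R): a minimal base64 codec.
b64_alphabet <- c(LETTERS, letters, 0:9, "+", "/")

raw_to_base64 <- function(r) {
  # Expand every byte to 8 bits, most significant bit first.
  bits <- as.vector(sapply(as.integer(r),
                           function(v) rev(as.integer(intToBits(v))[1:8])))
  # Pad with zero bits to a multiple of 6 and regroup into 6-bit values.
  bits <- matrix(c(bits, integer((6L - length(bits) %% 6L) %% 6L)), nrow = 6L)
  out <- paste(b64_alphabet[colSums(bits * 2^(5:0)) + 1], collapse = "")
  # Append "=" padding so the length is a multiple of 4.
  paste0(out, strrep("=", (3L - length(r) %% 3L) %% 3L))
}

base64_to_raw <- function(s) {
  # Drop padding, whitespace and anything else outside the alphabet.
  chars <- strsplit(gsub("[^A-Za-z0-9+/]", "", s), "")[[1]]
  vals <- match(chars, b64_alphabet) - 1L
  # Each character carries 6 bits, most significant bit first.
  bits <- as.vector(sapply(vals,
                           function(v) rev(as.integer(intToBits(v))[1:6])))
  # Keep only whole bytes and regroup into 8-bit values.
  nbytes <- length(bits) %/% 8L
  bits <- matrix(bits[seq_len(nbytes * 8L)], nrow = 8L)
  as.raw(colSums(bits * 2^(7:0)))
}

# Build a stand-in for the content of a <peaks> node.
original   <- charToRaw("Johannes Graumann")
compressed <- memCompress(original, type = "gzip")
peaks_text <- raw_to_base64(compressed)

# Reverse the steps: base64 text -> raw bytes -> decompressed bytes.
decoded  <- base64_to_raw(peaks_text)
restored <- memDecompress(decoded, type = "gzip")
rawToChar(restored)
```

The point of the sketch is the order of operations: memDecompress needs the 
raw bytes obtained by base64-decoding the string, not the string itself.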

> You have not told us the 'at a minimum' information requested in the
> posting guide.  But you should not expect that to read a binary file,
> especially not in a MBCS locale.  We have readBin for that purpose.
I'm actually reading this in as a string from the XML file ...
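
For the case mentioned in the quoted advice, where the compressed bytes do 
live in a binary file, a minimal sketch of reading them with readBin might 
look like the following. The temporary file and its content are made up for 
illustration; they are not the original attachment.

```r
# Write some compressed bytes to a scratch file (illustrative data only).
path <- tempfile(fileext = ".bin")
compressed <- memCompress(charToRaw("Johannes Graumann"), type = "gzip")
writeBin(compressed, path)

# Read the file back as raw bytes rather than as lines of text.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)

# The raw vector can be handed to memDecompress directly.
restored <- rawToChar(memDecompress(bytes, type = "gzip"))
unlink(path)
```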

>>> memDecompress(as.raw(compressed),type="g")
> 
> I don't think you know what as.raw does: it does not convert bytes in
> a character string to raw (for which you need charToRaw).
> 
> It is always a good idea to look at each stage of your computation:
> 
>> as.raw(compressed)
>   [1] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
> [26] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
Yup, that was plain stupid: I was just trying to make memDecompress run at 
all (handing it the character string also resulted in an error).
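
The difference between as.raw and charToRaw pointed out above can be seen in 
a short session. This is a sketch with a made-up input string; the warnings 
from as.raw are suppressed only to keep the output short.

```r
x <- "abc"

# as.raw() goes through a numeric coercion, so non-numeric text ends up as 00
# (normally accompanied by coercion warnings).
bad <- suppressWarnings(as.raw(x))

# charToRaw() returns the bytes of the string itself.
good <- charToRaw(x)

print(bad)
print(good)
```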

> sessionInfo() 
R version 2.10.1 (2009-12-14) 
x86_64-pc-linux-gnu 

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rkward_0.5.1

loaded via a namespace (and not attached):
[1] tools_2.10.1

Thanks for any further hints, Joh


