[R] gzfile with multiple entries in the archive
John James
jjames at mango-solutions.com
Fri Nov 17 16:27:03 CET 2006
Following suggestions from Prof. Ripley and several others to use gzfile,
here's rough code that will unzip a tgz into your working directory and
return a list of the files. (It doesn't warn you that it is overwriting
files!)
The magic numbers (such as 512) refer to the current tar header
specification; the read block size and the limits on blocks and files are
arbitrary defaults.
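As a rough illustration of what those magic numbers refer to, the sketch below picks the member name and size out of a single 512-byte header block. It is not the code from this post: `parse_tar_header` is a hypothetical helper name, and the byte ranges (name in bytes 1-100, size as octal text in bytes 125-136, counting from 1) assume the POSIX ustar layout.

```r
# Hypothetical helper: read the name and size from one 512-byte tar header.
# Assumes the ustar layout: name at bytes 1-100, octal size at bytes 125-136.
parse_tar_header <- function(block) {
  stopifnot(is.raw(block), length(block) == 512L)
  # Extract a NUL-terminated text field from a byte range.
  field <- function(from, to) {
    bytes <- block[from:to]
    nul <- which(bytes == as.raw(0))
    if (length(nul) > 0) bytes <- bytes[seq_len(nul[1] - 1L)]
    rawToChar(bytes)
  }
  list(
    name = field(1, 100),
    size = strtoi(trimws(field(125, 136)), base = 8L)
  )
}

# Build a made-up header for a 10-byte file called "a.txt" and parse it.
hdr <- raw(512)
hdr[1:5] <- charToRaw("a.txt")
hdr[125:135] <- charToRaw("00000000012")  # 12 in octal is 10 in decimal
parse_tar_header(hdr)
```

Because `strtoi()` returns an R integer, a size field too large for a 32-bit integer comes back as NA in this sketch.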
It is inefficient in that it re-reads the file from the start for every
file. I couldn't keep the file pointer in place and switch the readBin mode
back from 'character' to 'raw', although the reverse switch (from 'raw' to
'character') is used in the code! Is there a setting I've missed?
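One possible way around the mode switch is to read everything as 'raw' and convert the header bytes with `rawToChar()`, so a single connection can be read front to back. The following is a minimal sketch under stated assumptions, not a tested replacement: `extract_tgz` is a hypothetical name, the offsets assume the ustar layout, only regular files are written, and subdirectories are not created.

```r
# Hypothetical helper: extract the regular files of a .tgz in one sequential pass.
extract_tgz <- function(path, exdir = ".") {
  con <- gzfile(path, "rb")
  on.exit(close(con))
  # Turn a NUL-padded header field into a string without leaving raw mode.
  field <- function(bytes) trimws(rawToChar(bytes[bytes != as.raw(0)]))
  extracted <- character()
  repeat {
    header <- readBin(con, "raw", n = 512L)
    # A short read or an all-zero block marks the end of the archive.
    if (length(header) < 512L || all(header == as.raw(0))) break
    name <- field(header[1:100])
    size <- strtoi(field(header[125:136]), base = 8L)
    if (is.na(size)) stop("unreadable size field for ", name)
    body <- readBin(con, "raw", n = size)
    # Member data is padded to a multiple of 512 bytes; skip the padding.
    padding <- (512L - size %% 512L) %% 512L
    if (padding > 0L) readBin(con, "raw", n = padding)
    # Assumption: byte 157 is the type flag; "0" or NUL means a regular file.
    typeflag <- header[157]
    if (typeflag == as.raw(0) || typeflag == charToRaw("0")) {
      writeBin(body, file.path(exdir, name))
      extracted <- c(extracted, name)
    }
  }
  extracted
}
```

A call such as `extract_tgz('test.tgz')` would return the names of the files it wrote. Like the function in this post, it overwrites existing files without warning.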
Also, is there a better way to do the convert(..) function?
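For the octal case, current versions of R provide `strtoi()` in base, which parses a string in a given base. A small sketch, with a made-up size field as input:

```r
# Octal text, as stored in a tar header size field, converted to an integer.
size_field <- "00000001750"
size <- strtoi(size_field, base = 8L)
size  # 1000
```

`strtoi()` returns NA when the text is not valid in the given base or the value does not fit in an R integer, so very large sizes would need separate handling.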
All criticisms gratefully received, especially being pointed to an existing
function.
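On the question of an existing function: later versions of R include `untar()` and `tar()` in the utils package, which can list and extract members of a .tgz without an external tar program. The sketch below builds a throwaway archive first so that it is self-contained; the file and directory names are made up for illustration.

```r
# Build a small demo archive in a temporary directory (names are illustrative).
demo_dir <- tempfile("tgzdemo")
dir.create(demo_dir)
tgz <- file.path(demo_dir, "demo.tgz")
old_wd <- setwd(demo_dir)
writeLines("hello", "hello.txt")
tar(tgz, files = "hello.txt", compression = "gzip", tar = "internal")
setwd(old_wd)

# List the members of the archive.
members <- untar(tgz, list = TRUE, tar = "internal")

# Extract one named member into a separate directory.
out_dir <- file.path(demo_dir, "out")
untar(tgz, files = "hello.txt", exdir = out_dir, tar = "internal")
```

Passing `tar = "internal"` asks for R's own implementation, which is the relevant case on Windows.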
John James
Mango Solutions
unzip <- function(x, archiveDirectory = '.', zipExtension = 'tgz',
                  block = 50000, maxBlocks = 100, maxCountFiles = 100) {
  # Example
  #   unzip('test.tgz')

  # Convert a number whose decimal digits are really base-oldRoot digits
  # (e.g. the octal size field of a tar header) to base 10.
  convert <- function(oct = 2, oldRoot = 8, newRoot = 10) {
    if (newRoot == 16)
      return(structure(convert(oct, oldRoot, 10), class = 'hexmode'))
    if (newRoot > 10)
      stop('WIP')
    if (inherits(oct, 'hexmode')) {
      oct <- unclass(oct)
      if (newRoot == 10)
        return(oct)
      stop('WIP')
    }
    oct <- as.numeric(oct)
    ret <- 0
    oldPower <- 1
    # Peel digits off the right and weight each by a power of oldRoot.
    while (oct > 0.1) {
      newoct <- floor(oct / newRoot)
      rem <- oct - newoct * newRoot
      ret <- rem * oldPower + ret
      oldPower <- oldPower * oldRoot
      oct <- newoct
    }
    ret
  }

  listOfFiles <- list()
  theArchives <- list.files(archiveDirectory, pattern = zipExtension)
  matches <- grep(x, theArchives)
  if (length(matches) == 0)
    stop('No archive matching *', x, '*.', zipExtension, ' found')
  # If several archives match, use the first one.
  what <- file.path(archiveDirectory, theArchives[matches[1]])
  tmp <- tempfile()
  on.exit(unlink(tmp))
  nextBlockStartsAt <- readUpTo <- countFiles <- mu <- safety <- 0

  # Decompress the whole archive into a temporary .tar file.
  zz <- gzfile(what, 'rb')
  ww <- file(tmp, 'wb')
  while (length(mu) > 0) {
    if (safety > maxBlocks) {
      close(zz)
      close(ww)
      stop('Archive file too large')
    }
    safety <- safety + 1
    mu <- readBin(zz, 'raw', block)
    writeBin(mu, ww)
  }
  close(zz)
  close(ww)

  # Re-read the .tar file from the start for every member.
  while (countFiles < maxCountFiles) {
    countFiles <- countFiles + 1
    zz <- file(tmp, 'rb')
    stuff <- readBin(zz, 'raw', n = nextBlockStartsAt)
    header <- readBin(zz, character(), n = 100)
    close(zz)
    # Non-empty header strings 1 and 5 are the file name and the octal size.
    header <- header[nchar(header) > 0][c(1, 5)]
    if (any(is.na(header)))
      break
    listOfFiles[[countFiles]] <- header[1]
    dataStart <- 512 + nextBlockStartsAt
    readUpTo <- dataStart + convert(header[2])
    zz <- file(tmp, 'rb')
    body <- readBin(zz, 'raw', n = readUpTo)
    close(zz)
    writeBin(body[-seq_len(dataStart)], header[1])
    # Data is padded to a multiple of 512 bytes, so round up to the next
    # block boundary (no extra block when the size is an exact multiple).
    nextBlockStartsAt <- ceiling(readUpTo / 512) * 512
  }
  listOfFiles
}
-----Original Message-----
From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
Sent: 14 November 2006 15:18
To: John James
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] gzfile with multiple entries in the archive
On Tue, 14 Nov 2006, John James wrote:
> If I open a tgz archive with gzfile and then parse it using readLines I miss
> the initial line of each member of the archive - and also the name of the
> file, although the archive is otherwise complete (but useless!).
You can use a gzfile connection to read the underlying .tar file, but that
is not a text file and you will need to pick its structure apart yourself
via readBin and readChar.
> Is there any way within R to extract both the list of files in a tgz archive
> and to extract any one of these files?
> Clearly I can use zcat and tar on Linux, but I need this to work within the
> R environment on Windows!
You could use tar on Windows: it is in the R tools set.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595