[Rd] unexpected behavior of unzip with list=T and unzip=/usr/bin/unzip

Paul Schrimpf p@ul@@chrimpf @ending from gm@il@com
Wed Jul 4 22:08:28 CEST 2018


Hello,

I encountered some unexpected behavior of unzip when using Info-ZIP's unzip
instead of R's internal program. Specifically, unzip("file.zip", list=TRUE,
unzip="/usr/bin/unzip") produces incorrect output if the zip archive has
filenames with spaces, and results in an error if the zip archive includes
an archive comment or file comments.

Here is some code to reproduce the problem, using the attached files:

## (mostly) expected behavior
res.intern <- unzip("noSpaces.zip",list=TRUE)
res.infozip <- unzip("noSpaces.zip",list=TRUE,unzip="/usr/bin/unzip")

identical(res.intern,res.infozip) ## will be FALSE, but expected from
                                  ## documentation about dates
identical(res.infozip$Name,res.intern$Name)     ## TRUE
res.infozip$Length==res.intern$Length           ## TRUE
identical(res.infozip$Length,res.intern$Length) ## FALSE, because the
                                                ## former is numeric, the
                                                ## latter integer

## More problematic cases
print(unzip("fileNameWithSpaces.zip",list=TRUE))
print(unzip("fileNameWithSpaces.zip",list=TRUE,unzip="/usr/bin/unzip"))
      ## read.table is used to parse output of unzip -l, and gets
      ## confused by extra spaces

print(unzip("withArchiveComment.zip",list=TRUE))
print(unzip("withArchiveComment.zip",list=TRUE,unzip="/usr/bin/unzip"))
      ## produces an error

print(unzip("entryComments.zip",list=TRUE))
print(unzip("entryComments.zip",list=TRUE,unzip="/usr/bin/unzip"))
      ## produces an error
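
To illustrate why whitespace-delimited parsing is fragile for the first
problematic case, here is a minimal sketch that counts fields in a simulated
listing. The lines are made up to resemble "unzip -l" output rather than
captured from a real archive, and the file names are hypothetical.

```r
## Simulated listing lines (hypothetical file names; layout assumed to
## resemble "unzip -l" output, not captured from a real archive).
listing <- c(
  "  Length      Date    Time    Name",
  "        5  2018-07-04 21:50   file one.txt",
  "       12  2018-07-04 21:51   other.txt"
)

## count.fields() splits on whitespace, the same default separator that
## read.table() uses.
con <- textConnection(listing)
fields <- utils::count.fields(con)
close(con)

print(fields)
```

The row whose name contains a space has one more field than the header, so
the columns can no longer be matched up by counting whitespace-separated
fields.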

Looking at the code for R's unzip, the basic problem is that it makes a
bunch of assumptions about the format of the output of "unzip -l"  that are
not always true and are not verified.
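
As an illustration of one such assumption, the sketch below compares two
simulated listings, one with an extra comment line. The layout (including
where the comment line appears) is an assumption about Info-ZIP's output, and
header_at() is a hypothetical helper, not part of R. It only shows that the
position of the column header is not fixed, so a parser that drops lines by
index would be off by one.

```r
## Simulated "unzip -l" output (assumed layout, hypothetical file names).
plain <- c(
  "Archive:  noSpaces.zip",
  "  Length      Date    Time    Name",
  "---------  ---------- -----   ----",
  "        5  2018-07-04 21:50   a.txt",
  "---------                     -------",
  "        5                     1 file"
)
## Assumption: an archive comment is printed right after the first line.
commented <- append(plain, "an archive comment", after = 1L)

## Hypothetical helper: find the column header by pattern, not by position.
header_at <- function(res) {
  grep("^[[:space:]]*Length[[:space:]]+Date[[:space:]]+Time[[:space:]]+Name",
       res)
}

print(header_at(plain))
print(header_at(commented))
```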

It's unclear to me whether R's unzip should be expected to be compatible
with all sorts of external unzip programs, so perhaps a sufficient solution
is simply to revise the documentation (which already mentions potential
problems with dates when list=TRUE is used with an external program).

Alternatively, R's unzip function could be changed to work with Info-ZIP
unzip by:
(1) passing "-ql" instead of just "-l" when list=TRUE, to eliminate the
printing of comments
(2) not using read.table to parse the output of unzip, and instead doing
something like the following (which is an admittedly messy workaround)

            ## "-q" is added to eliminate the printing of comments
            res <- if (WINDOWS)
                system2(unzip, c("-ql", shQuote(zipfile)), stdout = TRUE)
            else system2(unzip, c("-ql", shQuote(zipfile)), stdout = TRUE,
                env = c("TZ=UTC"))
            ## entries lie between the two rules of dashes; the pattern is
            ## anchored so that a file name containing "--" cannot match
            dashes <- grep("^--",res)
            s <- dashes[1]+1
            l <- dashes[2]-1
            ## column positions are read off the first rule of dashes
            starts <- gregexpr("-+",res[dashes[1]])[[1]]
            ends <- gregexpr("[[:space:]]+",res[dashes[1]])[[1]]
            z <- data.frame(
                Name=sapply(res[s:l], function(x) {
                  substr(x, starts[4], stop=nchar(x))
                }),
                Length=sapply(res[s:l], function(x) {
                  as.numeric(substr(x, starts[1], stop=ends[1]))
                }),
                ## trimws() drops the separator that ends[] includes
                Date=sapply(res[s:l], function(x) {
                  trimws(substr(x, starts[2], stop=ends[2]))
                }),
                Time=sapply(res[s:l], function(x) {
                  trimws(substr(x, starts[3], stop=ends[3]))
                }),
                stringsAsFactors=FALSE
            )
            rownames(z) <- NULL
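
The same fixed-width idea can be tried without running an external program.
The following is only a sketch under stated assumptions: parse_unzip_listing()
is a hypothetical helper, and the input is simulated text laid out the way I
understand "unzip -ql" to print it, not output captured from a real archive.

```r
## Hypothetical helper: parse listing lines by column position, taking the
## column boundaries from the first rule of dashes.
parse_unzip_listing <- function(res) {
  rules <- grep("^--", res)
  stopifnot(length(rules) >= 2L)
  m <- gregexpr("-+", res[rules[1L]])[[1L]]
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1L
  stopifnot(length(starts) == 4L)
  ## Entry lines sit strictly between the two rules (possibly none).
  body <- if (rules[2L] - rules[1L] > 1L) {
    res[(rules[1L] + 1L):(rules[2L] - 1L)]
  } else {
    character()
  }
  data.frame(
    Name = substring(body, starts[4L]),  # everything from the Name column on
    Length = as.numeric(substr(body, starts[1L], ends[1L])),
    Date = trimws(substr(body, starts[2L], ends[2L])),
    Time = trimws(substr(body, starts[3L], ends[3L])),
    stringsAsFactors = FALSE
  )
}

## Simulated listing (assumed layout, hypothetical file names).
sim <- c(
  "  Length      Date    Time    Name",
  "---------  ---------- -----   ----",
  sprintf("%9d  %s %s   %s", 5L, "2018-07-04", "21:50", "file one.txt"),
  sprintf("%9d  %s %s   %s", 12L, "2018-07-04", "21:51", "other.txt"),
  "---------                     -------",
  sprintf("%9d                     %s", 17L, "2 files")
)

z <- parse_unzip_listing(sim)
print(z)
```

Because substr() and substring() are vectorized, no sapply() is needed and
the row names do not have to be reset afterwards.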

I can submit a patch if this is appropriate. I'm really not sure, though,
because I am new to R-devel. Also, this has the downside of relying on the
behavior of Info-ZIP unzip, which might change in future versions and is
unlikely to be the same for other external unzip programs. On the other
hand, the current code also relies on the behavior of Info-ZIP unzip, and
it still fails in some cases.

Thanks,
Paul

P.S.

My sessionInfo is

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.1.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] devtools_1.13.5

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1    withr_2.1.2    memoise_1.1.0  digest_0.6.15

And unzip -v

UnZip 6.00 of 20 April 2009, by Info-ZIP.  Maintained by C. Spieler.  Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for
details.

Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.

Compiled with gcc 5.3.0 for Unix (Linux ELF) on Apr 17 2016.

UnZip special compilation options:
        ACORN_FTYPE_NFS
        COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
        SET_DIR_ATTRIB
        SYMLINKS (symbolic links supported, if RTL and file system permit)
        TIMESTAMP
        UNIXBACKUP
        USE_EF_UT_TIME
        USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
        USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
        UNICODE_SUPPORT [wide-chars, char coding: UTF-8] (handle UTF-8 paths)
        LARGE_FILE_SUPPORT (large files over 2 GiB supported)
        ZIP64_SUPPORT (archives using Zip64 for large files supported)
        USE_BZIP2 (PKZIP 4.6+, using bzip2 lib version 1.0.6, 6-Sept-2010)
        VMS_TEXT_CONV
        WILD_STOP_AT_DIR
        [decryption, version 2.11 of 05 Jan 2007]

UnZip and ZipInfo environment options:
           UNZIP:  [none]
        UNZIPOPT:  [none]
         ZIPINFO:  [none]
      ZIPINFOOPT:  [none]

