[Rd] seek() and gzfile() on 32-bit R2.12.0 in linux

Matt Shotwell shotwelm at musc.edu
Tue Jun 22 20:44:51 CEST 2010


You used file() to open "ex.gz", which ought to work, but it relies on
do_url to detect automatically that the file is gzip-compressed. It's a
long shot, but you could verify that the file is a valid gzip file (R
checks that the first two bytes are "\x1f\x8b") and then try gzfile() on
the 32-bit machine to see what happens. It would also help to see the
output of sessionInfo() so that others can try to reproduce your finding.
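
A minimal sketch of that check, in case it helps. The variable names and
the temporary path are my own choices for illustration; point path at
your own "ex.gz" to test the real file.

```r
# Hypothetical location for illustration; replace with the path to your own ex.gz.
path <- file.path(tempdir(), "ex.gz")

# Create the file if it is missing so that the sketch runs on its own.
if (!file.exists(path)) {
    zz <- gzfile(path, "w")
    cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
    close(zz)
}

# Read the first two bytes as raw data (no decompression involved).
magic <- readBin(path, what = "raw", n = 2L)
is_gzip <- identical(magic, as.raw(c(0x1f, 0x8b)))
print(is_gzip)

# Open the file explicitly as gzip and ask for the current position.
gz_con <- gzfile(path, "r")
pos <- seek(gz_con)
close(gz_con)
print(pos)
```

If is_gzip is TRUE and seek() on the gzfile() connection still reports a
huge value, automatic detection in file() is probably not the culprit.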

This might be a bug in the R source. Two possibilities:
(1 - unlikely) The C function do_url (src/main/connections.c) fails to
detect the gzip file on the 32-bit machine. Unfortunately, even when
do_url does detect a gzip file, the class of the returned connection
object is still c("file", "connection") rather than c("gzfile",
"connection"), so there's no easy check for this. In any case, this
would not explain why you get 7.80707e+17.

(2 - more likely) The zlib function gztell (declared in
src/extra/zlib/zlib.h, defined in src/extra/zlib/gzlib.c) returns
z_off_t. The bug may relate to the size of z_off_t on the two machines
and/or to the cast from z_off_t to double, which happens just before
gzfile_seek (defined in src/main/connections.c) returns the value. What
a headache. I would need to reproduce the bug to investigate this
further.
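
A rough arithmetic check on the reported number, using nothing beyond the
value printed in the report. It shows that the value lies far outside the
range of a signed 32-bit integer. That would be consistent with a 64-bit
quantity carrying stray high-order bits into the cast to double, but it
does not prove it.

```r
# Value printed by seek() on the 32-bit build (rounded to 6 significant digits).
reported <- 7.80707e+17

# Largest value a signed 32-bit integer type can hold.
max_int32 <- 2^31 - 1
print(reported > max_int32)

# Upper 32 bits of the value viewed as a 64-bit integer. The printed value
# is rounded, so the low-order bits cannot be recovered from it.
high_word <- floor(reported / 2^32)
print(high_word > 0)
```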

I have been wondering why double, rather than an integer type, was used
in the prototype for the seek member of struct Rconn. Presumably it was
meant to avoid problems such as this one. I'll be very interested to see
what the core team has to say here.
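
One property of double that may be relevant, though whether it motivated
the prototype is only my guess: a double represents every integer exactly
up to 2^53, well beyond what R's 32-bit integer type can hold. A small
illustration:

```r
# R's integer type is 32-bit.
print(.Machine$integer.max)   # 2147483647

# Doubles represent every integer exactly up to 2^53.
big <- 2^53
print(big - 1 == big)   # FALSE: still exact below 2^53
print(big + 1 == big)   # TRUE: precision runs out above 2^53
```

So a file offset stored in a double stays exact for any file smaller
than 2^53 bytes.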

-Matt

On Tue, 2010-06-22 at 13:04 -0400, Brandon Whitcher wrote:
> I have installed both 32-bit and 64-bit versions of R2.12.0 (2010-06-15
> r52300) on my Ubuntu 10.04 64-bit system.  I observe the following behavior
> when running the examples from base::connections.  There appears to be a
> problem with seek() on a .gz file when using a 32-bit installation of
> R2.12.0, but the problem doesn't appear in the 64-bit installation.  I
> realize that seek() has been difficult in the past, and I don't want to open
> old wounds, but is this a known problem?  Is this easily fixable?  I have a
> package that relies on seek() when accessing gzipped files.
> 
> Using the 32-bit installation...
> 
> > zz <- file("ex.data", "w")  # open an output file connection
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > cat("One more line\n", file = zz)
> > close(zz)
> > blah = file("ex.data", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- gzfile("ex.gz", "w")  # compressed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.gz", "r")
> > seek(blah)
> [1] 7.80707e+17
> >
> > zz <- bzfile("ex.bz2", "w")  # bzip2-ed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.bz2", "r")
> > seek(blah)
> Error in seek.connection(blah) : 'seek' not enabled for this connection
> >
> 
> Using the 64-bit installation...
> 
> > zz <- file("ex.data", "w")  # open an output file connection
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > cat("One more line\n", file = zz)
> > close(zz)
> > blah = file("ex.data", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- gzfile("ex.gz", "w")  # compressed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.gz", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- bzfile("ex.bz2", "w")  # bzip2-ed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.bz2", "r")
> > seek(blah)
> Error in seek.connection(blah) : 'seek' not enabled for this connection
> >
> 
> thanks,
> 
> Brandon
> 
-- 
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com


