[Rd] seek() and gzfile() on 32-bit R2.12.0 in linux

Matt Shotwell shotwelm at musc.edu
Wed Jun 23 05:41:24 CEST 2010


I was able to reproduce this bug. After some investigation, the symptom
is clearly localized to gztell (a zlib function) and the z_off_t type.
However, there may be a broader cross-compiling problem. I don't know
what procedure Brandon used to compile the 32-bit version (I used the
gcc -m32 flag), but we should be sure that we're doing this correctly
(and document it!) before going on a goose chase. The real issue may not
lie in zlib at all, and may only manifest there. A discussion of my
findings is below.

-Matt

I checked that R's file function was recognizing the gzip file as such,
so that's not the problem. I next modified some code in gzfile_seek,
just above and below the call to gztell (line 1230 of connections.c),
and defined a small function, z_off_t_print, to print the bits of the
z_off_t offset, least significant bit first:

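/* Print the bits of u, least significant bit first. The loop stops once
   the mask is shifted into the sign bit, so 63 bits are printed. */
static void z_off_t_print(z_off_t u)
{
    z_off_t mask = 1;
    while( mask > 0 ) {
        printf("%u", (mask & u) > 0 );
        mask <<= 1;
    }
    printf("\n");
}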

static double gzfile_seek(Rconnection con, double where, int origin, int rw)
{
    gzFile  fp = ((Rgzfileconn)(con->private))->fp;

    /** begin modified code **/
    z_off_t pos = 0;  /* initialized so the "before" print is well defined */
    printf("sizeof(z_off_t): %u\n", (unsigned) sizeof(z_off_t));
    printf("sizeof(double): %u\n", (unsigned) sizeof(double));
    printf("before gztell():\n");
    z_off_t_print(pos);
    pos = gztell(fp);
    printf("after gztell():\n");
    z_off_t_print(pos);
    printf("(double) pos: %f\n", (double) pos);

    /** end modified code **/
    ...

Here's what happens running code similar to yours in the 32 bit build:

> zz <- gzfile("ex.gz", "w")  # compressed file
> cat("TITLE extra line", "2 3 5 7", 
+     "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.gz", "r")
> seek(blah, 5)
sizeof(z_off_t): 8
sizeof(double): 8
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
000000000000000000000000000000000000110000111011110111001001000
(double) pos: 665367468683821056.000000
[1] 6.653675e+17
> seek(blah)
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
101000000000000000000000000000000000110000111011110111001001000
(double) pos: 665367468683821056.000000
[1] 6.653675e+17

Hence, gztell is doing what we expect in the least significant 32 bits
(after the second call they read 101, least significant bit first, which
is decimal 5), but returns junk in the most significant 32 bits. Here
are the results for the 64 bit build:

> zz <- gzfile("ex.gz", "w")  # compressed file
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.gz", "r")
> seek(blah, 5)
sizeof(z_off_t): 8
sizeof(double): 8
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
000000000000000000000000000000000000000000000000000000000000000
(double) pos: 0.000000
[1] 0
> seek(blah)
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
101000000000000000000000000000000000000000000000000000000000000
(double) pos: 5.000000
[1] 5

No problems with the 64 bit build.

On Tue, 2010-06-22 at 13:04 -0400, Brandon Whitcher wrote:
> I have installed both 32-bit and 64-bit versions of R2.12.0 (2010-06-15
> r52300) on my Ubuntu 10.04 64-bit system.  I observe the following behavior
> when running the examples from base::connections.  There appears to be a
> problem with seek() on a .gz file when using a 32-bit installation of
> R2.12.0, but the problem doesn't appear in the 64-bit installation.  I
> realize that seek() has been difficult in the past, and I don't want to open
> old wounds, but is this a known problem?  Is this easily fixable?  I have a
> package that relies on seek() when accessing gzipped files.
> 
> Using the 32-bit installation...
> 
> > zz <- file("ex.data", "w")  # open an output file connection
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > cat("One more line\n", file = zz)
> > close(zz)
> > blah = file("ex.data", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- gzfile("ex.gz", "w")  # compressed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.gz", "r")
> > seek(blah)
> [1] 7.80707e+17
> >
> > zz <- bzfile("ex.bz2", "w")  # bzip2-ed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.bz2", "r")
> > seek(blah)
> Error in seek.connection(blah) : 'seek' not enabled for this connection
> >
> 
> Using the 64-bit installation...
> 
> > zz <- file("ex.data", "w")  # open an output file connection
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > cat("One more line\n", file = zz)
> > close(zz)
> > blah = file("ex.data", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- gzfile("ex.gz", "w")  # compressed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.gz", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- bzfile("ex.bz2", "w")  # bzip2-ed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.bz2", "r")
> > seek(blah)
> Error in seek.connection(blah) : 'seek' not enabled for this connection
> >
> 
> thanks,
> 
> Brandon
-- 
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com


