[Rd] 2.3.1: interacting bugs in load() and gzfile() (PR#9271)
jbrzusto at fastmail.fm
jbrzusto at fastmail.fm
Mon Oct 2 20:09:33 CEST 2006
Hello,
If repeated calls are made to save() using the same pre-opened gzfile
connection to a file, and then the connection is closed, the objects
saved by the second and subsequent calls are not correctly restored by
repeated calls to load() with a new gzfile connection to the same file.
What follows are a session exposing the bugs, analysis (see ANALYSIS),
patches (see PATCHES), and a session showing correct behaviour after
patching (see FIXED). The problematic code is still present in the
trunk of the svn repository.
This is really due to two minor bugs, but because they interact,
I'm submitting this as one report.
Regards,
John Brzustowski
================================================================================
THE BUGS (with comments inserted)
> sessionInfo()
Version 2.3.1 (2006-06-01)
i686-pc-linux-gnu
attached base packages:
[1] "methods" "stats" "graphics" "grDevices" "utils" "datasets"
[7] "base"
## create two objects and save them to a gzfile:
> x <- 1:10
> y <- x^2
> f <- gzfile("tmp.dat", "wb")
> save(x, file=f)
## notice where the second object will be written:
> seek(f, NA)
[1] 88
> save(y, file=f)
> close(f)
## open the file and try to load the two objects, one at a time
> f <- gzfile("tmp.dat", "rb")
> e <- new.env()
> load(f,e)
> ls(e)
[1] "x"
## x was loaded correctly, but the connection has been closed:
> f
description class mode text
"gzcon(tmp.dat)" "gzcon" "rb" "binary"
opened can read can write
"closed" "yes" "no"
## <interaction> (see ANALYSIS)
## Try again:
> load(f,e)
Warning message:
this is already a gzcon connection in: gzcon(con, level, allowNonCompressed)
> ls(e)
[1] "x"
## why the warning?
## and why was nothing apparently read this time?
> f
description class mode text
"gzcon(tmp.dat)" "gzcon" "rb" "binary"
opened can read can write
"opened" "yes" "no"
> load(f,e)
Warning message:
this is already a gzcon connection in: gzcon(con, level, allowNonCompressed)
> ls(e)
[1] "x" "y"
## this time, y was read correctly, but again why the warning?
> f
description class mode text
"gzcon(tmp.dat)" "gzcon" "rb" "binary"
opened can read can write
"closed" "yes" "no"
## </interaction>
## try again, but first seek on the connection to the location of the second object
> close(f)
> f <- gzfile("tmp.dat", "rb")
> e <- new.env()
> seek(f,88)
[1] 0
> load(f,e)
> ls(e)
[1] "x"
## this should have loaded the object "y", which according to the earlier call to seek()
## begins at file offset 88.
========================================================================================
ANALYSIS
There are two minor bugs here:
1. The internal function loadFromConnection() incorrectly closes a
connection that was already open when it was called, instead of
closing a connection that was *not* already open.
2. gzfile() sets a class of its return value to "file", rather than
"gzfile":
> f <- gzfile("whatever.gz", "wb")
> class(f)
[1] "file" "connection"
In contrast, bzfile() exhibits correct behaviour:
> f2<-bzfile("more.bz2", "wb")
> class(f2)
[1] "bzfile" "connection"
This prevents load() from noticing that the connection passed to it is already
a gzfile, since it tests for inheritance from class "gzfile". Therefore,
load() always re-opens the gzfile connection with gzcon(), which causes
reading to proceed from the beginning of the file, regardless of any seek() on
the original gzfile connection. This reopening of a gzfile with gzcon() causes
the warning, and prevents the seek() before the load() from having any effect.
Interaction: because (with bug 2) load() always reopens a gzfile
connection, bug 1's effect is to not close that reopened connection.
Ironically, this allows (with warnings) objects from multiple calls to save() to be
restored by multiple calls to load(), as demonstrated by the
<interaction></interaction> section in the session above, provided load() is being
called with a filename or closed connection rather than an open gzfile
connection.
================================================================================
PATCHES:
for bug 1:
diff -u R-2.3.1.original/src/main/saveload.c R-2.3.1/src/main/saveload.c
--- R-2.3.1.original/src/main/saveload.c 2006-04-09 18:19:51.000000000 -0400
+++ R-2.3.1/src/main/saveload.c 2006-10-02 13:10:21.000000000 -0400
@@ -2301,6 +2301,7 @@
PROTECT(res = RestoreToEnv(R_Unserialize(&in), aenv));
if (wasopen) {
endcontext(&cntxt);
+ } else {
con->close(con);
}
UNPROTECT(1);
for bug 2:
diff -u R-2.3.1.original/src/main/connections.c R-2.3.1/src/main/connections.c
--- R-2.3.1.original/src/main/connections.c 2006-05-18 05:13:54.000000000 -0400
+++ R-2.3.1/src/main/connections.c 2006-10-02 11:53:10.000000000 -0400
@@ -1172,7 +1172,7 @@
PROTECT(ans = allocVector(INTSXP, 1));
INTEGER(ans)[0] = ncon;
PROTECT(class = allocVector(STRSXP, 2));
- SET_STRING_ELT(class, 0, mkChar("file"));
+ SET_STRING_ELT(class, 0, mkChar("gzfile"));
SET_STRING_ELT(class, 1, mkChar("connection"));
classgets(ans, class);
UNPROTECT(2);
================================================================================
FIXED:
Here's how the patched version operates:
## create two objects and serialize them to a gzfile
> x <- 1:10
> y <- x^2
> f <- gzfile("tmp.dat", "wb")
> save(x, file=f)
> seek(f, NA)
[1] 88
> save(y, file=f)
> close(f)
## re-open the gzfile
> f <- gzfile("tmp.dat", "rb")
## seek to the second object and load it
> seek(f,88)
[1] 0
> e <- new.env()
> load(f,e)
> ls(e)
[1] "y"
> eval(quote(y), e)
[1] 1 4 9 16 25 36 49 64 81 100
## seek to the first object and load it
> seek(f,0)
[1] 216
> load(f,e)
> ls(e)
[1] "x" "y"
> eval(quote(x), e)
[1] 1 2 3 4 5 6 7 8 9 10
>
More information about the R-devel
mailing list