[Rd] getting corrupted data when using readBin() after seek() on a gzfile connection
Henrik Bengtsson
hb at biostat.ucsf.edu
Wed May 8 19:54:00 CEST 2013
I can reproduce this (exactly the same output) on Windows:
> sessionInfo()
R version 3.0.0 Patched (2013-04-29 r62694)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_3.0.0
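
Until this is sorted out, one way to avoid seek() on a gzfile connection altogether is to reopen the file and discard bytes sequentially until the wanted offset is reached, since plain sequential reads return the right data in the example quoted below. A minimal sketch, assuming the file is small enough that re-reading from the start is acceptable; read_gz_block() and its 'chunk' argument are hypothetical names used for illustration, not part of base R:

```r
## Hypothetical helper (not part of base R): read 'n' bytes starting at byte
## 'offset' of the uncompressed stream, without ever calling seek().
read_gz_block <- function(path, offset, n, chunk = 65536L) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con))
  remaining <- offset
  ## Skip forward by reading and discarding up to 'chunk' bytes at a time.
  while (remaining > 0) {
    skipped <- readBin(con, "raw", n = min(remaining, chunk))
    if (length(skipped) == 0L) break  # end of data reached before 'offset'
    remaining <- remaining - length(skipped)
  }
  readBin(con, "raw", n = n)
}

## Same test data as in the quoted example, written to a temporary file.
mydata <- rep(charToRaw(paste(letters, collapse = "")), 400)
path <- tempfile(fileext = ".gz")
con <- gzfile(path, open = "wb")
writeBin(mydata, con)
close(con)

## Read 26-byte blocks at the offsets that were read via seek() in the report.
offsets <- c(0, 78, 520, 2600)
blocks <- vapply(offsets,
                 function(off) rawToChar(read_gz_block(path, off, 26L)),
                 character(1))
```

With that test data, each element of 'blocks' should be the original "abc...xyz" pattern. The cost grows linearly with the offset, so this is a stopgap rather than a replacement for a working seek().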
/Henrik
On Wed, May 8, 2013 at 1:51 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi,
>
> I'm running into more issues when reading data from a gzfile connection.
> If I read the data sequentially with successive calls to readBin(), the
> data I get looks ok. But if I call seek() between the successive calls
> to readBin(), I get corrupted data.
>
> Here is a (hopefully) reproducible example. See my sessionInfo() at the
> end (I'm not on Windows, where, according to the man page, seek() is
> broken).
>
> ## Generate data with a repeated easy-to-recognize byte pattern
> ## of length 26:
> mydata <- rep(charToRaw(paste(letters, collapse="")), 400)
>
> ## Write the data to test.gz file:
> con <- gzfile("test.gz", open="wb")
> writeBin(mydata, con)
> close(con)
>
> ## Read the data from test.gz file. We'll read blocks of 26 bytes
> ## located at various offsets that are multiples of 26, so we expect
> ## to see our original pattern ("abc...xyz").
> con <- gzfile("test.gz", open="rb")
>
> ## Offset 0: ok
> > rawToChar(readBin(con, "raw", n=26))
> [1] "abcdefghijklmnopqrstuvwxyz"
>
> ## Offset 78: still ok
> > seek(con, where=78)
> [1] 26
> > seek(con)
> [1] 78
> > rawToChar(readBin(con, "raw", n=26))
> [1] "abcdefghijklmnopqrstuvwxyz"
>
> ## Offset 520: data is messed up
> > seek(con, where=520)
> [1] 104
> > seek(con)
> [1] 520
> > rawToChar(readBin(con, "raw", n=26))
> [1] "zabcdefghijklmnopqrstuvvuv"
>
>
> ## Offset 2600: very messed up
> > seek(con, where=2600)
> [1] 546
> > seek(con)
> [1] 2600
> > rawToChar(readBin(con, "raw", n=26))
> [1] "xxxxxmpxxxxxxesxxxxxxxxxxp"
>
> ## Offset 10400: see previous email (subject: "error when calling
> ## seek() twice on a gzfile connection")
> > seek(con, where=10400)
> [1] 2626
> Warning message:
> In seek.connection(con, where = 10400) :
> seek on a gzfile connection returned an internal error
>
> close(con)
>
> Reading the data sequentially with no calls to seek() returns the
> expected pattern 400 times:
>
> con <- gzfile("test.gz", open="rb")
> blocks <- sapply(1:400, function(i) rawToChar(readBin(con, "raw", n=26)))
>
> ## Check the result:
>
> > readBin(con, "raw", n=26) # no more data
> raw(0)
>
> > seek(con)
> [1] 10400
>
> > table(blocks)
> blocks
> abcdefghijklmnopqrstuvwxyz
> 400
>
> Thanks,
> H.
>
> > sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel