[Rd] gzfile & read.table on Win32
Henrik Bengtsson
hb at maths.lth.se
Tue Mar 16 22:20:44 MET 2004
Hi, I ran into the same problem some time ago, but I still haven't
had time to troubleshoot it properly. However, I found that it
has to do with newlines at the end of the files. Here's an example
that might give some initial clues:
# Creating two example files:
cat("1 2\n3 4\n5 6\n7 8\n9 10\n11 12\n", file="tableBad.txt")
cat("1 2\n3 4\n5 6\n7 8\n9 10\n11 12", file="tableOk.txt")
# A first simple example
df1 <- read.table("tableOk.txt")
df2 <- read.table("tableBad.txt")
if (!identical(df1,df2)) cat("df1 != df2\n")
# Then...
df3 <- read.table(gzfile("tableOk.txt"))
if (!identical(df1,df3)) cat("df1 != df3\n")
# Gives: df1 != df3
# and...
df4 <- read.table(gzfile("tableBad.txt"))
# Warning message:
# number of items read is not a multiple of the number of columns
if (!identical(df1,df4)) cat("df1 != df4\n")
# Gives: df1 != df4
if (!identical(df3,df4)) cat("df3 != df4\n")
# Gives: df3 != df4
# Details:
str(df1)
# `data.frame': 6 obs. of 2 variables:
# $ V1: int 1 3 5 7 9 11
# $ V2: int 2 4 6 8 10 12
str(df3)
# `data.frame': 6 obs. of 2 variables:
# $ V1: int 1 3 5 7 9 11
# $ V2: Factor w/ 6 levels "10","12 ","2",..: 3 4 5 6 1 2
as.character(df3$V2)
# [1] "2" "4" "6" "8" "10" "12 " # Note the " "
str(df4)
# `data.frame': 7 obs. of 2 variables:
# $ V1: Factor w/ 7 levels "1","11","3","5",..: 1 3 4 5 6 2 7
# $ V2: int 2 4 6 8 10 12 NA
as.character(df4$V1)
# [1] "1" "3" "5" "7" "9" "11" " " # Note the " "
as.character(df4$V2)
# [1] "2" "4" "6" "8" "10" "12" NA
# Note that the " " is not a space, but the byte
# i) Sun Solaris 8: 24/0x18/030
identical(as.character(df4$V1[7]), "\030")
# ii) WinXP: 255/0xFF/0377 (outside the ASCII range)
identical(as.character(df4$V1[7]), "\377")
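
Since the difference between the two files is only the trailing newline, it can help to look at the raw bytes directly rather than at what read.table() returns. The following is a minimal sketch, not part of the tests above: `ends_with_newline` is a hypothetical helper name, the file names are temporary files made up for illustration, and it assumes an R version that has the raw type (readBin/charToRaw).

```r
# Hypothetical helper: TRUE if the last byte of the file is a line feed (0x0a).
ends_with_newline <- function(path) {
  size <- file.info(path)$size
  if (is.na(size) || size == 0) return(FALSE)
  con <- file(path, open = "rb")
  on.exit(close(con))
  bytes <- readBin(con, what = "raw", n = size)
  bytes[length(bytes)] == as.raw(0x0a)
}

# Two illustrative files, one with and one without a final newline.
f_with    <- tempfile(fileext = ".txt")
f_without <- tempfile(fileext = ".txt")
cat("1 2\n3 4\n", file = f_with)
cat("1 2\n3 4",   file = f_without)

print(ends_with_newline(f_with))
print(ends_with_newline(f_without))

# A stray character can be inspected as a byte code in the same spirit:
print(charToRaw("\030"))
```

The same helper can be pointed at tableOk.txt and tableBad.txt to confirm which of them ends in a newline before handing them to gzfile().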
This was done on:
R v1.8.1 & R v1.9.0alpha on WinXP, and
R v1.8.1 on Sun Solaris 8
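
A possible workaround, given Jeff's observation (quoted below) that readLines() on the gzfile connection works as expected, is to read the lines first and then parse them from a text connection. This is only a sketch under that assumption: I have not verified that it avoids the problem on the affected versions, and the temporary file and its contents are made up for illustration.

```r
# Write a small gzip-compressed table to a temporary file (illustrative data).
path <- tempfile(fileext = ".gz")
con <- gzfile(path, open = "wt")
cat("1 2\n3 4\n5 6\n7 8\n9 10\n11 12\n", file = con)
close(con)

# Step 1: read the decompressed lines through the gzfile connection.
con <- gzfile(path, open = "rt")
txt <- readLines(con)
close(con)

# Step 2: parse the lines from an in-memory text connection instead of
# passing the gzfile connection to read.table() directly.
tc <- textConnection(txt)
df <- read.table(tc)
close(tc)

str(df)
```

The design choice here is simply to keep the compressed connection away from read.table(), at the cost of holding all lines in memory first.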
Now back to my other work ;)
Cheers
Henrik Bengtsson
> -----Original Message-----
> From: r-devel-bounces at stat.math.ethz.ch
> [mailto:r-devel-bounces at stat.math.ethz.ch] On Behalf Of Jeff Gentry
> Sent: den 15 mars 2004 22:17
> To: r-devel at stat.math.ethz.ch
> Subject: [Rd] gzfile & read.table on Win32
>
>
> Hello ...
>
> Are there any known problems or even gotchas to look out for
> when using a gzfile connection in read.csv/read.table in Windows?
>
> In the package PROcess, available at
> www.bioconductor.org/repository/devel/package/html/PROcess.html
> there are two files in the PROcess/inst/Test directory which
> are of the extension *.csv.gz.
>
> With both files, if I open up a gzfile connection, say:
> vv <- gzfile("122402imac40-s-c-192combined i11.csv.gz")
> I can then do:
> readLines(vv, n=10)
>
> And it works as expected. However, if I do this:
>
> read.csv(vv)
>
> I get a warning:
> Warning: incomplete final line found by readTableHeader on
> `c:/repository/checks/PROcess.Rcheck/PROcess/Test/122402imac40
> -s-c-192combined
> i11.csv.gz'
>
> and the results of the read.table are completely broken
> (basically it returns a 0-row matrix with one column, named
> after the first column in the csv file). Furthermore,
> the connection variable itself seems to get mangled in the
> process: if I type the variable name (e.g. 'vv' from above), I get:
> > vv
> Error in summary.connection(x) : invalid connection
>
> Note that if I manually gunzip the file and then do a
> 'read.csv' in R, everything works properly - so it doesn't
> appear to be the actual file itself, but somehow related to
> reading it in as a compressed file.
>
> This is showing up both on R-1.8.1 and R-devel (admittedly a
> bit out of date, currently using 2004-03-08 and am trying to
> update on Windows now).
>
> Thanks
> -J
>
> ______________________________________________
> R-devel at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
>
>