[R] Truncated file upon reading a text file with 0xff characters

Jean-Claude Arbaut arbautjc at gmail.com
Tue Mar 15 22:00:03 CET 2016


Thank you for the answer. I was about to ask why I should avoid text
connections, but I just noticed that with a binary connection for the
read, the problem disappears (that is, I replace "rt" with "rb" in the
file open).
R is even clever enough that, when fed the latin1 file after
options(encoding="UTF-8") and with no encoding argument to readLines,
it correctly returns a string with encoding "unknown" and byte 0xff in
the raw representation (I would have expected at least a warning, but
it seems to silently read invalid UTF-8 bytes as plain raw bytes).
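To make that concrete, here is a minimal sketch of the binary-connection read (it rebuilds the latin1 test file from the message quoted below; the behaviour is what I observed, not a documented guarantee):

```r
# Build the latin1 test file byte-for-byte: "A", "B\xffC", "D" with CRLF endings.
f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size = 1)
close(f)

options(encoding = "UTF-8")  # the setting that breaks the "rt" read

# With a binary ("rb") connection the 0xFF byte passes through untouched,
# and the result is marked with encoding "unknown".
f <- file("test.txt", "rb")
lines <- readLines(f)
close(f)
Encoding(lines)      # "unknown"
charToRaw(lines[2])  # 42 ff 43 - the raw latin1 bytes survive
```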

Thus the text connection does something extra that causes the problem.
Maybe it tries to translate the characters twice?

And the problem remains with read.table. Not surprising: inspecting
the source, I see it uses open(file, "rt").
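A possible workaround for read.table, then, is to do the binary read ourselves and hand the already-read text to read.table(text = ...) - just a sketch, not a tested general fix:

```r
# Recreate the latin1 test file from the examples below.
f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size = 1)
close(f)

# Read through "rb" so the 0xFF byte survives, mark the encoding, then parse
# the in-memory text instead of letting read.table open an "rt" connection.
f <- file("test.txt", "rb")
lines <- readLines(f)
close(f)
Encoding(lines) <- "latin1"
a <- read.table(text = lines, stringsAsFactors = FALSE)
a$V1  # "A" "BÿC" "D"
```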

Jean-Claude Arbaut


2016-03-15 21:24 GMT+01:00 Duncan Murdoch <murdoch.duncan at gmail.com>:
> I think you've identified a bug (or more than one) here, but your message is
> so long, I haven't had time to go through it all.  I'd suggest that you
> write up a shorter version for the bug list.  The shorter version would
>
> 1.  Write the latin1 file using writeBin.
> 2.  Set options(encoding = "") and read it without error.
> 3.  Set options(encoding = "UTF-8") and get an error even if you explicitly
> set encoding when reading.
> 4.  Set options(encoding = "latin1") and also get an error with or without
> explicitly setting the encoding.
>
> I would limit the tests to readLines; read.table is much more complicated
> and isn't necessary to illustrate the problem. Bringing it into the
> discussion just confuses things.
>
> You should also avoid bringing text mode connections into the discussion
> unless they are necessary.
>
> Duncan Murdoch
>
>
> On 15/03/2016 3:05 PM, Jean-Claude Arbaut wrote:
>>
>> Hello R users,
>>
>> I am having problems reading a CSV file that contains names with the
>> character ÿ. In case it doesn't print correctly, it is Unicode character
>> 00FF, LATIN SMALL LETTER Y WITH DIAERESIS.
>> My computer has Windows 7 and R 3.2.4.
>>
>> Initially I set options(encoding="UTF-8") in my .Rprofile, since I
>> prefer this encoding for portability; a good, modern standard, I thought.
>> Rather than sending a large file, here is how to reproduce my problem:
>>
>>    options(encoding="UTF-8")
>>
>>    f <- file("test.txt", "wb")
>>    writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size=1)
>>    close(f)
>>    read.table("test.txt", encoding="latin1")
>>    f <- file("test.txt", "rt")
>>    readLines(f, encoding="latin1")
>>    close(f)
>>
>> I write a file with three lines, in binary to avoid any translation:
>> A
>> B\xffC
>> D
>>
>> Upon reading I get only:
>>
>>    > read.table("test.txt", encoding="latin1")
>>      V1
>>    1  A
>>    2  B
>>    Warning messages:
>>    1: In read.table("test.txt", encoding = "latin1") :
>>      invalid input found on input connection 'test.txt'
>>    2: In read.table("test.txt", encoding = "latin1") :
>>      incomplete final line found by readTableHeader on 'test.txt'
>>    > readLines(f, encoding="latin1")
>>    [1] "A" "B"
>>    Warning messages:
>>    1: In readLines(f, encoding = "latin1") :
>>      invalid input found on input connection 'test.txt'
>>    2: In readLines(f, encoding = "latin1") :
>>      incomplete final line found on 'test.txt'
>>
>> Hence the file is truncated. However, character \xff is a valid latin1
>> character,
>> as one can check for instance at
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>> I tried with a UTF-8 version of this file:
>>
>>    f <- file("test.txt", "wb")
>>    writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13, 10)), f, size=1)
>>    close(f)
>>    read.table("test.txt", encoding="UTF-8")
>>    f <- file("test.txt", "rt")
>>    readLines(f, encoding="UTF-8")
>>    close(f)
>>
>> Since this character ÿ is encoded as the two bytes 195, 191 in UTF-8,
>> I would expect to get my complete file. But I don't. Instead, I get:
>>
>>    > read.table("test.txt", encoding="UTF-8")
>>      V1
>>    1  A
>>    2  B
>>    3  C
>>    4  D
>>    Warning message:
>>    In read.table("test.txt", encoding = "UTF-8") :
>>      incomplete final line found by readTableHeader on 'test.txt'
>>
>>    > readLines(f, encoding="UTF-8")
>>    [1] "A" "B"
>>    Warning message:
>>    In readLines(f, encoding = "UTF-8") :
>>      incomplete final line found on 'test.txt'
>>
>> I repeated all of the preceding with options(encoding="latin1") at the
>> beginning. For the first attempt, with byte 255, I get:
>>
>>    > read.table("test.txt", encoding="latin1")
>>      V1
>>    1  A
>>    2  B
>>    3  C
>>    4  D
>>    Warning message:
>>    In read.table("test.txt", encoding = "latin1") :
>>      incomplete final line found by readTableHeader on 'test.txt'
>>    >
>>    > f <- file("test.txt", "rt")
>>    > readLines(f, encoding="latin1")
>>
>> For the other attempt, with 195, 191:
>>
>>    > read.table("test.txt", encoding="UTF-8")
>>       V1
>>    1   A
>>    2 BÿC
>>    3   D
>>    >
>>    > f <- file("test.txt", "rt")
>>    > readLines(f, encoding="UTF-8")
>>    [1] "A"   "BÿC" "D"
>>    > close(f)
>>
>> Thus the second one does indeed work, it seems. Just a check:
>>
>>    > a <- read.table("test.txt", encoding="UTF-8")
>>    > Encoding(a$V1)
>>    [1] "unknown" "UTF-8"   "unknown"
>>
>> Finally, I figured out that with R's default encoding, both attempts
>> work, whether or not the encoding is given as a parameter to read.table
>> or readLines. However, I don't understand what happens:
>>
>>    f <- file("test.txt", "wb")
>>    writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size=1)
>>    close(f)
>>    a <- read.table("test.txt", encoding="latin1")$V1
>>    Encoding(a)
>>    iconv(a[2], toRaw=T)
>>    a
>>    a <- read.table("test.txt")$V1
>>    Encoding(a)
>>    iconv(a[2], toRaw=T)
>>    a
>>
>> This will yield:
>>
>>    > a <- read.table("test.txt", encoding="latin1")$V1
>>    > Encoding(a)
>>    [1] "unknown" "latin1"  "unknown"
>>    > iconv(a[2], toRaw=T)
>>    [[1]]
>>    [1] 42 ff 43
>>    > a
>>    [1] "A"   "BÿC" "D"
>>    >
>>    > a <- read.table("test.txt")$V1
>>    > Encoding(a)
>>    [1] "unknown" "unknown" "unknown"
>>    > iconv(a[2], toRaw=T)
>>    [[1]]
>>    [1] 42 ff 43
>>    > a
>>    [1] "A"   "BÿC" "D"
>>
>> The second line is correctly encoded, but the encoding is just not
>> "marked" in one case.
>> With the UTF-8 bytes:
>>
>>    f <- file("test.txt", "wb")
>>    writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13, 10)), f, size=1)
>>    close(f)
>>    a <- read.table("test.txt", encoding="UTF-8")$V1
>>    Encoding(a)
>>    iconv(a[2], toRaw=T)
>>    a
>>    a <- read.table("test.txt")$V1
>>    Encoding(a)
>>    iconv(a[2], toRaw=T)
>>    a
>>
>> This will yield:
>>
>> > a <- read.table("test.txt", encoding="UTF-8")$V1
>> > Encoding(a)
>> [1] "unknown" "UTF-8"   "unknown"
>> > iconv(a[2], toRaw=T)
>> [[1]]
>> [1] 42 c3 bf 43
>> > a
>> [1] "A"   "BÿC" "D"
>> > a <- read.table("test.txt")$V1
>> > Encoding(a)
>> [1] "unknown" "unknown" "unknown"
>> > iconv(a[2], toRaw=T)
>> [[1]]
>> [1] 42 c3 bf 43
>> > a
>> [1] "A"    "BÿC" "D"
>>
>> Both are correctly read (the raw bytes are ok), but the second one doesn't
>> print
>> correctly because the encoding is not "marked".
>>
>> My thoughts:
>> With options(encoding="native.enc"), the characters are not translated:
>> they are read as raw bytes, which can then be given an encoding mark so
>> that they print correctly (otherwise they print as native, which is
>> mostly latin1 here).
>> With options(encoding="latin1") and reading the UTF-8 file, I guess it
>> is much the same: the characters are read as raw bytes and marked as
>> UTF-8, which works.
>> With options(encoding="latin1") and reading the latin1 file (with the
>> 0xFF byte), I don't understand what happens. The file gets truncated
>> almost as if 0xFF were an EOF character - which rings a bell, since in
>> C, code that stores the int returned by getc() into a char can
>> (wrongly) confuse byte 0xFF with EOF (-1).
>> And with options(encoding="UTF-8"), I am not sure what happens.
>>
>> Questions:
>> * What's wrong with options(encoding="latin1")?
>> * Is it unsafe to use an options(encoding) other than the default
>>   native.enc on Windows?
>> * Is it safe to assume that with native.enc R reads raw characters and,
>>   only when requested, marks an encoding afterwards? (That is, I get
>>   "unknown" by default, which is printed as latin1 on Windows, and if I
>>   enforce another encoding, it will be used whatever the bytes really
>>   are.)
>> * What really happens with another options(encoding), especially UTF-8?
>> * If I save a character variable to an Rdata file, is the file usable
>>   on another OS, or on the same one with another default encoding
>>   (changed via options())? Does it depend on whether the character
>>   string has an "unknown" encoding or an explicit one?
>> * Is there a way (preferably an options() setting) to tell R to read
>>   text files as UTF-8 by default? Would it work with any of
>>   read.table(), readLines(), or even source()? I thought
>>   options(encoding="UTF-8") would do, but it fails on the examples
>>   above.
>>
>> Best regards,
>>
>> Jean-Claude Arbaut
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
