[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
Tomas Kalibera
Fri Feb 8 17:23:17 CET 2019
I can reproduce this with read.table(encoding="UTF-8") in RGui on Windows 10,
reading a file containing the two UTF-8 characters. The table is read
correctly into R as documented (both characters are represented in UTF-8
and marked as such), but the conversion of Infinity to 8 and of Zhe to
<U+0436> happens later, during printing with print.data.frame(). For
instance, it currently does not happen with print(as.matrix()). As I
wrote in more detail in another email in this thread, R sometimes needs
to convert strings to the current native encoding. Windows converts
Infinity to 8 by default as a "best fit" but fails to convert Zhe, so R
displays <U+0436> instead.
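
A minimal sketch of how one might check this in a session, assuming only base R. The characters are built from \u escapes so the script itself stays ASCII; the variable names are illustrative. What enc2native() and print() show depends on the platform and locale, so no output is claimed for those lines.

```r
# Build the two characters from escapes so the script is ASCII-only.
x <- c("\u221e", "\u0436")   # infinity sign, Cyrillic zhe

Encoding(x)                  # both strings are marked as "UTF-8"
utf8ToInt(x[1])              # 8734, i.e. the stored code point is still U+221E

# Converting to the native encoding is the lossy step. The result is
# platform- and locale-dependent; in a UTF-8 locale nothing changes.
enc2native(x)

df <- data.frame(V1 = x, stringsAsFactors = FALSE)
print(df)                    # print.data.frame(): where 8 and <U+0436> were seen in RGui
print(as.matrix(df))         # reported above not to show the substitution
```

If Encoding() and utf8ToInt() report UTF-8 and 8734, the data were imported correctly and only the display is affected.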
It is easiest to use only input files in the current native encoding, so one
could convert them before passing them to R and make sure the conversion does
not have similar problems... or use R on a non-Windows platform.
Relying on which R functions/packages happen to work with non-native encodings
may be brittle, but of course any R function that is documented to work with
non-native encodings (like read.table(encoding=)) should do so. If it does not,
it will be fixed following a bug report.
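
One way to check whether a conversion to another encoding "has similar problems" is a round trip through iconv(). This is only a sketch: roundtrips() is a hypothetical helper, not part of R, and latin1 is used in the example purely so that the result does not depend on the session's locale. Passing to = "" would test the current native encoding instead.

```r
# Hypothetical helper: TRUE where a UTF-8 string survives conversion to
# encoding `to` and back unchanged, FALSE where it is lost or substituted.
roundtrips <- function(x, to = "") {
  there <- iconv(x, from = "UTF-8", to = to)     # NA where conversion fails
  back  <- iconv(there, from = to, to = "UTF-8")
  !is.na(back) & back == x
}

x <- c("\u00e9", "\u221e")     # e with acute accent, infinity sign
ok <- roundtrips(x, to = "latin1")
ok                             # latin1 has the accented e but no infinity sign
```

Comparing after the round trip, rather than only testing for NA, also catches the case where the converter silently substitutes a similar-looking character.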
I am not sure if that is what you had in mind, but conversion of
character (string) to double is a different matter. as.double() now, as
documented in ?as.double, returns NA for "∞" (on Linux).
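
If the goal is to obtain Inf from such input, the symbol has to be mapped explicitly before conversion. A minimal sketch, assuming the data use U+221E for infinity; parse_with_infinity() is a hypothetical name, not an R function.

```r
x <- c("\u221e", "-\u221e", "1.5", "Inf")

# The infinity sign is not a spelling that as.double() recognizes.
suppressWarnings(as.double(x))

# Hypothetical helper: replace the symbol with the spelling "Inf",
# which as.double() does accept, then convert.
parse_with_infinity <- function(s) {
  as.double(gsub("\u221e", "Inf", s, fixed = TRUE))
}

parse_with_infinity(x)
```

The same substitution could be applied to a character column after read.table(), for example with colClasses = "character" on that column.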
Best
Tomas
On 2/7/19 11:17 AM, David Byrne wrote:
> Bug
> Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> file containing the infinity symbol (' ∞ ') results in the infinity
> symbol imported as the number 8. Other Unicode characters seem
> unaffected, for example, Zhe: ж
>
> Expected Behavior:
> The imported data.frame should represent the infinity symbol as the
> expected 'Inf' so that normal mathematical operations can be processed
>
> Stack Overflow Post:
> I created a question on Stack Overflow where one other member was able
> to reproduce the same issues I was having. This question can be found
> at:
> https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
>
> Method to Reproduce - 1:
> A simple method to reproduce this issue is to use R-Studio: In the
> console, type the following:
>> read.table(text=" ∞", encoding="UTF-8")
> The result should be a data.frame with a single value of '8'
>
> Repeating the same with ж results in the correct expected behavior
>
> Method to Reproduce - 2:
> Create a .csv file containing the infinity and Zhe characters (I have
> attached the file for convenience, hopefully it is not rejected by your
> email service). Launch an interactive session using
>
>> r --vanilla
> Enter the following statement taking care to replace the
> <path-to-file> with the appropriate one:
>
>> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
>
> This should result in a two element data.frame; the first being the
> incorrect value of 8 with an additional <U+FEFF> and the second the
> correct value of Zhe.
>
> Note the additional <U+FEFF> prefixed to the front of the '8'. This
> appears to be a hidden character for the purposes of letting editors
> know the encoding. The following link has some explanation; however, it
> states that this is caused by Excel. The file I created was done so using
> Notepad and not Excel.
>
> https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
>
> System Details:
> OS:
>> Windows 10.0.17134 Build 17134
>
> R Version:
>> platform x86_64-w64-mingw32
>> arch x86_64
>> os mingw32
>> system x86_64, mingw32
>> status
>> major 3
>> minor 4.1
>> year 2017
>> month 06
>> day 30
>> svn rev 72865
>> language R
>> version.string R version 3.4.1 (2017-06-30)
>> nickname Single Candle
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
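
On the <U+FEFF> mentioned in the quoted report: that is a byte order mark at the start of the file. A minimal sketch of reading such a file, assuming base R's documented "UTF-8-BOM" value for fileEncoding (see ?file); the temporary file and its contents are made up for illustration.

```r
# Write a small file that starts with the UTF-8 byte order mark (EF BB BF).
path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("a,b\n")), path)

# encoding = "UTF-8" only marks the strings that are read; depending on the
# R version and platform, the BOM may remain attached to the first field.
with_bom <- read.table(path, sep = ",", encoding = "UTF-8",
                       stringsAsFactors = FALSE)

# fileEncoding = "UTF-8-BOM" re-encodes on input and drops a leading BOM.
no_bom <- read.table(path, sep = ",", fileEncoding = "UTF-8-BOM",
                     stringsAsFactors = FALSE)

nchar(with_bom$V1, type = "bytes")
nchar(no_bom$V1, type = "bytes")
unlink(path)
```

Note that fileEncoding converts the input to the native encoding while reading, so on Windows it is subject to the same conversion limits discussed above.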