[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

Daniel Possenriede
Thu Feb 7 15:10:11 CET 2019


There seems to be something odd with "∞" on Windows (and not only with
read.table).
In the native encoding (CP-1252 in my case), "∞" gets converted to "8":

x <-  "∞"
Encoding(x)
#> [1] "unknown"
print(x)
#> [1] "8"
charToRaw(x)
#> [1] 38

"∞" is indeed identical to "8":

identical(x, "8")
#> [1] TRUE

Everything seems fine if "∞" is UTF-8 encoded:

y <- "\u221E"
Encoding(y)
#> [1] "UTF-8"
print(y)
#> [1] "∞"
charToRaw(y)
#> [1] e2 88 9e

That is, until the string is converted back to the native encoding:

format(y)
#> [1] "8"

This ought to be "<U+221E>", analogous to what happens with "∝":

format("∝")
#> [1] "<U+221D>"
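
For anyone who wants to check this without pasting the symbol (and so
without depending on how the console or script file is encoded), here is
a minimal sketch. It only uses base R; "latin1" is my assumption for a
portable stand-in for CP-1252, and the variable names are purely
illustrative. It does not fix anything, it just makes the conversion to
a single-byte encoding explicit instead of silent.

```r
# Build U+221E from an escape so the encoding of this script is irrelevant.
inf_utf8 <- "\u221E"

# The UTF-8 string carries the expected code point and bytes.
code_point <- utf8ToInt(inf_utf8)   # integer code point (0x221E)
utf8_bytes <- charToRaw(inf_utf8)   # the three UTF-8 bytes

# Explicit conversion to a single-byte encoding. "latin1" is an assumed
# stand-in for CP-1252; neither encoding contains the infinity symbol.
# Without 'sub', iconv() is documented to return NA for non-convertible input.
strict <- iconv(inf_utf8, from = "UTF-8", to = "latin1")

# With sub = "byte", non-convertible bytes are replaced by <xx> hex escapes
# rather than being mapped to a look-alike character.
marked <- iconv(inf_utf8, from = "UTF-8", to = "latin1", sub = "byte")

print(code_point)
print(utf8_bytes)
print(strict)
print(marked)
```

If 'strict' is NA, the conversion correctly reports that the target
encoding has no "∞"; a result of "8" would point to the same best-fit
substitution as in format(y) above.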

Session Info:

si <- sessionInfo()
si$running
#> [1] "Windows 10 x64 (build 17134)"
si$R.version$version.string
#> [1] "R version 3.5.2 (2018-12-20)"
si$locale
#> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"



On Thu, 7 Feb 2019 at 14:33, David Byrne <david.byrne222 using gmail.com> wrote:

> I can confirm that it doesn't happen on Ubuntu 18.04.1, so Peter is
> most likely correct; it looks like it's Windows-specific.
>
> On Thu, 7 Feb 2019 at 12:55, peter dalgaard <pdalgd using gmail.com> wrote:
> >
> > This doesn't seem to be happening on macOS, neither in Terminal nor
> > RStudio (R 3.5.1, R-devel, R-patched). So probably Windows-specific.
> >
> > -pd
> >
> > > On 7 Feb 2019, at 11:17, David Byrne <david.byrne222 using gmail.com> wrote:
> > >
> > > Bug
> > > Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> > > file containing the infinity symbol ('∞') results in the infinity
> > > symbol being imported as the number 8. Other Unicode characters seem
> > > unaffected, for example Zhe: ж
> > >
> > > Expected Behavior:
> > > The imported data.frame should represent the infinity symbol as the
> > > expected 'Inf' so that normal mathematical operations can be processed
> > >
> > > Stack Overflow Post:
> > > I created a question on Stack Overflow where one other member was able
> > > to reproduce the same issue I was having. This question can be found
> > > at:
> > >
> > > https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> > >
> > > Method to Reproduce - 1:
> > > A simple method to reproduce this issue is to use RStudio. In the
> > > console, type the following:
> > >> read.table(text=" ∞", encoding="UTF-8")
> > >
> > > The result should be a data.frame with a single value of '8'
> > >
> > > Repeating the same with ж results in the correct, expected behavior.
> > >
> > > Method to Reproduce - 2:
> > > Create a .csv file containing the infinity and Zhe characters (I have
> > > attached the file for convenience; hopefully it is not rejected by your
> > > email service). Launch an interactive session using
> > >
> > >> r --vanilla
> > >
> > > Enter the following statement taking care to replace the
> > > <path-to-file> with the appropriate one:
> > >
> > >> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
> > >
> > >
> > > This should result in a two-element data.frame: the first element is
> > > the incorrect value 8 with an additional <U+FEFF> prefix, and the
> > > second is the correct value, Zhe.
> > >
> > > Note the additional <U+FEFF> prefixed to the '8'. This appears to be a
> > > hidden character (a byte order mark) that lets editors know the
> > > encoding. The following link has some explanation; however, it states
> > > that this is caused by Excel. I created the file using Notepad, not
> > > Excel.
> > >
> > > https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> > >
> > > System Details:
> > > OS:
> > >> Windows 10.0.17134 Build 17134
> > >
> > >
> > > R Version:
> > >> platform       x86_64-w64-mingw32
> > >> arch           x86_64
> > >> os             mingw32
> > >> system         x86_64, mingw32
> > >> status
> > >> major          3
> > >> minor          4.1
> > >> year           2017
> > >> month          06
> > >> day            30
> > >> svn rev        72865
> > >> language       R
> > >> version.string R version 3.4.1 (2017-06-30)
> > >> nickname       Single Candle
> > > ______________________________________________
> > > R-devel using r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Office: A 4.23
> > Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com
