[Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8
Daniel Possenriede
po@@enr|ede @end|ng |rom gm@||@com
Thu Feb 7 15:10:11 CET 2019
There seems to be something odd with "∞" on Windows (and not only with
read.table).
In the native encoding (CP-1252 in my case), "∞" gets converted to "8":
x <- "∞"
Encoding(x)
#> [1] "unknown"
print(x)
#> [1] "8"
charToRaw(x)
#> [1] 38
"∞" is indeed "8"
identical(x, "8")
#> [1] TRUE
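A locale-independent way to check which character a string really holds is to look at its Unicode code points. The following is only a minimal sketch; the helper name code_points is made up for illustration and is not part of base R.

```r
# Hypothetical helper (not part of base R): list the Unicode code points of a
# single string, formatted as U+XXXX.
code_points <- function(s) {
  # enc2utf8() makes sure utf8ToInt() receives UTF-8 input
  sprintf("U+%04X", utf8ToInt(enc2utf8(s)))
}

code_points("\u221E")  # the infinity symbol, written as an escape
code_points("8")       # the digit that x was silently turned into
```

On the affected setup, code_points(x) should report U+0038, consistent with the charToRaw() output above.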
Everything seems fine if "∞" is UTF-8 encoded.
y <- "\u221E"
Encoding(y)
#> [1] "UTF-8"
print(y)
#> [1] "∞"
charToRaw(y)
#> [1] e2 88 9e
Unless the string is converted back to native encoding.
format(y)
#> [1] "8"
This ought to be "<U+221E>", analogous to
format("∝")
#> [1] "<U+221D>"
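To make the expected output concrete, here is a rough sketch of the kind of escaping I would expect for characters that cannot be represented. It is not how format() is implemented; the function escape_non_ascii is hypothetical and, as a simplifying assumption, escapes every non-ASCII character rather than only those missing from the native code page.

```r
# Hypothetical illustration only: render each non-ASCII character of a single
# string as <U+XXXX> and keep ASCII characters unchanged.
escape_non_ascii <- function(s) {
  cps <- utf8ToInt(enc2utf8(s))             # Unicode code points of s
  ascii <- vapply(cps, intToUtf8, "")       # each code point as a character
  escaped <- sprintf("<U+%04X>", cps)       # each code point as an escape
  paste0(ifelse(cps < 128L, ascii, escaped), collapse = "")
}

escape_non_ascii("x = \u221E")
escape_non_ascii("\u221D")
```

With this sketch, both "∞" and "∝" are escaped the same way, which is the consistency I would expect from format() as well.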
Session Info:
si <- sessionInfo()
si$running
#> [1] "Windows 10 x64 (build 17134)"
si$R.version$version.string
#> [1] "R version 3.5.2 (2018-12-20)"
si$locale
#> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
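Regarding the <U+FEFF> mentioned in the original report quoted below: that is the byte order mark (BOM) that some Windows editors, Notepad included, write at the start of UTF-8 files. A minimal sketch for detecting and stripping it at the byte level, assuming a small file that fits in memory; the helper has_utf8_bom and the temporary file are made up for illustration.

```r
# Build a small UTF-8 test file that starts with a BOM (EF BB BF), followed by
# the infinity symbol, a comma, and Zhe. The path is a throwaway temp file.
path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("\u221E,\u0436\n")), path)

# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM bytes.
has_utf8_bom <- function(path) {
  identical(readBin(path, what = "raw", n = 3L), as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Read all bytes, drop the BOM if present, and mark the result as UTF-8.
bytes <- readBin(path, what = "raw", n = file.size(path))
if (has_utf8_bom(path)) bytes <- bytes[-(1:3)]
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"

has_utf8_bom(path)
utf8ToInt(txt)[1]  # first code point after the BOM has been removed
```

The documentation for file() also lists the encoding name "UTF-8-BOM", which drops the mark when reading, although I would not expect that alone to change how "∞" is converted.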
On Thu, 7 Feb 2019 at 14:33, David Byrne <david.byrne222 using gmail.com> wrote:
> I can confirm that it doesn't happen on Ubuntu 18.04.1, so Peter is
> most likely correct; it looks like it's Windows-specific.
>
> On Thu, 7 Feb 2019 at 12:55, peter dalgaard <pdalgd using gmail.com> wrote:
> >
> > This doesn't seem to be happening on macOS, neither in Terminal nor
> > RStudio (R 3.5.1, R-devel, R-patched). So probably Windows-specific.
> >
> > -pd
> >
> > > On 7 Feb 2019, at 11:17, David Byrne <david.byrne222 using gmail.com> wrote:
> > >
> > > Bug
> > > Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> > > file containing the infinity symbol (' ∞ ') results in the infinity
> > > symbol imported as the number 8. Other Unicode characters seem
> > > unaffected, for example Zhe: ж
> > >
> > > Expected Behavior:
> > > The imported data.frame should represent the infinity symbol as
> > > 'Inf', so that normal mathematical operations can be performed.
> > >
> > > Stack Overflow Post:
> > > I created a question on Stack Overflow where one other member was able
> > > to reproduce the same issues I was having. This question can be found
> > > at:
> > >
> > > https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> > >
> > > Method to Reproduce - 1:
> > > A simple method to reproduce this issue is to use RStudio. In the
> > > console, type the following:
> > >> read.table(text=" ∞", encoding="UTF-8")
> > >
> > > The result should be a data.frame with a single value of '8'.
> > >
> > > Repeating the same with ж results in the correct, expected behavior.
> > >
> > > Method to Reproduce - 2:
> > > Create a .csv file containing the infinity and Zhe characters (I have
> > > attached the file for convenience; hopefully it is not rejected by your
> > > email service). Launch an interactive session using
> > >
> > >> r --vanilla
> > >
> > > Enter the following statement taking care to replace the
> > > <path-to-file> with the appropriate one:
> > >
> > >> read.table("<path-to-file>/unicode_chars.csv", sep=",", encoding="UTF-8")
> > >
> > >
> > > This should result in a two-element data.frame: the first element is
> > > the incorrect value of 8 with an additional <U+FEFF>, and the second is
> > > the correct value of Zhe.
> > >
> > > Note the additional <U+FEFF> prefixed to the front of the '8'. This
> > > appears to be a hidden character that lets editors know the encoding.
> > > The following link has some explanation; however, it states that this
> > > is caused by Excel. I created the file using Notepad, not Excel.
> > >
> > >
> > > https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> > >
> > > System Details:
> > > OS:
> > >> Windows 10.0.17134 Build 17134
> > >
> > >
> > > R Version:
> > >> platform x86_64-w64-mingw32
> > >> arch x86_64
> > >> os mingw32
> > >> system x86_64, mingw32
> > >> status
> > >> major 3
> > >> minor 4.1
> > >> year 2017
> > >> month 06
> > >> day 30
> > >> svn rev 72865
> > >> language R
> > >> version.string R version 3.4.1 (2017-06-30)
> > >> nickname Single Candle
> > > ______________________________________________
> > > R-devel using r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Office: A 4.23
> > Email: pd.mes using cbs.dk Priv: PDalgd using gmail.com