[Rd] Native characterset is wrong for unicode builds for Windows
maillist at tlink.de
Fri Feb 27 00:55:25 CET 2015
On 26.02.2015 at 23:44, Winston Chang wrote:
> On Thu, Feb 26, 2015 at 2:09 PM, maillist at tlink.de
> <maillist at tlink.de> wrote:
>
>
> When I send some outlandish characters through enc2native (or
> format) in R 3.1.2 on Ubuntu trusty it works quite well:
>
> > "®ØΔЊת"
> [1] "®ØΔЊת"
> > enc2native("®ØΔЊת")
> [1] "®ØΔЊת"
> > Encoding(enc2native("®ØΔЊת"))
> [1] "UTF-8"
>
> In Windows the result is different:
>
> > "®ØΔЊת"
> [1] "®ØΔЊת"
> > enc2native("®ØΔЊת")
> [1] "®Ø<U+0394><U+040A><U+05EA>"
> > Encoding(enc2native("®ØΔЊת"))
> [1] "latin1"
>
> And this is wrong. The native character set of a unicode
> application under Windows is *Unicode*. enc2native should do the
> same under Windows as it does on Ubuntu. Also the "unknown"
> encoding should be changed to mean the same as "UTF-8" exactly as
> it is on Linux.
>
>
> I think you're mixing up the term "character set" with the encoding
> for a character set. Unicode is a character set. UTF-8 is one of many
> encodings of Unicode.
>
> UTF-8 may be the native character encoding in Ubuntu, but it's not the
> native encoding in Windows. According to this, what counts as the
> native encoding in Windows depends on the code page:
> http://stackoverflow.com/a/4649507
>
> So you shouldn't expect enc2native to do the same thing on Linux and
> Windows. If you really want UTF-8, you can use enc2utf8.
>
> -Winston
Maybe I'm expecting too much, but I would rather R did not produce garbage
like "®Ø<U+0394><U+040A><U+05EA>". And while I can use enc2utf8 to
convert from UTF-8 to UTF-8, this does not fix the many places, such as
"format", where enc2native is used and which are broken because of this.