[Rd] Native characterset is wrong for unicode builds for Windows

Thu Feb 26 23:44:33 CET 2015

On Thu, Feb 26, 2015 at 2:09 PM, maillist at tlink.de <maillist at tlink.de>
wrote:

>
> When I send some outlandish characters through enc2native (or format) in R
> 3.1.2 on Ubuntu trusty it works quite well:
>
> > "®ØΔЊת"
> [1] "®ØΔЊת"
> > enc2native("®ØΔЊת")
> [1] "®ØΔЊת"
> > Encoding(enc2native("®ØΔЊת"))
> [1] "UTF-8"
>
> In Windows the result is different:
>
> > "®ØΔЊת"
> [1] "®ØΔЊת"
> > enc2native("®ØΔЊת")
> [1] "®Ø<U+0394><U+040A><U+05EA>"
> > Encoding(enc2native("®ØΔЊת"))
> [1] "latin1"
>
> And this is wrong. The native character set of a unicode application under
> Windows is *Unicode*. enc2native should do the same under Windows as it
> does on Ubuntu. Also the "unknown" encoding should be changed to mean the
> same as "UTF-8" exactly as it is on Linux.
>

I think you're mixing up the term "character set" with the encoding for a
character set. Unicode is a character set. UTF-8 is one of many encodings
of Unicode.
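
To illustrate the distinction, here is a minimal sketch (the variable names are mine, and it assumes your iconv build supports "UTF-16BE"). The same Unicode code point is turned into different bytes depending on the encoding:

```r
# U+0394 GREEK CAPITAL LETTER DELTA, written as an escape so the source stays ASCII
delta <- "\u0394"

# The code point belongs to the character set (Unicode): 916, i.e. 0x394
code_point <- utf8ToInt(delta)

# The bytes depend on which encoding of Unicode is chosen
utf8_bytes  <- charToRaw(enc2utf8(delta))
utf16_bytes <- iconv(delta, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]

print(code_point)
print(utf8_bytes)
print(utf16_bytes)
```

The character is the same in both cases; only its byte representation differs.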

UTF-8 may be the native character encoding in Ubuntu, but it's not the
native encoding in Windows. According to this Stack Overflow answer, what
counts as the native encoding in Windows depends on the active code page:
  http://stackoverflow.com/a/4649507

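So you shouldn't expect enc2native to do the same thing on Linux and
Windows. The "latin1" result above suggests that your Windows session uses
a Latin-1 style code page, which cannot represent Δ, Њ, or ת, hence the
<U+0394>-style escapes. If you really want UTF-8, you can use enc2utf8.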
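
A rough sketch of how to check this in your own session (the variable names are mine, and the locale-dependent output will differ between machines):

```r
# The string from the report, written with escapes: ®ØΔЊת
x <- "\u00ae\u00d8\u0394\u040a\u05ea"

# Inspect what R considers the native encoding of this session
info <- l10n_info()
print(info[["UTF-8"]])             # TRUE only when the locale is UTF-8
print(Sys.getlocale("LC_CTYPE"))   # locale name, platform dependent

# enc2utf8 gives UTF-8 regardless of the native encoding
y <- enc2utf8(x)
print(Encoding(y))
print(nchar(y, type = "chars"))
```

Unlike enc2native, the enc2utf8 result should not depend on the code page, so all five characters survive.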

-Winston
