[Rd] Native characterset is wrong for unicode builds for Windows

maillist at tlink.de maillist at tlink.de
Fri Feb 27 08:31:38 CET 2015


On 27.02.2015 at 03:13, Duncan Murdoch wrote:
> On 26/02/2015 6:34 PM, maillist at tlink.de wrote:
>>> On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>>>> When I send some outlandish characters through enc2native (or format) in
>>>> R 3.1.2 on Ubuntu trusty it works quite well:
>>>>
>>>>    > "®ØΔЊת"
>>>> [1] "®ØΔЊת"
>>>>    > enc2native("®ØΔЊת")
>>>> [1] "®ØΔЊת"
>>>>    > Encoding(enc2native("®ØΔЊת"))
>>>> [1] "UTF-8"
>>>>
>>>> In Windows the result is different:
>>>>
>>>>    > "®ØΔЊת"
>>>> [1] "®ØΔЊת"
>>>>    > enc2native("®ØΔЊת")
>>>> [1] "®Ø<U+0394><U+040A><U+05EA>"
>>>>    > Encoding(enc2native("®ØΔЊת"))
>>>> [1] "latin1"
>>>>
>>>> And this is wrong. The native character set of a unicode application
>>>> under Windows is *Unicode*. enc2native should do the same under Windows
>>>> as it does on Ubuntu. Also the "unknown" encoding should be changed to
>>>> mean the same as "UTF-8" exactly as it is on Linux.
>>> What is a "unicode application", and what makes you think R is one?  R
>>> is being told by Windows that your native encoding is latin1.  Perhaps
>>> Windows 8 supports UTF-8 as a native encoding (I've never used it), but
>>> previous versions of Windows didn't.
>>>
>>> Duncan Murdoch
>>>
>> A unicode application is a program that uses the unicode API of Windows
> R uses those functions, so I guess it is a "unicode application".  But
> internally it uses an 8 bit encoding (normally the native one for the
> platform it is running on, which in your case is apparently latin1).
>
>> - the functions with the ending W. For such an application the system
>> code page (native encoding) is completely irrelevant. The system code
>> page is just a compatibility feature that enables Windows NT/Vista/7/8
>> to run applications that were developed for Windows 95 which didn't have
>> unicode support.
> Windows 95 had UCS-2 support, which was pretty close to UTF-16.
>
>> But this line of operating systems is dead for 10 years
>> now. R obviously is a unicode application because it can print - or read
>> from the clipboard - characters like "ΔЊת" that are not in my system
>> code page which is not possible over the legacy API.
> So "unicode application" is something you just made up.
>
> If you use Windows development tools, they have macros to convert
> generic functions to either A or W versions.  R doesn't use those.  It
> calls the W functions when it has UTF-16 characters, and A functions
> when it has native characters.  I would love it if R was a UTF-8
> application, because it would make life so much simpler, but Windows
> doesn't support that.  So R needs to do tons of conversions.  If you
> don't like that, you probably need to stick with Ubuntu.
>
> Duncan Murdoch
>

I am not complaining about those conversions; they already work just
fine. I am complaining about enc2native breaking things in the Windows
builds. An assignment like

s <- format("®ØΔЊת")

has no interaction with Windows at all, yet afterwards "s" contains
garbage like "®Ø<U+0394><U+040A><U+05EA>". And if a native encoding of
UTF-8, as defined by enc2native, works on Ubuntu, why shouldn't it work
on Windows?
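
To reproduce this without depending on how the console or the source
file encodes the literal, the string can be built from its code points.
This is only a minimal sketch, not a fix: the variable names are mine,
and what enc2native returns depends on the locale of the session it is
run in.

```r
# Code points of the five characters: U+00AE, U+00D8, U+0394, U+040A, U+05EA
cp <- as.integer(c(0xAE, 0xD8, 0x394, 0x40A, 0x5EA))

# intToUtf8() returns a single string that is marked as UTF-8
s <- intToUtf8(cp)
print(Encoding(s))

# enc2utf8() keeps every character on any platform
u <- enc2utf8(s)
print(identical(utf8ToInt(u), cp))

# enc2native() converts to the session's native encoding. In a UTF-8
# locale nothing is lost; in a latin1-type locale, characters that the
# code page cannot represent come back as <U+XXXX> escapes.
n <- enc2native(s)
print(Encoding(n))
print(n)

# Reports whether the current session runs in a UTF-8 locale
print(l10n_info()[["UTF-8"]])
```

Comparing Encoding(n) and the printed value of n between the two
platforms shows the difference I mean, while u stays intact on both.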


