[Rd] Native characterset is wrong for unicode builds for Windows
maillist at tlink.de
Fri Feb 27 21:01:47 CET 2015
On 27.02.2015 at 11:49, Duncan Murdoch wrote:
> On 27/02/2015 2:31 AM, maillist at tlink.de wrote:
>> On 27.02.2015 at 03:13, Duncan Murdoch wrote:
>>> On 26/02/2015 6:34 PM, maillist at tlink.de wrote:
>>>>> On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>>>>>> When I send some outlandish characters through enc2native (or format) in
>>>>>> R 3.1.2 on Ubuntu trusty it works quite well:
>>>>>>
>>>>>> > "®ØΔЊת"
>>>>>> [1] "®ØΔЊת"
>>>>>> > enc2native("®ØΔЊת")
>>>>>> [1] "®ØΔЊת"
>>>>>> > Encoding(enc2native("®ØΔЊת"))
>>>>>> [1] "UTF-8"
>>>>>>
>>>>>> In Windows the result is different:
>>>>>>
>>>>>> > "®ØΔЊת"
>>>>>> [1] "®ØΔЊת"
>>>>>> > enc2native("®ØΔЊת")
>>>>>> [1] "®Ø<U+0394><U+040A><U+05EA>"
>>>>>> > Encoding(enc2native("®ØΔЊת"))
>>>>>> [1] "latin1"
>>>>>>
>>>>>> And this is wrong. The native character set of a unicode application
>>>>>> under Windows is *Unicode*. enc2native should do the same under Windows
>>>>>> as it does on Ubuntu. Also the "unknown" encoding should be changed to
>>>>>> mean the same as "UTF-8" exactly as it is on Linux.
>>>>> What is a "unicode application", and what makes you think R is one? R
>>>>> is being told by Windows that your native encoding is latin1. Perhaps
>>>>> Windows 8 supports UTF-8 as a native encoding (I've never used it), but
>>>>> previous versions of Windows didn't.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>> A unicode application is a program that uses the unicode API of Windows
>>> R uses those functions, so I guess it is a "unicode application". But
>>> internally it uses an 8 bit encoding (normally the native one for the
>>> platform it is running on, which in your case is apparently latin1).
>>>
>>>> - the functions with the ending W. For such an application the system
>>>> code page (native encoding) is completely irrelevant. The system code
>>>> page is just a compatibility feature that enables Windows NT/Vista/7/8
>>>> to run applications that were developed for Windows 95 which didn't have
>>>> unicode support.
>>> Windows 95 had UCS-2 support, which was pretty close to UTF-16.
>>>
>>>> But this line of operating systems has been dead for 10 years
>>>> now. R obviously is a unicode application because it can print - or read
>>>> from the clipboard - characters like "ΔЊת" that are not in my system
>>>> code page which is not possible over the legacy API.
>>> So "unicode application" is something you just made up.
>>>
>>> If you use Windows development tools, they have macros to convert
>>> generic functions to either A or W versions. R doesn't use those. It
>>> calls the W functions when it has UTF-16 characters, and A functions
>>> when it has native characters. I would love it if R was a UTF-8
>>> application, because it would make life so much simpler, but Windows
>>> doesn't support that. So R needs to do tons of conversions. If you
>>> don't like that, you probably need to stick with Ubuntu.
>>>
>>> Duncan Murdoch
>>>
>> I am not complaining about those conversions. They work just fine
>> already. I am complaining about enc2native breaking things in the
>> Windows builds. An assignment like
>>
>> s <- format("®ØΔЊת")
>>
>> has no interaction with Windows at all, yet "s" contains garbage like
>> "®Ø<U+0394><U+040A><U+05EA>" after that. And if a native encoding of
>> UTF-8 - as defined by enc2native - works in Ubuntu, why shouldn't it
>> work in Windows?
> Because in Ubuntu, UTF-8 is the native encoding, and in your Windows
> system, latin1 is the native encoding.
>
> But I do agree that the format() issue is a problem. I haven't traced
> through the code, but I think the string "®ØΔЊת" is read using Windows
> API functions that return a UTF-16 result, then converted by R to UTF-8.
> So format() should see that it is a UTF-8 string and not convert it to
> the native encoding. There is nothing wrong with enc2native(), it's
> doing what you asked for. The problem is that format() is using it.
>
> Duncan Murdoch
I would expect that every function that uses enc2native is broken in
this respect, because it will invariably scramble most Unicode characters
in the process, and I can't think of a case where that would actually be
wanted.
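
A minimal sketch for checking this on a given machine. The helper
keeps_codepoints() is hypothetical (it is not part of R); it only compares
code points before and after a call. What enc2native() and format() return
depends on the session's native encoding and on the R version, so no
results are asserted here:

```r
# "®ØΔЊת" written with \u escapes so that the script itself stays ASCII
s <- "\u00AE\u00D8\u0394\u040A\u05EA"

# Hypothetical helper: does f() hand back the same code points it was given?
keeps_codepoints <- function(f, x) {
  out <- enc2utf8(f(x))   # normalise whatever encoding f() returned to UTF-8
  identical(utf8ToInt(out), utf8ToInt(enc2utf8(x)))
}

Encoding(s)                       # declared encoding of the literal
keeps_codepoints(enc2native, s)   # outcome depends on the native encoding
keeps_codepoints(format, s)       # outcome depends on what format() does internally
```

A FALSE result means the characters were substituted or dropped somewhere
inside the call.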
Functions that really need something other than UTF-8 are probably using
iconv and getOption("encoding") anyway, as this allows the desired
encoding to be specified much more flexibly.
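
For illustration, a sketch of such an explicit conversion with iconv(),
where the target encoding is named in the call instead of being taken from
the locale. The targets UTF-16LE and latin1 are only examples chosen here:

```r
# "®ØΔЊת" written with \u escapes so that the script itself stays ASCII
s <- "\u00AE\u00D8\u0394\u040A\u05EA"

# UTF-8 -> UTF-16LE as raw bytes; all five characters are representable
utf16 <- iconv(s, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
length(utf16)   # 10 bytes: five BMP characters, two bytes each

# ... and back again; iconv() accepts the list form produced by toRaw = TRUE
back <- iconv(list(utf16), from = "UTF-16LE", to = "UTF-8")
identical(utf8ToInt(back), utf8ToInt(s))

# A lossy target has to be requested explicitly. With sub = "byte", bytes
# that cannot be converted are replaced by their hex codes; the exact
# result may vary between iconv implementations.
lossy <- iconv(s, from = "UTF-8", to = "latin1", sub = "byte")
```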