[Rd] Native characterset is wrong for unicode builds for Windows
maillist at tlink.de
Fri Feb 27 00:34:03 CET 2015
> On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>> When I send some outlandish characters through enc2native (or format) in
>> R 3.1.2 on Ubuntu trusty it works quite well:
>>
>> > "®ØΔЊת"
>> [1] "®ØΔЊת"
>> > enc2native("®ØΔЊת")
>> [1] "®ØΔЊת"
>> > Encoding(enc2native("®ØΔЊת"))
>> [1] "UTF-8"
>>
>> In Windows the result is different:
>>
>> > "®ØΔЊת"
>> [1] "®ØΔЊת"
>> > enc2native("®ØΔЊת")
>> [1] "®Ø<U+0394><U+040A><U+05EA>"
>> > Encoding(enc2native("®ØΔЊת"))
>> [1] "latin1"
>>
>> And this is wrong. The native character set of a unicode application
>> under Windows is *Unicode*. enc2native should do the same under Windows
>> as it does on Ubuntu. Also the "unknown" encoding should be changed to
>> mean the same as "UTF-8" exactly as it is on Linux.
> What is a "unicode application", and what makes you think R is one? R
> is being told by Windows that your native encoding is latin1. Perhaps
> Windows 8 supports UTF-8 as a native encoding (I've never used it), but
> previous versions of Windows didn't.
>
> Duncan Murdoch
>
A Unicode application is a program that uses the Unicode API of Windows,
that is, the functions whose names end in W. For such an application the
system code page (native encoding) is completely irrelevant. The system
code page is just a compatibility feature that enables Windows
NT/Vista/7/8 to run applications that were developed for Windows 95,
which didn't have Unicode support. But that line of operating systems
has been dead for 10 years now. R obviously is a Unicode application
because it can print - or read from the clipboard - characters like
"ΔЊת" that are not in my system code page, which is not possible
through the legacy API.
Neither the Unicode API nor the legacy API accepts UTF-8. The legacy API
needs strings encoded according to the active code page and the Unicode
API wants UTF-16. If you have UTF-8 you need to convert it either to
the active code page, which will lose all characters that are not
covered by it, or to UTF-16 and then use the Unicode functions. But
this is not the problem: the Windows interface functions of R are
already working quite nicely with Unicode.
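
The two conversion paths can be illustrated with a minimal sketch. It
uses Python's built-in codecs as a stand-in and does not call the
Windows API. It assumes latin-1 as the active code page, as in the
Windows session quoted above; all variable names are made up for the
illustration.

```python
# Illustration only: Python codecs stand in for the conversions a
# Windows program would perform; no Windows API function is called.

# The sample string from the report, written with escapes:
# U+00AE, U+00D8 (both in latin-1), U+0394, U+040A, U+05EA (not in latin-1).
text = "\u00ae\u00d8\u0394\u040a\u05ea"
utf8_bytes = text.encode("utf-8")

# Both paths start by decoding the UTF-8 input.
decoded = utf8_bytes.decode("utf-8")

# Path 1: convert to the active code page (latin-1 is an assumption).
# Characters outside the code page cannot be represented and are
# replaced with "?" here.
legacy_bytes = decoded.encode("latin-1", errors="replace")
legacy_roundtrip = legacy_bytes.decode("latin-1")

# Path 2: convert to UTF-16 little-endian, the form the W functions
# expect. Every character survives the round trip.
utf16_bytes = decoded.encode("utf-16-le")
utf16_roundtrip = utf16_bytes.decode("utf-16-le")

print(legacy_roundtrip == text)   # False: three characters were lost
print(utf16_roundtrip == text)    # True: nothing was lost
```

Only the code-page path is lossy, which matches the difference between
the enc2native output on Windows and on Ubuntu shown above.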