[Rd] Support writing UTF-8 output in Windows
Sverre Stausland
stausland.johnsen at iln.uio.no
Sun Nov 10 20:41:52 CET 2013
By UTF-18 I meant UTF-16, obviously.
On Sun, Nov 10, 2013 at 8:41 PM, Sverre Stausland
<stausland.johnsen at iln.uio.no> wrote:
> With respect to your comment (sorry, the e-mail you wrote that in
> didn't get to my inbox):
>
>>> I don't think so. In general, functions that convert to the native
>>> encoding break UTF-8 on Windows, because the native encoding is often
>>> Latin1 or some other encoding that doesn't cover all the characters in
>>> UTF-8.
>
> As I understand it, the native encoding in Windows is UTF-18, not Latin1:
> http://msdn.microsoft.com/en-us/library/dd374081.aspx
>
> And UTF-18 is a superset of UTF-8, isn't it?
>
> Sverre
>
> On Sun, Nov 10, 2013 at 1:49 PM, Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
>> On 13-11-10 7:31 AM, Sverre Stausland wrote:
>>>
>>> My e-mail was intended as a typical "feature request", and I couldn't
>>> find any more suitable place for that than the r-devel mailing list. I
>>> am not a programmer, so I don't have the skills to write this into R's
>>> source code myself.
>>>
>>> The incentive is nevertheless clear enough. I believe a software
>>> program in 2013 which imports, manipulates, and exports text in
>>> various formats (text files, picture files, postscript files, etc.)
>>> would normally be expected to support UTF-8. It might not be trivial
>>> to implement as R is written now, but the expectation will still be
>>> there. So I still believe it would be a good idea if R soon would be
>>> able to support UTF-8.
>>
>>
>> R does support UTF-8. It all works smoothly in a UTF-8 locale, not so
>> smoothly if you have your computer set up to use a different 8 bit encoding.
>>
>>>
>>> I'm not quite able to piece together from the information you gave
>>> what the underlying issues are. What I read is:
>>> (1) Some R functions convert characters to the native encoding.
>>> (2) Windows did not support UTF-8 when R was first written.
>>> (3) Unix did not support UCS-2 when R was first written.
>>>
>>> I'm guessing here that the implications are:
>>> (1) R's write.table() converts characters to a native encoding.
>>> (2) The native encoding in Windows 7 is not UTF-8.
>>> (3) The native encoding in Unix systems is UTF-8.
>>
>>
>> You got it right for the first 4. Regarding (2) in your second list, that's
>> right, and in fact UTF-8 is not supported as a native encoding.
>> And point (3) is optional, though UTF-8 is the dominant encoding nowadays.
>>
>> The easiest solution is for you to switch to a Unix variant and set it up to
>> use UTF-8 as the native encoding.
>>
>> Next easiest would be for Microsoft to add UTF-8 as a code page.
>>
>> Most difficult would be for R to handle UTF-8 properly on systems with
>> limited support for it.
>>
>> We probably will add small changes that let you work around the Windows
>> problems, but they won't be very satisfactory to anyone. I don't think we
>> will make the big changes that would make R look like "a software program in
>> 2013", since it would be so much work, and there's such an easy workaround.
>>
>> Duncan Murdoch
>>
>>
>>> But this is just guesswork.
>>
>>
>>
>>>
>>> PS. A related issue:
>>>
>>> http://stackoverflow.com/questions/19881553/using-unicode-inside-rs-expression-command
>>>
>>> Sverre
>>>
>>
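
Editorial illustration of the encoding point discussed in this thread. This is a minimal sketch in Python rather than R, since the question of which characters an encoding can represent does not depend on R. The sample string and the helper name can_encode are assumptions made for this example and do not come from the thread. It suggests that UTF-8 and UTF-16 can both represent every Unicode character, so neither is a superset of the other, while 8-bit encodings such as Latin-1 or Windows-1252 cannot, which is why converting to such a native encoding loses characters.

```python
# Sample text (an assumption): "s with caron" (U+0161) and a CJK character (U+4E2D).
text = "Sverre \u0161 \u4e2d"


def can_encode(s, encoding):
    """Return True if every character of s is representable in `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Both Unicode transformation formats cover the whole string.
utf8_ok = can_encode(text, "utf-8")
utf16_ok = can_encode(text, "utf-16-le")

# 8-bit encodings cover only a small subset of Unicode.
latin1_ok = can_encode(text, "latin-1")   # lacks U+0161 and U+4E2D
cp1252_ok = can_encode(text, "cp1252")    # has U+0161, lacks U+4E2D

# Converting between UTF-8 and UTF-16 is lossless in both directions.
round_trip = (
    text.encode("utf-8").decode("utf-8")
        .encode("utf-16-le").decode("utf-16-le")
)

# Forcing the text into an 8-bit encoding substitutes unsupported characters.
lossy = text.encode("latin-1", errors="replace")

print(utf8_ok, utf16_ok, latin1_ok, cp1252_ok)
print(round_trip == text)
print(lossy)
```

The last step mirrors, in spirit only, what happens when a function converts UTF-8 text to an 8-bit native encoding before writing it out: characters outside that encoding cannot survive the conversion.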