[Rd] Support writing UTF-8 output in Windows

Duncan Murdoch murdoch.duncan at gmail.com
Sun Nov 10 12:39:01 CET 2013


On 13-11-09 6:58 PM, Ben Bolker wrote:
> Duncan Murdoch <murdoch.duncan <at> gmail.com> writes:
>
>>
>> On 13-11-09 12:07 PM, Sverre Stausland wrote:
>>> As recently discussed on Stack Overflow, R for Mac OS and Ubuntu (so
>>> probably all Unix systems) can correctly write files with UTF-8
>>> encoding, but R for Windows cannot:
>>
>> That's not an accurate description of the problem.  Some functions in R
>> convert values to the native encoding, but not all do.
>>
>>> http://stackoverflow.com/questions/19877676/write-utf-8-files-from-r
>>>
>>> I strongly suggest that R for Windows should support this feature in
>>> upcoming versions.
>>
>> It's not trivial to do.  When R was written, and perhaps still on some
>> obscure platforms, there wasn't any way to do that--Windows didn't
>> support UTF-8 then, just Microsoft's version of UCS-2 and a variety of
>> other more limited encodings.  Unix platforms didn't support UCS-2.  So
>> internally R keeps many things in the native encoding.
>>
>> If you decide to rewrite R from scratch now, I'd suggest that you handle
>> things differently.  If you'd rather not rewrite it yourself, then I
>> don't know how you will convince someone else to take on that job.
>>
>> You might find it easier to convince Microsoft to add a UTF-8 locale, so
>> then the native encoding would be UTF-8, and the problem would go away.
>>
>> Duncan Murdoch
>
>    Would it be fairer / more productive to say/ask:
>
> * it would be nice if write.table could write files in UTF-8 encoding

I agree.  A couple of months ago I investigated the fact that read.table 
could not read UTF-8 files if the characters could not be converted to 
the local encoding.  (E.g. reading Russian characters in an English 
locale seemed to be impossible.)  readLines() can read them, but 
read.table converted them to the native encoding and that killed them.
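
As a rough illustration of the readLines() half of that observation (a sketch only; the sample string and temporary file are made up for the example), the following writes UTF-8 bytes to disk and reads them back without any conversion to the native encoding:

```r
# Russian sample text written with \u escapes, so this script itself stays ASCII.
txt <- "\u043f\u0440\u0438\u0432\u0435\u0442"

# Write the UTF-8 bytes directly, so nothing is re-encoded on the way out.
path <- tempfile(fileext = ".txt")
writeBin(charToRaw(enc2utf8(txt)), path)

# readLines() does not re-encode its input; encoding = "UTF-8" only marks
# the resulting strings as UTF-8.
x <- readLines(path, encoding = "UTF-8", warn = FALSE)

# The bytes read back are exactly the bytes that were written.
identical(charToRaw(x), charToRaw(txt))
```

Passing the same file through read.table in a locale that cannot represent these characters is where the conversion problem shows up.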

This is probably fixable, but it requires low level changes to a very 
commonly used function, so it's likely to break something somewhere.

I haven't looked closely at write.table, but I suspect the problem there 
is with format(). Connections know their encoding, but format() converts 
to the native encoding.
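
Until write.table itself is changed, one workaround that is often suggested (it is an assumption that it fits your case, and it bypasses format() rather than fixing it) is to convert the strings to UTF-8 yourself and have writeLines() emit the bytes unchanged. The helper name write_utf8_lines is hypothetical:

```r
# Hypothetical helper: write a character vector as UTF-8, one element per line.
write_utf8_lines <- function(x, path) {
  con <- file(path, open = "wb")   # binary mode: "\n" is written as-is
  on.exit(close(con))
  # enc2utf8() converts to UTF-8; useBytes = TRUE stops writeLines() from
  # translating the strings back to the native encoding.
  writeLines(enc2utf8(x), con, useBytes = TRUE)
  invisible(path)
}

x <- c("caf\u00e9", "\u043f\u0440\u0438\u0432\u0435\u0442")
path <- write_utf8_lines(x, tempfile(fileext = ".txt"))

# Check the file contents byte for byte.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
expected <- c(charToRaw(x[1]), as.raw(10), charToRaw(x[2]), as.raw(10))
identical(bytes, expected)
```

This only handles plain lines of text; building the delimited rows that write.table would produce is left to the caller.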

> * is there any documentation already available about which R functions
> _do_ handle UTF-8 output on Windows, and how they do it?

I don't think so.  In general, functions that convert to the native 
encoding break UTF-8 on Windows, because the native encoding is often 
Latin1 or some other encoding that doesn't cover all the characters 
that UTF-8 can represent.  If you look through the source you can work 
out which functions those are, but it's not easy.
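
To see why such a conversion is lossy, here is a small sketch using iconv() with Latin1 standing in for the native encoding (an assumption; the actual native encoding depends on the Windows locale):

```r
# Cyrillic text has no representation in Latin1, so the conversion fails
# and iconv() returns NA (its default when sub = NA).
cyr <- "\u043f\u0440\u0438\u0432\u0435\u0442"
cyr_latin1 <- iconv(cyr, from = "UTF-8", to = "latin1")
is.na(cyr_latin1)

# Text that Latin1 does cover survives: e-acute becomes the single byte 0xe9.
cafe_latin1 <- iconv("caf\u00e9", from = "UTF-8", to = "latin1")
charToRaw(cafe_latin1)

# l10n_info() reports whether the current session's native encoding is
# UTF-8 or Latin-1.
l10n_info()
```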

> * could they be used as models for adapting write.table to write files
> in UTF-8 encoding on Windows?
>
>    i.e., instead of "convert R to output UTF-8 universally on Windows",
> "figure out how to make write.table output UTF-8 on Windows, or
> suggest a workaround" ?

I imagine that if I (or someone else) attempt to get read.table working 
in this situation, we'd try to get write.table working too.

Duncan Murdoch


