[R] Problem with writing a file in UTF-8

Mon Feb 21 18:29:08 CET 2011

This is asking FAR too much under Windows, which has no UTF-8 locales. 
In particular, cat() (on which write() is based) will convert to the 
native locale, even if you manage to input the string as an R UTF-8 
string.

And conversion is an OS service, so you are getting the conversion 
Windows sees as appropriate.
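
To see where the information is lost, it can help to look at the string before any connection is involved. The following is only a sketch: it assumes the string is entered with \u escapes, and it uses iconv() with "latin1" as a stand-in for a native encoding that lacks these two characters. What the OS substitutes for them when it converts will vary by platform.

```r
# The two problem characters, entered as Unicode escapes.
x <- "\u0171\u0141"

# Strings built from \u escapes are marked as UTF-8 by R.
enc_mark <- Encoding(x)

# The UTF-8 bytes R holds internally: c5 b1 c5 81.
utf8_bytes <- charToRaw(x)
print(enc_mark)
print(utf8_bytes)

# Latin-1 has no code points for these characters, so a conversion
# cannot be exact; sub = "byte" makes the failure visible rather
# than letting the converter choose a replacement.
print(iconv(x, from = "UTF-8", to = "latin1", sub = "byte"))
```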

The best way around this is to use a more capable OS.  But you can do 
e.g.

> x <- '\u0171\u0141'  # ensure this really is "űŁ"
> writeLines(x, 'foo', useBytes=TRUE) # ensure no conversion
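
One way to confirm that no conversion took place is to read the file back as raw bytes and compare them with the bytes of the string. This is a minimal sketch, not a definitive test; the temporary file name is an assumption for illustration, and the line terminator is deliberately left out of the comparison because it differs between platforms.

```r
x <- "\u0171\u0141"

# Hypothetical output file; any writable path would do.
f <- tempfile(fileext = ".txt")

# Write the string's bytes as they are, with no re-encoding.
writeLines(x, f, useBytes = TRUE)

# Read the whole file back as raw bytes.
bytes <- readBin(f, what = "raw", n = file.size(f))

# Compare only the leading bytes, ignoring the trailing newline
# (which is "\n" or "\r\n" depending on the platform).
want <- charToRaw(x)
unchanged <- identical(bytes[seq_along(want)], want)
print(unchanged)

unlink(f)
```

If the comparison is FALSE, printing bytes shows what was actually written.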

On Mon, 21 Feb 2011, Matt Shotwell wrote:

> Thomas,
>
> I wasn't able to reproduce your finding. The last two characters in my
> 'out.txt' file were just as expected. But, I'm in an UTF-8 locale. Your
> locale affects the encoding of characters on your platform. If you're
> not in a UTF-8 locale, then characters are converted from your native
> encoding to UTF-8 (when you specify encoding="UTF-8"). In the process of
> conversion, it's possible to lose information. You can test whether
> there is a loss (or a change rather) when R writes these characters like
> so:
>
> # what does űŁ look like in binary (hex)?
> raw_before <- charToRaw("űŁ")
>
> # write 'out.txt' as before
> out <- file(description="out.txt", open="w", encoding="UTF-8")
> write(x="űŁ", file=out)
> close(con=out)
>
> # read in the two characters
> out <- file(description="out.txt", open="r", encoding="UTF-8")
> raw_after <- charToRaw(readChar(con=out, nchars=2))
> close(con=out)
>
> # compare the raw representations
> identical(raw_before, raw_after)
>
> This test passes on my machine. But, there's also the question of
> whether these characters made it onto R-help list unaltered. Also,
> please include the result of sessionInfo() in your subsequent messages.
>
> Best,
> Matt
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> i686-pc-linux-gnu
>
> locale:
> [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
> [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
> [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
> [7] LC_PAPER=en_US.utf8       LC_NAME=C
> [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> On Thu, 2011-02-17 at 13:54 -0800, tpklein wrote:
>
>> Hello,
>>
>> I am working with a data frame containing character strings with many special
>> symbols from various European languages.  When writing such character
>> strings to a file using the UTF-8 encoding, some of them are converted in a
>> strange way.  See the following example, run in R 2.12.1 on Windows 7:
>>
>> out <- file( description="out.txt", open="w", encoding="UTF-8")
>> write( x="äöüßæűŁ", file=out )
>> close( con=out )
>>
>> The last two symbols in the character string are converted to "uL" while all
>> other characters are not changed (which is what I want).  How to explain
>> this?  Does it have something to do with my locale?  And is there a way to
>> work around this problem? -- Any help would be greatly appreciated.
>>
>> Thomas
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595