[R] Problem with writing a file in UTF-8
David Heffernan
david at heffs.org.uk
Mon Feb 21 20:39:54 CET 2011
Windows is perfectly capable of handling UTF-8, but its native
encoding is UTF-16LE. Applications on Windows are meant to work with
text data in the UTF-16LE encoding. If it needs to be converted to or
from another encoding then there are services that do this (which
work). There are countless programs on Windows that are 100% Unicode
compliant.
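
To make the conversion point concrete, here is a minimal sketch in R (not from the thread). It assumes the platform's iconv() knows the name "UTF-16LE", which is usual but not guaranteed. It shows the same two characters as UTF-8 bytes and as UTF-16LE bytes, then converts back.

```r
# Sketch: the same text in two encodings, using base R's iconv().
x <- enc2utf8("\u0171\u0141")   # the two characters from the thread, as UTF-8

utf8_bytes  <- charToRaw(x)     # bytes as R stores them
utf16_bytes <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

print(utf8_bytes)               # UTF-8 byte sequence
print(utf16_bytes)              # UTF-16LE byte sequence

# Round trip: UTF-16LE bytes back to a UTF-8 string
back <- iconv(list(utf16_bytes), from = "UTF-16LE", to = "UTF-8")
print(identical(charToRaw(back), utf8_bytes))
```

If the round trip compares equal, nothing was lost in either direction.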
I don't know how R holds text data, but perhaps it holds it as char*
as do Python, Perl etc. and all the other such languages that have
problems doing Unicode properly on Windows.
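
One way to probe this from R itself is to look at the encoding mark and the byte count of a string. This is only a sketch using base R functions, not something shown in the thread. enc2utf8() is used here so the result does not depend on the session's locale.

```r
# Sketch: inspect how a non-ASCII string is held in an R session.
x <- enc2utf8("\u0171\u0141")    # force a UTF-8 representation

print(Encoding(x))               # the encoding mark attached to the string
print(nchar(x, type = "chars"))  # number of characters
print(nchar(x, type = "bytes"))  # number of bytes actually stored
print(charToRaw(x))              # the stored bytes themselves
```

A byte count larger than the character count indicates a multi-byte encoding held as a byte string.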
Basically I don't buy the idea that Windows can't do Unicode. It's
supported Unicode since NT was released back in 1993. That's nearly 20
years ago now. It's just too easy to blame it on Windows but it doesn't
ring true.
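
The useBytes workaround quoted below can be checked byte for byte. This is a hedged sketch, not the poster's code: the temporary file path and the binary-mode connection are assumptions made here so that no newline or locale translation gets in the way.

```r
# Sketch: write UTF-8 bytes verbatim and verify them, whatever the locale.
x <- enc2utf8("\u0171\u0141")        # the two problem characters, as UTF-8
path <- tempfile(fileext = ".txt")   # hypothetical output file

# Binary mode avoids newline translation; useBytes = TRUE asks writeLines()
# to emit the stored bytes without re-encoding them to the native locale.
con <- file(path, open = "wb")
writeLines(x, con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes and compare with what we expect:
# the UTF-8 bytes of x followed by a single newline.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
expected <- c(charToRaw(x), as.raw(0x0a))
print(identical(bytes, expected))
```

Comparing raw bytes sidesteps the question of how the console or an editor chooses to display the file.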
David Heffernan.
On Feb 21, 5:29 pm, Prof Brian Ripley <rip... at stats.ox.ac.uk> wrote:
> This is asking FAR too much under Windows, which has no UTF-8 locales.
> In particular, cat() (on which write() is based) will convert to the
> native locale, even if you manage to input the string as an R UTF-8
> string.
>
> And conversion is an OS service, so you are getting the conversion
> Windows sees as appropriate.
>
> The best way around this is to use a more capable OS. But you can do
> e.g.
>
> > x <- '\u0171\u0141' # ensure this really is "űŁ"
> > writeLines(x, 'foo', useBytes=TRUE) # ensure no conversion
> On Mon, 21 Feb 2011, Matt Shotwell wrote:
> > Thomas,
>
> > I wasn't able to reproduce your finding. The last two characters in my
> > 'out.txt' file were just as expected. But, I'm in an UTF-8 locale. Your
> > locale affects the encoding of characters on your platform. If you're
> > not in a UTF-8 locale, then characters are converted from your native
> > encoding to UTF-8 (when you specify encoding="UTF-8"). In the process of
> > conversion, it's possible to lose information. You can test whether
> > there is a loss (or a change rather) when R writes these characters like
> > so:
>
> > # what does űŁ look like in binary (hex)?
> > raw_before <- charToRaw("űŁ")
>
> > # write 'out.txt' as before
> > out <- file(description="out.txt", open="w", encoding="UTF-8")
> > write(x="űŁ", file=out)
> > close(con=out)
>
> > # read in the two characters
> > out <- file(description="out.txt", open="r", encoding="UTF-8")
> > raw_after <- charToRaw(readChar(con=out, nchars=2))
> > close(con=out)
>
> > # compare the raw representations
> > identical(raw_before, raw_after)
>
> > This test passes on my machine. But, there's also the question of
> > whether these characters made it onto R-help list unaltered. Also,
> > please include the result of sessionInfo() in your subsequent messages.
>
> > Best,
> > Matt
>
> >> sessionInfo()
> > R version 2.11.1 (2010-05-31)
> > i686-pc-linux-gnu
>
> > locale:
> > [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
> > [3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
> > [5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
> > [7] LC_PAPER=en_US.utf8 LC_NAME=C
> > [9] LC_ADDRESS=C LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
>
> > On Thu, 2011-02-17 at 13:54 -0800, tpklein wrote:
>
> >> Hello,
>
> >> I am working with a data frame containing character strings with many special
> >> symbols from various European languages. When writing such character
> >> strings to a file using the UTF-8 encoding, some of them are converted in a
> >> strange way. See the following example, run in R 2.12.1 on Windows 7:
>
> >> out <- file( description="out.txt", open="w", encoding="UTF-8")
> >> write( x="äöüßæűŁ", file=out )
> >> close( con=out )
>
> >> The last two symbols in the character string are converted to "uL" while all
> >> other characters are not changed (which is what I want). How to explain
> >> this? Does it have something to do with my locale? And is there a way to
> >> work around this problem? -- Any help would be greatly appreciated.
>
> >> Thomas
>
> > ______________________________________________
> > R-h... at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> --
> Brian D. Ripley, rip... at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595