[Rd] encoding question again

Prof Brian Ripley ripley at stats.ox.ac.uk
Sat Dec 29 18:28:58 CET 2007


On Sat, 29 Dec 2007, Simon Urbanek wrote:

> Oops, this was supposed to be a private reply ;) - sorry about the
> noise. The essence in English:
> JGR uses all strings in UTF-8 encoding, but the system locale reports
> CP1252 which impedes automatic conversions (because R doesn't know
> that everything is UTF-8). Specific conversion via iconv works as
> expected (see the example below).

On Windows there are no UTF-8 locales, but you can probably get the same 
effect by marking the strings via Encoding(), as they will be converted to 
CP1252 (a Latin-1 superset) on output.  A console that is running in a 
non-native encoding needs to convert everything going to and from R. 
We've experimented with running R in UTF-8 on Windows, but then you need 
to convert _everything_ coming in and going out and (and this is the 
killer) so would every package with C-level I/O.  (Tcl/Tk and Perl have 
gone down that route, and to a large extent left their extensions behind.)
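
A minimal sketch of the marking approach, assuming the front end hands R bytes that are already UTF-8. The string is built from raw bytes so the example does not depend on the encoding of this message; the name `x` is made up for illustration.

```r
# "nür" as UTF-8 bytes (6e c3 bc 72).
x <- rawToChar(as.raw(c(0x6e, 0xc3, 0xbc, 0x72)))

Encoding(x)               # no declared encoding yet, so R assumes the native one
Encoding(x) <- "UTF-8"    # declare the encoding; the bytes themselves are unchanged
Encoding(x)

charToRaw(x)              # still the same four bytes
nchar(x, type = "bytes")  # byte count, independent of the declared encoding
```

Once the string carries the mark, R has what it needs to translate it to the native encoding (CP1252 here) on output.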

>
> Cheers,
> Simon
>
> On Dec 29, 2007, at 11:11 AM, Simon Urbanek wrote:
>
>> Hello Matthias,
>>
>> On Dec 27, 2007, at 3:52 PM, Matthias Wendel wrote:
>>
>>> Hi Simon,
>>> I followed your advice by adding/changing the lines
>>>  abt = iconv(abt,"utf-8","latin1")
>>>  zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt",
>>> encoding = "latin1")
>>> but this yielded the same results.
>>
>> I finally have a Windows machine to test on, and for me the file name
>> is created correctly ...
>>
>> Still, the locale apparently does not match: JGR always uses UTF-8,
>> but the system reports CP1252. That seems to be why the automatic
>> conversion (file(..., encoding = ...)) does not work. What does always
>> work, however, is explicit conversion:
>>
>> a=file("foo","wt")
>> writeLines(iconv(..., "utf-8","latin1"),a)
>> close(a)
>>
>> (FWIW: since UTF-8 is the recommended encoding for web pages anyway,
>> you don't really need it ... ;))
>>
>> charToRaw is always a good test, because UTF-8 usually needs 2 bytes
>> for an umlaut and latin1 only one.
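
The explicit conversion can be checked end to end by reading the file back as raw bytes. This is only a sketch: the temporary file, the variable names, and the use of `useBytes = TRUE` with a binary-mode connection are assumptions added here so that nothing is re-encoded on the way to disk; they are not part of the advice quoted above.

```r
# "nür" as UTF-8 bytes, declared as UTF-8.
x <- rawToChar(as.raw(c(0x6e, 0xc3, 0xbc, 0x72)))
Encoding(x) <- "UTF-8"

# Convert explicitly instead of relying on file(..., encoding = ).
y <- iconv(x, "UTF-8", "latin1")

# Write the bytes of y as they are (hypothetical temporary file).
path <- tempfile(fileext = ".txt")
con <- file(path, "wb")
writeLines(y, con, useBytes = TRUE)
close(con)

# Inspect what actually ended up on disk: one byte (fc) for the umlaut.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)
bytes
unlink(path)
```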
>>
>> Best regards,
>> Simon
>>
>>
>>> -----Original Message-----
>>> From: Simon Urbanek [mailto:simon.urbanek at r-project.org]
>>> Sent: Thursday, 27 December 2007 21:40
>>> To: Matthias Wendel
>>> Cc: r-devel at r-project.org
>>> Subject: Re: [Rd] encoding question again
>>>
>>> Matthias,
>>>
>>> you get exactly what you specified - namely UTF-8. If you want your
>>> html file to be latin1, then you have to say so:
>>>
>>> zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt",
>>> encoding = "latin1")
>>>
>>> In addition, you're assuming that `abt' is in the correct encoding
>>> to be understood by your OS. If it's not, you better convert it into
>>> one.
>>> From your results it seems as if `abt' is also UTF-8 encoded. Since
>>> you didn't tell us where you got that from, you should either fix
>>> the source or use something like iconv(abt,"utf-8","latin1"):
>>>
>>> (in UTF-8 locale)
>>>> abt="nür"
>>>> cat(abt,"\n")
>>> nür
>>>> charToRaw(abt)
>>> [1] 6e c3 bc 72
>>>> charToRaw(iconv(abt,"utf-8","latin1"))
>>> [1] 6e fc 72
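
One plausible reading of the garbled umlauts in the report is that text which was already UTF-8 got encoded a second time. The sketch below reproduces that effect under stated assumptions: latin1 stands in for CP1252 (the two agree for the bytes involved), and the variable names are made up.

```r
# "ä" as UTF-8 bytes (c3 a4), with no encoding declared.
a_utf8 <- rawToChar(as.raw(c(0xc3, 0xa4)))

# Treat those bytes as latin1 and convert to UTF-8: each byte is taken as
# a separate character and encoded again.
twice <- iconv(a_utf8, "latin1", "UTF-8")
charToRaw(twice)

# Converting back recovers the original two bytes.
back <- iconv(twice, "UTF-8", "latin1")
charToRaw(back)
```

Two bytes for one umlaut indicate plain UTF-8; four bytes suggest a double conversion somewhere along the way.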
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>> On Dec 27, 2007, at 3:11 PM, Matthias Wendel wrote:
>>>
>>>> Hi, R Devils,
>>>> I'm running the current R version in JGR (version 1.5-8).
>>>> Sys.getlocale(category = "LC_ALL") yields
>>>> [1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
>>>>
>>>> I want to write some HTML code, enriched with statistical results and
>>>> labels encoded in Latin-1, which I pass to a function. One of the labels
>>>> is used to build the filename. Although the labels are handled correctly
>>>> in JGR, they are somehow converted when they are written to the file.
>>>> The filename is also not constructed as intended. The function
>>>> definition is sourced into R correctly. The function is defined like
>>>> this:
>>>>
>>>> Itemtabelle.head <- function (abt ){
>>>> # nür zöm TÄST
>>>> zz = file( paste("Itemtabelle/Itemtabelle", abt, ".html"), "wt",
>>>>            encoding = "UTF-8")
>>>> cat(as.character("<html xmlns:o=\"urn:schemas-microsoft-com:office:office\" xmlns:x=\"urn:schemas-microsoft-com:office:excel\" xmlns=\"http://www.w3.org/TR/REC-html40\">  \n"),
>>>>     as.character("<head>\n "),
>>>> 		.
>>>> 		.
>>>> 		.
>>>>     as.character("        <td colspan=5 class=xl28 width=727 style=\'width:545pt\'>Gesundheitsindikatoren:  "), abt, as.character("</td>                                   \n"),
>>>>     as.character("       </tr>"), file  = zz)
>>>>     close(zz)
>>>>     unlink(zz)
>>>> }
>>>> Setting abt as " Ärzte Innere, Gynäkologie" and calling the function
>>>> with this argument, yields a filename "Itemtabelle  Ã?rzte Innere,
>>>> Gynäkologie .html" and in the file a line
>>>>       <td colspan=5 class=xl28 width=727 style='width:
>>>> 545pt'>Gesundheitsindikatoren:    ��rzte Innere, Gyn�¤kologie </
>>>> td>
>>>> is generated.
>>>> I tried to solve this by using iconv, without success.
>>>> The problem remains the same in the rgui and rterm - in rterm the
>>>> resulting filename is "Itemtabelle ?rzte Innere,
>>>> Gyn?kologie  .html".
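
Separately from the encoding issue, the spaces around the label in the resulting filename are consistent with paste() using its default separator of a single space. A small sketch with a hypothetical ASCII label:

```r
abt <- "Test"  # hypothetical label, ASCII only to keep encoding out of the picture

# Default separator: a space is inserted between every pair of arguments.
spaced <- paste("Itemtabelle/Itemtabelle", abt, ".html")
spaced

# Empty separator: the pieces are joined directly.
joined <- paste("Itemtabelle/Itemtabelle", abt, ".html", sep = "")
joined
```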
>>>>
>>>> Cheers,
>>>> Matthias
>>>>
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>>>
>>>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

