[Rd] Problem with UTF-8 text in the Rcmdr package

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Sep 8 11:46:17 CEST 2008


Unless Windows is running in CP1250 (the Slovenian encoding on Windows), 
this is not expected to work.  I believe John tested in CP1252, and it 
just so happens that those characters are in the same place in CP1250 and 
CP1252.

I get something different in CP1250, as pasting into the script window 
also does not work.  But if I use the Unicode escapes, the result is 
rendered correctly in the output window.
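
A minimal sketch of the Unicode-escape form, assuming the six test 
characters used elsewhere in this thread (the variable names are only 
illustrative):

```r
# Build the six Slovenian test characters from Unicode escapes, so the
# source text is pure ASCII and does not depend on how the script
# window handles pasted non-ASCII input.
x <- "\u010c\u0160\u017d\u010d\u0161\u017e"   # C, S, Z, c, s, z with caron

# Inspect the code points instead of relying on how they are rendered.
cp <- utf8ToInt(enc2utf8(x))
print(sprintf("U+%04X", cp))
```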

I think Jaro has put his finger on this: Tcl/Tk output thinks it is in 
Latin-2 and not CP1250, and s and z caron have different positions in 
those two character sets.  Here is something I can reproduce easily with 
XP set to Slovenian:

> x <- "ČŠŽčšž"
> x
[1] "ČŠŽčšž"
> charToRaw(x)
[1] c8 8a 8e e8 9a 9e

which is correct for CP1250.  Now if I submit 'x' in the Rcmdr script 
window, I get the wrong output in the output window.
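
The byte values above can be checked against the code pages discussed in 
this thread.  A minimal sketch, assuming the platform's iconv recognises 
the names "CP1250", "ISO-8859-2" and "CP1252" (the helper decode() is 
hypothetical, introduced only for illustration):

```r
# The raw bytes reported by charToRaw(x) in the transcript above.
bytes <- as.raw(c(0xc8, 0x8a, 0x8e, 0xe8, 0x9a, 0x9e))

# Hypothetical helper: interpret the same bytes under a given source
# encoding and return the resulting Unicode code points.
decode <- function(bytes, from) {
  s <- iconv(rawToChar(bytes), from = from, to = "UTF-8")
  utf8ToInt(s)
}

as_cp1250 <- decode(bytes, "CP1250")      # the Slovenian Windows code page
as_latin2 <- decode(bytes, "ISO-8859-2")  # what the output is treated as
as_cp1252 <- decode(bytes, "CP1252")      # the Western European code page

print(sprintf("U+%04X", as_cp1250))
print(sprintf("U+%04X", as_latin2))
print(sprintf("U+%04X", as_cp1252))
```

Only the first and fourth bytes (c with caron) decode to the same 
characters under both CP1250 and Latin-2.  The four bytes for s and z 
with caron lie in the 0x80-0x9F range, which Latin-2 reserves for control 
characters and which CP1252 assigns to the same letters as CP1250.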

And I've tracked that down to a bug in iconv (something we take from 
libiconv on Windows): it does think the native encoding is Latin-2, not 
CP1250.  I'll put a workaround in R-devel and R-patched shortly.  That has 
other potential ramifications that will take me longer to investigate, and 
the correct thing may be to fix iconv.

On Sun, 7 Sep 2008, John Fox wrote:

> Dear Brian,
>
> Thank you for addressing the problem -- I was hoping that you would.
>
>> -----Original Message-----
>> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
>> Sent: September-07-08 7:23 AM
>> To: John Fox
>> Cc: 'R-devel'; 'Jaro.Lajovic'
>> Subject: Re: [Rd] Problem with UTF-8 text in the Rcmdr package
>>
>> The issue appears to be the Rcmdr output window and menus.  They are done
>> using Tcl/Tk, not by R.  So this might be a problem in Tcl/Tk or the fonts
>> it uses, or it might be a problem with what Rcmdr passes to the tcltk
>> package.
>>
>> We need the means to reproduce this (as per the posting guide):
>
> Jaro provides an example in one of his messages in my posting (though it is
> slightly in error): If one enters
>
> cat("ČŠŽčšž\n")
>
> in the Rcmdr Script window, the characters are rendered correctly. Executing
> this command (via the Submit button) produces the following in the Output
> window:
>
>> cat("??????\n")
> ??????
>
> which actually appears as
>
>> cat("??\n")
> ??
>
> This is under Windows Vista / R 2.7.2 / Rcmdr 1.4-0.
>
>>
>> - what OSes are affected?  Does this occur in a UTF-8 locale on Linux, for
>> example?
>
> I've now checked under Mac OS X and Linux Ubuntu, with the following
> results:
>
> Under Mac OS X 10.5.4 / R 2.7.2 / Rcmdr 1.4-0 / Tcl/Tk 8.4
>
> cat("ČŠŽčšž\n") appears as cat("?????\n") in *both* the Script window and
> the Output window.
>
> Under Ubuntu Linux 8.04 / R 2.7.0 / Rcmdr 1.4-0/ Tcl/Tk 8.5
>
> cat("ČŠŽčšž\n") appears *correctly* in *both* the Script window and the
> Output window.
>
>>
>> - in what locales?
>
> I'm afraid that I don't know how to check this short of changing the locale
> for my Windows machine. I do observe the problem in Windows when I start
> Rgui with language=sl.
>
>>
>> - what versions of Tcl/Tk?  Note that the version shipped with Windows R
>> changed between 2.5.1 and 2.7.x.
>
> Yes, and please see above, but if the problem were with Tcl/Tk, why does
> this work in the Script window under Windows and in both Script and Output
> under Ubuntu?
>
>>
>> - Is this anything to do with translations?  I've not looked at how
>> translations are done in Rcmdr, but if gettext() is used, the string
>> passed to R for output is in the native encoding, so 'UTF-8 characters' is
>> incorrect.  It is possible that it is an iconv problem if the translations
>> are supplied in UTF-8 and not Latin-2.
>
> Yes, the Rcmdr package uses gettext(). Could Jaro avoid the problem by using
> Latin-2 in preference to UTF-8?
>
>>
>> There are far too many layers involved here to guess at what is going on.
>> My guess is that it ought to be possible to give a simple example of a
>> string which can be output to the Rcmdr console and will be rendered
>> incorrectly (together with a screen shot of how it is rendered).
>
> Indeed, please see above. I've also attached a screenshot under Windows,
> having started R with language=sl.
>
>>
>> I think the characters referred to are the Unicode glyphs 's and z with
>> caron', \u0161 and \u017E.  It seems that these will only be displayable
>> in Rcmdr on Windows in a Latin-2 locale, which I do not have set up on
>> Windows (but believe I could get installed).  However, examples using that
>> (and the menus) seem to be correct in both sl_SI.iso88592 and sl_SI.utf8
>> on Linux, which suggests that this is probably not an R issue but a Tcl/Tk
>> one.
>
> I'm above my depth with respect to these issues, but I do find it curious
> that under Windows the characters appear correctly in the Script window but
> not the Output window.
>
>>
>> On Fri, 5 Sep 2008, John Fox wrote:
>>
>>> Dear list members,
>>>
>>> I've attached some email correspondence with Jaro Lajovic (with his
>>> permission), detailing a problem with the Slovenian translation file for
>>> the Rcmdr package.
>>
>> Unfortunately, it is not 'detailed', and we do need the details.
>
> I hope that the additional information in this message will supply at least
> some of the necessary details.
>
> Thank you for your help,
> John
>
>>
>>> In brief, while certain UTF-8 characters used in Slovenian used to
>>> appear properly in older versions of R, some characters do not display
>>> properly in the Rcmdr menus and output window under R 2.7.x. I've
>>> confirmed the problem with the current version of the Rcmdr package
>>> (1.4-0) and R 2.7.2 under Windows Vista.
>>>
>>> I've checked the R docs and NEWS file for changes to R, but wasn't able
>>> to turn up anything that seemed relevant. Frankly, however, my
>>> understanding of how various character sets are handled is only partial.
>>>
>>> Any help would be appreciated.
>>>
>>> John
>>>
>>> ------------------------------
>>> John Fox, Professor
>>> Department of Sociology
>>> McMaster University
>>> Hamilton, Ontario, Canada
>>> web: socserv.mcmaster.ca/jfox
>>>
>>>
>>> -----Original Message-----
>>> From: Jaro.Lajovic [mailto:Jaro.Lajovic at mf.uni-lj.si]
>>> Sent: August-26-08 2:57 AM
>>> To: John Fox
>>> Subject: Re: Slovenian Rcmdr .po and .mo - and a problem
>>>
>>> Dear John,
>>>
>>>> That seems to imply that there's a change in R rather than in the Rcmdr
>>>> that produced this problem. Do you notice the problem with any other
>>>> packages that use translation or with R itself?
>>>
>>> As for other translated R packages, I am afraid I am not aware of any.
>>> However, a quick test using cat with special characters:
>>> cat "ČŠŽčšž\n"
>>> reveals that the string prints OK in the R (2.7.1.) console. The command
>>> line also shows OK in the Rcmdr Script window, but does not display
>>> right in the Output window. Special chars also fail in the Messages
>>> window.
>>>
>>> Input (Script window) thus seems not to be affected, while the menu
>>> system and output do not work properly.
>>>
>>> Thank you very much,
>>> Jaro
>>>
>>>
>>>> On Mon, 25 Aug 2008 21:54:43 +0200
>>>>  "Jaro.Lajovic" <Jaro.Lajovic at mf.uni-lj.si> wrote:
>>>>> Dear John,
>>>>>
>>>>>> One question though: I assume from your message that the previous
>>>>>> version of the Rcmdr worked OK with R 2.7.1. Is that right?
>>>>> No, the version 1.3-5 (that I still have with R 2.5.1) does not work
>>>>> with R 2.7.1 either. So:
>>>>>
>>>>> Rcmdr 1.3-5 with R 2.5.1: works OK.
>>>>> Rcmdr 1.3-5 with R 2.7.1: does not work properly.
>>>>> Rcmdr 1.4-0 with R 2.7.1: does not work properly.
>>>>>
>>>>> Thank you in advance,
>>>>> Jaro
>>>>>
>>>>>
>>>>>
>>>>>> On Mon, 25 Aug 2008 18:52:32 +0200
>>>>>>  "Jaro.Lajovic" <Jaro.Lajovic at mf.uni-lj.si> wrote:
>>>>>>> Dear John,
>>>>>>>
>>>>>>> Please find attached zipped Slovenian versions of .po (plain text
>>>>>>> and UTF-8 coded text) and .mo files.
>>>>>>>
>>>>>>> However, there seems to be a problem I have not been able to
>>>>>>> resolve. While special characters display properly under R version
>>>>>>> 2.5.1 with Rcmdr 1.3-5, they fail to display (= are substituted by
>>>>>>> black blocks) under R version 2.7.1 with the new Rcmdr 1.4-0. By the
>>>>>>> way: the .mo file of the ver. 1.3-5 copied to 1.4-0 also failed to
>>>>>>> display properly.
>>>>>>>
>>>>>>> (An additional detail: three special characters that are used in
>>>>>>> the Slo version are c, s and z with hacek. c with hacek is not
>>>>>>> affected, it is just s and z with hacek that are not displayed OK.)
>>>>>>>
>>>>>>> Your advice will be much appreciated.
>>>>>>>
>>>>>>> With best regards,
>>>>>>> Jaro
>>>>
>>>> --------------------------------
>>>> John Fox, Professor
>>>> Department of Sociology
>>>> McMaster University
>>>> Hamilton, Ontario, Canada
>>>> http://socserv.mcmaster.ca/jfox/
>>>>
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>> --
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

