[Rd] latin1,utf-8...encoding and data

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Oct 25 11:42:58 CEST 2006


This is indeed unfortunate, but expecting Chinese speakers (20% of the 
world's population) to write in Latin-1 was also unfortunate.

What I had (and still have) some hope of doing is being able to mark 
character strings as UTF-8, probably via a flag bit on the CHARSXP.  Then 
output routines could be made to convert (if possible) to the current 
locale.  But that was before I found out how hard it was to get non-ASCII 
characters displayed correctly.  Also, any solution to this almost 
certainly means abandoning Windows 95/98/ME as they don't have support for 
Unicode (and although we could add such support at C level, they would not 
have the Unicode fonts needed).  (It might be OK to do that now, but it 
was not a couple of years ago.)

Don't underestimate the font problem.  Last week I gave a seminar about 
statistical computing in my own dept, and I thought I would show R 
operating in Chinese (we have quite a number of Chinese speakers, indeed 
more than from Latin-1 languages).  It did not work correctly in my 
pre-seminar tests, because there were no CJK fonts installed on the 
lecture-room computer.  There was neither a warning nor an error, just 
unintelligible output.

If you are only concerned with Latin-1 and UTF-8, there is something you 
can do.  Rather than have a .rda file, store your datasets as .R files, 
with another .R file as a driver.  So you would need something like

ex1.R:
source("ex1_dat.R", encoding="latin1")

ex1_dat.R:
dump of the object, converted to latin1.

If you don't specify lazydata, this will ensure the object gets converted 
to the current locale when the data() statement is executed.  If you do 
specify lazydata, the conversion will happen when the package is 
installed, which is fine if you (and any other users) always use the same 
locale (or at least always use a locale with the same encoding, e.g. 
always use a UTF-8 locale).  However, this is really only of use in 
locales that will have font coverage of Latin-1, and R installations 
without iconv will not do any necessary conversion (which is why I 
suggest dumping in latin1 and not in UTF-8).
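
A minimal sketch of the two-file arrangement described above, for anyone who wants to try it outside a package.  The object name 'mydata', its contents and the temporary directory are made up for illustration, and the locale check is only an assumption about which sessions can represent Latin-1 characters.

```r
# Illustrative sketch; 'mydata' and the temporary directory are made-up names.
data_dir <- tempfile("exdata_")
dir.create(data_dir)

# ex1_dat.R: the dump of the object, stored as Latin-1 bytes.  The accented
# letters are written as \u escapes so that this script itself is pure ASCII.
dump_utf8 <- "mydata <- data.frame(x = 1:2, row.names = c(\"caf\u00e9\", \"th\u00e9\"))"
dump_latin1 <- iconv(dump_utf8, from = "UTF-8", to = "latin1")
dat_file <- file.path(data_dir, "ex1_dat.R")
writeLines(dump_latin1, dat_file, useBytes = TRUE)  # write the bytes unchanged

# ex1.R: the driver, which names the encoding of the file it sources.
drv_file <- file.path(data_dir, "ex1.R")
writeLines('source("ex1_dat.R", encoding = "latin1")', drv_file)

# Re-encoding to the session's encoding can only succeed where that encoding
# contains the Latin-1 characters, so only run the driver in such a locale.
loc <- l10n_info()
can_show_latin1 <- isTRUE(loc[["UTF-8"]]) || isTRUE(loc[["Latin-1"]])
if (can_show_latin1) {
  # chdir = TRUE lets the driver find ex1_dat.R by its relative name.
  source(drv_file, chdir = TRUE)
  print(rownames(mydata))
}
```

In a package the two files would sit in the data directory and data(ex1) would run the driver; the explicit source() call here only stands in for that step.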


On Thu, 19 Oct 2006, Martin Maechler wrote:

>>>>>> "Stéphane" == Stéphane Dray <dray at biomserv.univ-lyon1.fr>
>>>>>>     on Thu, 19 Oct 2006 09:46:49 +0200 writes:
>
>    Stéphane> Thanks a lot for this clear answer. So there is no way to preserve our
>    Stéphane> french cultural exception (accented characters),
>
> I agree that there are many French cultural exceptions ;-)
> --- and as a Swiss, I highly esteem several of them ---
> however "accented" characters (with the appropriate meaning of "accented")
> are not at all a French exception, rather almost a continental
> European one {as long as we are staying in the "latin" alphabet
> context}.  If I think of what I know of Europe, the only
> countries/languages *not* using some version of "accented"
> characters are the British and (I think) the Dutch/Flemish.
> Everyone else (? probably I forgot some, and don't know about others
> like Gaelic,...)  has some kind of accents...
>
> I agree with Stéphane that this is unfortunate for quite a few
> of us, and it came as a big surprise to me when I first heard
> about this from Brian.  .. aah, life was easy when we western
> chauvinists could behave as if the whole relevant part of the
> world was happy with iso-latin1...
>
> Martin
>
>
>    Stéphane> if we want to be international... I have thought
>    Stéphane> that the inclusion of a parameter encoding in data
>    Stéphane> function (e.g. data(mydata,encoding="latin1"))
>    Stéphane> like in the function 'file' could be a way to
>    Stéphane> solve the problem. Apparently, the problem is much
>    Stéphane> more complicated...
>
>    Stéphane> Sincerely.
>
>
>    Stéphane> Prof Brian Ripley wrote:
>
>    >> Only ASCII letters are portable: those accented characters do not even
>    >> exist in many of the encodings used for R, e.g. Russian and Japanese
>    >> on Windows machines.
>    >>
>    >> There is no way to associate an encoding with a character string in
>    >> R.  We considered it, but it would have had severe back-compatibility
>    >> problems and little advantage (you cannot display non-ASCII character
>    >> strings portably: even if you have a Unicode encoding you still need
>    >> to select a suitable font).
>    >>
>    >> 'B. Ripley' (sic)
>    >>
>    >>
>    >> On Wed, 18 Oct 2006, Stéphane Dray wrote:
>    >>
>    >>> Hello,
>    >>> I have some questions concerning encoding and package distribution. We
>    >>> develop the ade4 package. For some data sets included in the package,
>    >>> there are accented characters (e.g. é,è...). The data sets have been
>    >>> saved using latin1 encoding, but some of us use utf-8 and cannot see
>    >>> some data sets which contain accented characters.
>    >>> e.g:
>    >>>
>    >>> library(ade4)
>    >>> data(rankrock)
>    >>> rankrock
>    >>>
>    >>> in this case, characters are in rownames. Other data sets have such
>    >>> characters in data (e.g. levels of factors..). A solution is to use
>    >>> iconv... this is quite easy for us but perhaps more difficult for a user
>    >>> who may have no idea of the problem. This problem is quite marginal
>    >>> for the moment but some Linux distributions are utf-8 by default (e.g.
>    >>> Ubuntu) and I suppose that the problem will be more and more present in
>    >>> the future.
>    >>>
>    >>> So we wonder if there is a proper way to code and save these data sets.
>    >>> I have found some documents of B. Ripley and this note :
>    >>>
>    >>> http://developer.r-project.org/210update.txt
>    >>>
>    >>> -  Names in data objects (e.g. in .rda files) are problematic.  It
>    >>> is likely that by release time these will be treated as in
>    >>> Latin-1.
>    >>>
>    >>> If I am correct, I did not find an answer to this problem.
>    >>>
>    >>> What are the plans of R gurus on this question ?
>    >>> Thanks a lot.
>    >>> Sincerely.
>    >>>
>    >>> Please add my address in answers as I am not a subscriber of this list.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

