[Rd] latin1,utf-8...encoding and data
Martin Maechler
maechler at stat.math.ethz.ch
Thu Oct 19 15:26:55 CEST 2006
>>>>> "Stéphane" == Stéphane Dray <dray at biomserv.univ-lyon1.fr>
>>>>> on Thu, 19 Oct 2006 09:46:49 +0200 writes:
Stéphane> Thanks a lot for this clear answer. So there is no way to preserve our
Stéphane> French cultural exception (accented characters),
I agree that there are many French cultural exceptions ;-)
--- and as a Swiss, I highly esteem several of them ---
however "accented" characters (with the appropriate meaning of "accented")
are not at all a French exception, rather almost a continental
European one {as long as we are staying in the "latin" alphabet
context}. If I think of what I know of Europe, the only
countries/languages *not* using some version of "accented"
characters are the British and (I think) the Dutch/Flemish.
Everyone else (? I have probably forgotten some, and don't know about
others like Gaelic, ...) has some kind of accents...
I agree with Stéphane that this is unfortunate for quite a few
of us, and it came as a big surprise to me when I first heard
about this from Brian ... aah, life was easy when we Western
chauvinists could behave as if the whole relevant part of the
world was happy with iso-latin1...
Martin
Stéphane> if we want to be international... I had thought
Stéphane> that adding an encoding parameter to the data
Stéphane> function (e.g. data(mydata, encoding="latin1")),
Stéphane> as in the function 'file', could be a way to
Stéphane> solve the problem. Apparently, the problem is much
Stéphane> more complicated...
Stéphane> Sincerely.
Stéphane> Prof Brian Ripley wrote:
>> Only ASCII letters are portable: those accented characters do not even
>> exist in many of the encodings used for R, e.g. Russian and Japanese
>> on Windows machines.
>>
>> There is no way to associate an encoding with a character string in
>> R. We considered it, but it would have had severe back-compatibility
>> problems and little advantage (you cannot display non-ASCII character
>> strings portably: even if you have a Unicode encoding you still need
>> to select a suitable font).
>>
>> 'B. Ripley' (sic)
>>
>>
>> On Wed, 18 Oct 2006, Stéphane Dray wrote:
>>
>>> Hello,
>>> I have some questions concerning encoding and package distribution. We
>>> develop the ade4 package. Some data sets included in the package
>>> contain accented characters (e.g. é, è...). The data sets have been
>>> saved using latin1 encoding, but some of us use utf-8 and cannot display
>>> the data sets which contain accented characters.
>>> e.g:
>>>
>>> library(ade4)
>>> data(rankrock)
>>> rankrock
>>>
>>> In this case, the characters are in the rownames. Other data sets have
>>> such characters in the data (e.g. levels of factors...). A solution is
>>> to use iconv... this is quite easy for us but perhaps more difficult for
>>> a user who may have no idea of the problem. This problem is quite
>>> marginal for the moment, but some Linux distributions are utf-8 by
>>> default (e.g. Ubuntu) and I suppose that the problem will become more
>>> and more common in the future.
>>>
>>> So we wonder if there is a proper way to encode and save these data sets.
>>> I have found some documents by B. Ripley and this note:
>>>
>>> http://developer.r-project.org/210update.txt
>>>
>>> - Names in data objects (e.g. in .rda files) are problematic. It
>>> is likely that by release time these will be treated as in
>>> Latin-1.
>>>
>>> Unless I am mistaken, this does not answer the problem.
>>>
>>> What are the plans of the R gurus on this question?
>>> Thanks a lot.
>>> Sincerely.
>>>
>>> Please add my address in answers as I am not a subscriber of this list.
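
The iconv workaround mentioned in the original question might look roughly
like the sketch below. It is only an illustration, not the ade4 authors'
code: the helper name latin1_to_utf8 and the sample label are hypothetical,
and the sketch assumes the strings really were stored as Latin-1.

```r
# Hypothetical helper: re-encode a character vector from Latin-1 to UTF-8.
latin1_to_utf8 <- function(x) iconv(x, from = "latin1", to = "UTF-8")

# An illustrative label ("grès") built from explicit Latin-1 bytes, so the
# example does not depend on the encoding of this file or of the locale.
x_latin1 <- rawToChar(as.raw(c(0x67, 0x72, 0xe8, 0x73)))

# After conversion the single byte 0xe8 becomes the UTF-8 pair 0xc3 0xa8.
x_utf8 <- latin1_to_utf8(x_latin1)
utf8_bytes <- charToRaw(x_utf8)

# Pure ASCII strings are left unchanged by the conversion.
ascii_unchanged <- identical(latin1_to_utf8("abc"), "abc")
```

For a loaded data set, the same helper could presumably be applied to the
affected parts, for example rownames(rankrock) <- latin1_to_utf8(rownames(rankrock)),
or to levels(f) for a factor f. This has to be repeated by every user in a
utf-8 locale, which is the inconvenience described in the question.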