[R] Unicode characters (R 2.7.0 on Windows XP SP3 and Hardy Heron)
Hans-Jörg Bibiko
bibiko at eva.mpg.de
Sun Jun 1 21:50:59 CEST 2008
On 31.05.2008, at 00:11, Prof Brian Ripley wrote:
> On Fri, 30 May 2008, Duncan Murdoch wrote:
>> But I think with Brian Ripley's work over the last while, R for
>> Windows actually handles utf-8 pretty well. (It might not guess
>> at that encoding, but if you tell it that's what you're using...)
Yes. I already mentioned that there was a big step from R 2.6 to R
2.7 for Windows regarding UTF-8 support.
> R passes around, prints and plots UTF-8 character data pretty well,
> but it translates to the native encoding for almost all character-
> level manipulations (and not just on Windows). ?Encoding spells
> out the exceptions (and I think the original poster had not read
> it). As time goes on we may add more, but it is really tedious
> (and somewhat error-prone) to have multiple paths through the code
> for different encodings (and different OSes do handle these
> differently -- Windows' use of UTF-16 means that one character may
> not be one wchar_t).
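To make that concrete for readers following along, here is a small
example (the output may of course differ by platform and locale):

## "ö" as the two UTF-8 bytes 0xC3 0xB6, declared as UTF-8
x <- rawToChar(as.raw(c(0xc3, 0xb6)))
Encoding(x) <- "UTF-8"

Encoding(x)        # "UTF-8" -- the declared encoding travels with the string
nchar(x)           # 1 character ...
nchar(x, "bytes")  # ... stored in 2 bytes

## conversion helpers documented on ?Encoding
enc2native(x)      # translation to the native encoding, as described above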
R is becoming more and more popular amongst philologists, linguists,
etc. It is very nice to have one software environment to gather,
analyze, and visualize data based on texts. But linguists, for
example, very often deal with more than one language at the same
time. That's why they have to use a Unicode encoding.
In R they have to use the functions dealing with characters, like
nchar, strsplit, grep/gsub, case conversion with tolower/toupper, etc.
These functions are, more or less, based on the underlying locale
settings. But why?
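To illustrate what I mean (a small sketch; the behaviour of the last
two lines depends on the current locale):

## "Jörg" built from its UTF-8 bytes and declared as UTF-8
x <- rawToChar(as.raw(c(0x4a, 0xc3, 0xb6, 0x72, 0x67)))
Encoding(x) <- "UTF-8"

nchar(x)              # 4 characters (5 bytes)
toupper(x)            # the "ö" is upper-cased only if the current locale knows it
strsplit(x, "")[[1]]  # splitting into single characters -- again locale-dependent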
It is a very painful task to write functions for different encodings
on different platforms. Thus I wonder whether it would be possible to
switch internally to one Unicode encoding. If one considers e.g. the
memory usage, UTF-8 would be an option. Of course, such a change
would REALLY be a BIG challenge in terms of effort, speed,
compatibility, etc. It would also mean avoiding the use of system
libraries.
Maybe this would be a task for R 4.0, or it will remain my eternal
private dream :)
OK. Let me be a bit more realistic.
Another issue is the regular expression engine that is used. On a Mac
or UNIX machine one can set a UTF-8 locale. Fine. But such locales
aren't available under Windows (yet?). Maybe it's worth having a look
at other regexp engines like Oniguruma ( http://www.geocities.jp/
kosako3/oniguruma/ ). It supports, among others, all Unicode
encodings and is used in many applications. I do not know how
difficult it would be to integrate such a library into R, but this
would solve, I guess, 80% of the problems of R users who are
interested in text analysis. nchar, strsplit, grep, etc. could make
use of it.
Maybe one could write such a package for Windows (maybe also for Mac/
UNIX, because Oniguruma has some very nice additional features). Of
course, a string would have to be passed as a UTF-8 byte stream to
the Oniguruma library, and I do not know whether this is easily
possible in R for Windows.
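At least producing that byte stream looks straightforward at the R
level (a sketch; the onig_match_utf8 entry point below is purely
hypothetical, just to show the idea):

x <- "Jörg"                       # however the text was entered or read
bytes <- charToRaw(enc2utf8(x))   # the UTF-8 byte stream of the string

## a wrapper package could then hand these bytes to the C library, e.g.
## .Call("onig_match_utf8", charToRaw(enc2utf8(pattern)), bytes)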
Once again, thanks for all the effort that has gone into such a
wonderful piece of software.
--Hans