[R-SIG-Mac] a question of alphabetical order [follow-up]

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Apr 17 10:21:04 CEST 2008


Your 'nightmare' does seem specific to Mac OS.  Your example is collated 
correctly in all the es_ES locales on my Linux box, and also in 
es_ES.UTF-8 on Solaris 10.

We have no idea what data you collected to assert 'whatever platform we 
use'.  UTF-8 locales on Mac OS X are the only instance in C that I am 
aware of that uses Unicode code point order (quite a few scripting languages 
do it, though).  If the problem were widespread I would expect it to be 
reported more than it is (and 'ls' output is locale-specific in recent 
versions of Linux, and my IT team did get several help requests about 
that).
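
To illustrate what code point order does to such strings, here is a minimal
sketch in C, assuming the strings are UTF-8 encoded.  The names
sort_by_code_point and by_code_point are made up for illustration and are
not part of R.  Byte-wise strcmp on UTF-8 gives the same order as Unicode
code points, so every word with an accented initial sorts after all words
starting with an ASCII letter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain byte comparison.  For UTF-8 text this is the
   same as comparing Unicode code points one by one. */
static int by_code_point(const void *x, const void *y)
{
    const char *a = *(const char *const *)x;
    const char *b = *(const char *const *)y;
    return strcmp(a, b);
}

/* Hypothetical helper: sort an array of n UTF-8 strings in place,
   in code point order (no locale involved). */
void sort_by_code_point(const char **words, size_t n)
{
    assert(words != NULL || n == 0);
    if (n > 1)
        qsort(words, n, sizeof words[0], by_code_point);
}
```

Applied to the eleven words in the quoted example, this puts "ángulo" last,
after "tren", because its first byte (0xC3) is larger than any ASCII letter.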

Collation is a tricky area, but that does not mean that OS designers are 
in general shy of it.  There was a concerted project, the Unicode 
Collation Algorithm, and several OSes have implementations including 
national 'tailorings'.

What can be done about it?  The obvious answer is to use a reliable OS. 
Alternatively, R is making use of the system's C collation functions and 
those could be replaced.  In current R (>= 2.7.0) this is centralized in 
src/main/utils.c, in the code (not Windows)

# ifdef HAVE_STRCOLL
#  define STRCOLL strcoll
# else
#  define STRCOLL strcmp
# endif

int Scollate(SEXP a, SEXP b)
{
     return STRCOLL(translateChar(a), translateChar(b));
}
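
For anyone wanting to check what their own system's strcoll does, a minimal
sketch follows.  The helper name collate_in is hypothetical and is not in
the R sources; it switches LC_COLLATE temporarily, falls back to strcmp if
the requested locale is not installed, and returns only the sign of the
comparison.  Locale names such as "es_ES.UTF-8" are system-specific.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: compare a and b with strcoll under the named
   LC_COLLATE locale.  Returns -1, 0 or 1.  *ok is set to 1 if the locale
   could be selected, and to 0 if not (the result then comes from strcmp). */
int collate_in(const char *locale, const char *a, const char *b, int *ok)
{
    char saved[256];
    const char *current;
    int r;

    assert(locale != NULL && a != NULL && b != NULL && ok != NULL);

    /* Copy the current setting: the pointer returned by setlocale may be
       invalidated by the next call. */
    current = setlocale(LC_COLLATE, NULL);
    snprintf(saved, sizeof saved, "%s", current != NULL ? current : "C");

    if (setlocale(LC_COLLATE, locale) == NULL) {
        *ok = 0;
        r = strcmp(a, b);
    } else {
        *ok = 1;
        r = strcoll(a, b);
        setlocale(LC_COLLATE, saved);   /* restore the previous locale */
    }
    return (r > 0) - (r < 0);
}
```

Calling collate_in("es_ES.UTF-8", "ángulo", "barco", &ok) should give -1 on
a system whose es_ES collation is doing its job, and 1 where strcoll falls
back to code point order.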

Mac OS X has strcoll (it is a C99 function, so that test is historical), 
and what would be needed would be to replace it by a more functional 
version.  My suspicion is that Mac OS X does have proper collation 
functionality (http://en.wikipedia.org/wiki/Common_Locale_Data_Repository 
appears to claim it uses CLDR data), but that it is not used in the ISO 
C99 part of the OS.  For example, Cocoa seems to have a function 
'localizedCompare'.
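
As a rough idea of what a 'more functional version' might look like for one
language only, here is a sketch.  It is not a proposal for R and in no way
an implementation of the Unicode Collation Algorithm.  It assumes UTF-8
input and handles only the Spanish letters mentioned in the quoted message
(plus ü); the names spanish_compare and primary_weight are invented for
illustration.  Accents and case are ignored at the first level, ñ sorts as
a separate letter between n and o, and ch and ll are treated as ordinary
letter pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: consume one character at *p and return its primary
   weight.  Weights are 2 * (lower-case base letter), so that ñ can take
   the odd value between n and o. */
static int primary_weight(const unsigned char **p)
{
    const unsigned char *s = *p;
    int w = 0;

    if (s[0] == 0xC3) {               /* lead byte of U+00C0..U+00FF */
        switch (s[1]) {
        case 0x81: case 0xA1: w = 2 * 'a'; break;      /* Á á */
        case 0x89: case 0xA9: w = 2 * 'e'; break;      /* É é */
        case 0x8D: case 0xAD: w = 2 * 'i'; break;      /* Í í */
        case 0x93: case 0xB3: w = 2 * 'o'; break;      /* Ó ó */
        case 0x9A: case 0xBA:                          /* Ú ú */
        case 0x9C: case 0xBC: w = 2 * 'u'; break;      /* Ü ü */
        case 0x91: case 0xB1: w = 2 * 'n' + 1; break;  /* Ñ ñ */
        default: break;               /* not handled: compare as bytes */
        }
    }
    if (w != 0) {
        *p = s + 2;
        return w;
    }
    *p = s + 1;
    if (s[0] >= 'A' && s[0] <= 'Z')
        return 2 * (s[0] - 'A' + 'a');
    return 2 * s[0];
}

/* Hypothetical strcoll stand-in: returns -1, 0 or 1. */
int spanish_compare(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    int t;

    assert(a != NULL && b != NULL);

    while (*pa != 0 && *pb != 0) {
        int wa = primary_weight(&pa);
        int wb = primary_weight(&pb);
        if (wa != wb)
            return wa < wb ? -1 : 1;
    }
    if (*pa != 0) return 1;           /* b is a prefix of a */
    if (*pb != 0) return -1;          /* a is a prefix of b */

    /* Same primary weights: break the tie on the raw bytes so that the
       ordering is total and deterministic. */
    t = strcmp(a, b);
    return (t > 0) - (t < 0);
}
```

With this comparator the words of the quoted example come out as aleta,
ángulo, avión, barco, bicicleta, camión, choco, coche, llave, luna, tren.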


BTW, Ei-ji Nakama has already replaced the broken wctype and wcwidth 
functions in Mac OS: see file src/main/rlocale.c


On Wed, 16 Apr 2008, [Ricardo Rodriguez] Your XEN ICT Team wrote:

> Hi,
>
> This issue comes from a thread of the same title, "a question of
> alphabetical order", initiated yesterday in r-help at r-project.org list.
> As it now affects only the Mac environment, I follow Brian Ripley's advice
> and move it to this list.
>
> It is now clear that ordering lists/variable values is a kind of
> nightmare whatever platform we use. As I (and possibly many others!)
> need to get a right order, or an "as right as possible" order, for lists
> of strings using non-ASCII characters, namely áéíóú, ÁÉÍÓÚ and ñ,Ñ, we
> have been considering a number of options.
>
> Hans-Joerg Bibiko proposed a customized function to do the trick. Brian
> Ripley spoke about es_ES.ISO8859-15 doing almost the right thing for
> these characters.
>
> Here is what I get working on a MacBook whose environment I describe at
> the bottom of the message:
>
> http://mire.environmentalchange.net/~webmaster/images/toPlot.png
>
> Here is the code:
>
> png(file="toPlot.png", pointsize = 14, width = 1000, height = 480, units
> = "px", bg="#eaedd5")
> Sys.setlocale(category = "LC_ALL", locale = "es_ES.ISO8859-15")
> toPlot <- data.frame(medio=c("avión", "barco", "bicicleta", "ángulo",
> "choco", "camión", "coche", "tren", "aleta", "luna", "llave"),
> variable=c(34, 33, 3, 37, 54, 23, 67, 30, 23, 56, 13))
> toPlot<-toPlot[order(toPlot$medio),]
> Sys.setlocale(category = "LC_ALL", locale = "en_GB.UTF-8")
> barplot(toPlot$variable,names.arg=toPlot$medio)
> dev.off()
>
> As you see in the order of labels, accent is not ignored, and ch and ll
> are considered as single instances. This is no longer the case with
> Spanish alphabetical order. It changed in 1994.
>
> So, Hans's solution seems the only one available to get the correct order,
> at least when working in the environment described below.
>
> In any case, please,
>
> 1. Are you aware of any new locale we could try to see if it is already
> updated?
> 2. If it doesn't exist, how/where must we go to propose/start creating
> such a locale?
>
> Here is the environment:
>
> > version
>               _
> platform       i386-apple-darwin9.2.2
> arch           i386
> os             darwin9.2.2
> system         i386, darwin9.2.2
> status         beta
> major          2
> minor          7.0
> year           2008
> month          04
> day            12
> svn rev        45280
> language       R
> version.string R version 2.7.0 beta (2008-04-12 r45280)
> > sessionInfo()
> R version 2.7.0 beta (2008-04-12 r45280)
> i386-apple-darwin9.2.2
>
> locale:
> en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> >
>
> R GUI 1.24-devel (5072)
>
> Thank you so much for your help,
>
> Ricardo
>
> --
> Ricardo Rodríguez
> Your XEN ICT Team
>
> _______________________________________________
> R-SIG-Mac mailing list
> R-SIG-Mac at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

