[Rd] collation order
Simon Urbanek
simon.urbanek at r-project.org
Fri Mar 17 23:32:39 CET 2006
On Mar 17, 2006, at 4:32 PM, Thomas Lumley wrote:
> The following caused a hard-to-diagnose problem for a user of the
> survey package. Presumably this is a strange Unicode thing,
It is independent of the encoding. On GNU/Linux:
urbanek@corrino:~$ LC_COLLATE=en_US R --vanilla -q<tr
> "1//"<"10/"
[1] TRUE
> "1//2"<"10/2"
[1] FALSE
> Sys.getlocale("LC_COLLATE")
[1] "en_US"
(en_US is ISO-8859-1 on that machine)
And systems don't seem to agree on anything but the C locale:
Mac OS X:
caladan:urbanek$ LC_COLLATE=en_US R --vanilla -q<tr
> "1//"<"10/"
[1] TRUE
> "1//2"<"10/2"
[1] TRUE
> Sys.getlocale("LC_COLLATE")
[1] "en_US"
IRIX:
fry:urbanek$ LC_COLLATE=en_US R --vanilla -q<tr
> "1//"<"10/"
[1] FALSE
> "1//2"<"10/2"
[1] FALSE
> Sys.getlocale("LC_COLLATE")
[1] "en_US"
But at least most systems are consistent when a character is appended
to both strings (the result of the comparison does not change), except
for GNU/Linux.
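Since the C locale is the one thing the systems agree on, a minimal sketch of a portable comparison is to switch LC_COLLATE to "C" for the duration of the test and restore it afterwards. The variable names (old, a, b) are only illustrative, and this assumes an ASCII-compatible encoding, where "/" (0x2F) precedes "0" (0x30):

```r
# Sketch: compare the strings byte-wise by forcing the C collation,
# then restore whatever collation was active before.
old <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")

a <- "1//"  < "10/"    # second byte decides: "/" (0x2F) vs "0" (0x30)
b <- "1//2" < "10/2"   # same second byte decides, so same result as a

Sys.setlocale("LC_COLLATE", old)
print(c(a, b))
```

In the C locale the appended "2" is never reached, because the strings already differ at the second byte, so both comparisons give the same answer.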
Looking at the locale definitions, GNU/Linux uses the "iso14651_t1"
template for many languages. Maybe the problem is that "/" is defined
in the "SPECIAL" section of the ISO-14651 template, which possibly
causes "/" to be completely ignored in the "LATIN" part. That would
explain the behavior (("1"<"10")==TRUE, ("12"<"102")==FALSE). I
couldn't find anything on what the "official" en_** collation should
be, so I have no idea whether this is a bug in the GNU/Linux locales
or not...
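The guess above can be emulated in R by deleting "/" from both strings before comparing them. This is only a sketch of the hypothesis, not of what the GNU/Linux locale code actually does; strip_slash is a hypothetical helper introduced here for illustration:

```r
# Hypothetical helper: drop every "/" so that only the remaining
# characters take part in the comparison.
strip_slash <- function(x) gsub("/", "", x, fixed = TRUE)

# "1//"  -> "1"   and "10/"  -> "10"
r1 <- strip_slash("1//")  < strip_slash("10/")
# "1//2" -> "12"  and "10/2" -> "102"
r2 <- strip_slash("1//2") < strip_slash("10/2")

print(c(r1, r2))
```

With "/" removed, the first pair reduces to "1" versus "10" and the second to "12" versus "102", which is the TRUE/FALSE pattern seen on GNU/Linux.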
Cheers,
Simon