[Rd] collation order

Fri Mar 17 23:32:39 CET 2006

On Mar 17, 2006, at 4:32 PM, Thomas Lumley wrote:

> The following caused a hard-to-diagnose problem for a user of the  
> survey package.  Presumably this is a strange Unicode thing,

It is independent of the encoding:

urbanek@corrino:~$ LC_COLLATE=en_US R --vanilla -q<tr
 > "1//"<"10/"
[1] TRUE
 > "1//2"<"10/2"
[1] FALSE
 > Sys.getlocale("LC_COLLATE")
[1] "en_US"

(en_US is ISO-8859-1 on that machine)

And systems don't seem to agree on anything but the C locale:

Mac OS X:
caladan:urbanek$ LC_COLLATE=en_US R --vanilla -q<tr
 > "1//"<"10/"
[1] TRUE
 > "1//2"<"10/2"
[1] TRUE
 > Sys.getlocale("LC_COLLATE")
[1] "en_US"

IRIX:
fry:urbanek$ LC_COLLATE=en_US R --vanilla -q<tr
 > "1//"<"10/"
[1] FALSE
 > "1//2"<"10/2"
[1] FALSE
 > Sys.getlocale("LC_COLLATE")
[1] "en_US"

But at least most systems are self-consistent: appending the same
character to both strings does not change the result. GNU/Linux is
the exception.
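
For code that has to order such strings the same way everywhere, one
option is to force the C locale for collation inside R. A minimal
sketch using only base R (the variable names are just for
illustration):

```r
# Remember the current collation locale so it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")

# Switch string comparison to the C locale (plain byte order).
invisible(Sys.setlocale("LC_COLLATE", "C"))

r1 <- "1//"  < "10/"    # TRUE: "/" (0x2F) sorts before "0" (0x30)
r2 <- "1//2" < "10/2"   # TRUE for the same reason
print(c(r1, r2))

# Restore the previous setting.
invisible(Sys.setlocale("LC_COLLATE", old))
```

Restoring the old value keeps the change from leaking into the rest
of the session.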

Looking at the locale definitions, GNU/Linux uses the "iso14651_t1"
template for many languages. Maybe the problem is that "/" is defined
in the "SPECIAL" section of the ISO-14651 template, which possibly
causes "/" to be completely ignored in the "LATIN" part. That would
explain the behavior, since with "/" dropped the comparisons become
("1"<"10")==TRUE and ("12"<"102")==FALSE. I couldn't find anything on
what the "official" en_** collation should be, so I have no idea
whether this is a bug in the GNU/Linux locales or not...
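
The guess above can be checked with a small sketch that drops "/"
and then compares by code point. This only models the hypothesis; it
is not how glibc actually implements collation, and the helper names
(code_point_less, drop_slash) are made up for illustration:

```r
# Hypothetical helper: compare two strings by code point,
# independent of the session's LC_COLLATE setting.
code_point_less <- function(a, b) {
  x <- utf8ToInt(a)
  y <- utf8ToInt(b)
  n <- min(length(x), length(y))
  d <- which(x[seq_len(n)] != y[seq_len(n)])
  if (length(d) > 0) x[d[1]] < y[d[1]] else length(x) < length(y)
}

# Hypothetical helper: model a collation that ignores "/".
drop_slash <- function(s) gsub("/", "", s, fixed = TRUE)

# With "/" ignored: same pattern as the GNU/Linux output.
ignored <- c(code_point_less(drop_slash("1//"),  drop_slash("10/")),
             code_point_less(drop_slash("1//2"), drop_slash("10/2")))
print(ignored)   # TRUE FALSE

# With "/" kept: plain code point order.
kept <- c(code_point_less("1//", "10/"),
          code_point_less("1//2", "10/2"))
print(kept)      # TRUE TRUE
```

The first result reproduces the TRUE/FALSE pair seen on GNU/Linux;
the second coincides with the Mac OS X output shown earlier.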

Cheers,
Simon