[Rd] collation order

Fri Mar 17 23:56:29 CET 2006

Thomas Lumley <tlumley at u.washington.edu> writes:

> The following caused a hard-to-diagnose problem for a user of the survey 
> package.  Presumably this is a strange Unicode thing, but is there a 
> convenient reference for how the collation order is determined? I am 
> surprised that adding the same character to the end of two strings of the 
> same length can change the sorting order.
> 
> in en_US.utf8 locale
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] FALSE
> 
> in C locale on the same system.
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] TRUE
> 
> [This is in r-devel of March 6, but the problem that was reported to me 
> involved Windows vs Linux on released versions]

Unicode has nothing to do with it (same thing in ISO-8859-1). It is
(I think) about characters being skipped during collating, i.e. the
same effect as this:

> Sys.setlocale(locale="C")
[1] "C"
> "Thomas  O'Malley" < "Thomas Lumley"
[1] TRUE
> Sys.setlocale(locale="en_US.UTF8")
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> "Thomas  O'Malley" < "Thomas Lumley"
[1] FALSE
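
If the comparison has to come out the same on every platform, one option is
to do it under C collation, where strings are compared byte by byte. The
following is only a minimal sketch: c_less() is a hypothetical helper name,
not something from the survey package or base R, and it assumes that
temporarily switching LC_COLLATE is acceptable in the calling code.

```r
# Hypothetical helper (not from the thread): evaluate a < b under C collation,
# then restore whatever LC_COLLATE the session had before.
c_less <- function(a, b) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  a < b
}

c_less("1//2", "10/2")                       # TRUE: "/" (0x2F) sorts before "0" (0x30)
c_less("Thomas  O'Malley", "Thomas Lumley")  # TRUE: " " (0x20) sorts before "L" (0x4C)
```

If "/" really is skipped on the first pass in en_US, that would also account
for the original example: "1//" vs "10/" reduces to "1" vs "10", while
"1//2" vs "10/2" reduces to "12" vs "102", and the order flips.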

> 
>  	-thomas
> 
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907