[Rd] collation order
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Fri Mar 17 23:56:29 CET 2006
Thomas Lumley <tlumley at u.washington.edu> writes:
> The following caused a hard-to-diagnose problem for a user of the survey
> package. Presumably this is a strange Unicode thing, but is there a
> convenient reference for how the collation order is determined? I am
> surprised that adding the same character to the end of two strings of the
> same length can change the sorting order.
>
> in en_US.utf8 locale
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] FALSE
>
> in C locale on the same system.
> > "1//"<"10/"
> [1] TRUE
> > "1//2"<"10/2"
> [1] TRUE
>
> [This is in r-devel of March 6, but the problem that was reported to me
> involved Windows vs Linux on released versions]
Unicode has nothing to do with it (same thing in ISO-8859-1). It is
(I think) about characters being skipped during collating, i.e. the
same effect as this:
> Sys.setlocale(locale="C")
[1] "C"
> "Thomas O'Malley" < "Thomas Lumley"
[1] TRUE
> Sys.setlocale(locale="en_US.UTF8")
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C"
> "Thomas O'Malley" < "Thomas Lumley"
[1] FALSE
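
A rough way to see how skipped characters could produce the results in
the original report is sketched below. This is only a simplified model,
not what the C library actually does: it assumes punctuation is dropped
entirely before comparing, and it compares code points directly so the
outcome does not depend on the session locale. The helpers strip_punct()
and less_codepoint() are hypothetical names made up for this
illustration; they are not part of R.

```r
# Hypothetical helper (not part of R): drop punctuation, as a crude
# stand-in for a collation that skips such characters.
strip_punct <- function(x) gsub("[[:punct:]]", "", x)

# Hypothetical helper (not part of R): TRUE if 'a' sorts strictly before
# 'b' when compared code point by code point, independent of the locale.
less_codepoint <- function(a, b) {
  ia <- utf8ToInt(a)
  ib <- utf8ToInt(b)
  n <- min(length(ia), length(ib))
  if (n > 0) {
    d <- which(ia[seq_len(n)] != ib[seq_len(n)])
    # The first differing position decides the order.
    if (length(d) > 0) return(ia[d[1]] < ib[d[1]])
  }
  # One string is a prefix of the other: the shorter one sorts first.
  length(ia) < length(ib)
}

# Plain code point comparison, matching the C locale results above.
less_codepoint("1//", "10/")                              # TRUE
less_codepoint("1//2", "10/2")                            # TRUE

# With punctuation skipped, the strings become "1" vs "10" and
# "12" vs "102", matching the en_US.utf8 results above.
less_codepoint(strip_punct("1//"), strip_punct("10/"))    # TRUE
less_codepoint(strip_punct("1//2"), strip_punct("10/2"))  # FALSE
```

Once the slashes are ignored, appending "2" turns a prefix comparison
("1" vs "10") into a comparison of "2" against "0" at the second
position, which is why the order flips.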
>
> -thomas
>
> Thomas Lumley Assoc. Professor, Biostatistics
> tlumley at u.washington.edu University of Washington, Seattle
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907