[R] difference in sort order linux/Windows (R.2.11.0)
Duncan Murdoch
murdoch.duncan at gmail.com
Fri May 28 16:37:39 CEST 2010
On 28/05/2010 9:24 AM, (Ted Harding) wrote:
> An experiment:
>
> sort(c("AACD","A CD"))
> # [1] "AACD" "A CD"
>
> sort(c("ABCD","A CD"))
> # [1] "ABCD" "A CD"
>
> sort(c("ACCD","A CD"))
> # [1] "ACCD" "A CD"
>
> sort(c("ADCD","A CD"))
> # [1] "A CD" "ADCD"
>
> sort(c("AECD","A CD"))
> # [1] "A CD" "AECD"
> ## (with results for "AFCD", ... "AZCD" similar to the last two).
>
> LC_COLLATE=en_GB.UTF-8
>
> (R version 2.11.0 (2010-04-22) on Linux).
>
> So this behaves, in en_GB.UTF-8, as though " " (SPACE) is between
> "C" and "D".
>
> This is nuts!!!
>
> Curable if I set (e.g.) LC_COLLATE="C" on startup. But what else
> might break if I do so?
>
You have to realize that, to a large extent, this is not under our
control. Your system will have linked to some library (outside of R) to
do string collation, and the problem lies in that library. You should
determine which system library is handling your collations.

I'd like to tell you how to do that, but I don't know how for your build.
You can find out whether you're using the recommended ICU library by
running example(icuSetCollate). On Windows, that gives a number of
warnings like

  In icuSetCollate(locale = "da_DK", case_first = "default") :
    ICU is not supported on this build

If you don't see those warnings, then ICU is presumably in use, and you
want to talk to the ICU people. If you do see them, then you'll need to
look deeper to find out what you're actually using.
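
Here is a minimal sketch of how one might inspect the collation setup from
within R and compare the locale-dependent order with the plain "C" order.
It changes only the LC_COLLATE category for the session, not the whole
locale, and restores it afterwards. The capabilities() check for "ICU" is
an assumption about newer R builds (it is not available in R 2.11.0), so
it is guarded; the variable names are just for illustration.

```r
# Which locale currently governs string collation in this session?
old <- Sys.getlocale("LC_COLLATE")
print(old)

# Newer R builds report ICU availability in capabilities(); this entry
# is assumed absent in older versions such as 2.11.0, hence the guard.
caps <- capabilities()
if ("ICU" %in% names(caps)) print(caps[["ICU"]])

x <- c("ADCD", "A CD", "ABCD")

# Locale-dependent order: the result varies with the platform and with
# the collation library R was linked against.
print(sort(x))

# Switch only collation to the C locale. Strings are then compared
# byte by byte, so SPACE (0x20) sorts before every letter.
Sys.setlocale("LC_COLLATE", "C")
c_sorted <- sort(x)
print(c_sorted)   # "A CD" "ABCD" "ADCD"

# Restore the previous collation locale.
Sys.setlocale("LC_COLLATE", old)
```

Setting only LC_COLLATE leaves the other locale categories (character
type, messages, dates and so on) as they were, which limits what else can
change.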
Duncan Murdoch
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 28-May-10 Time: 14:24:08
> ------------------------------ XFMail ------------------------------
>