[R] difference in sort order linux/Windows (R.2.11.0)

Daniel Nordlund djnordlund at verizon.net
Fri May 28 23:13:26 CEST 2010


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Ted Harding
> Sent: Friday, May 28, 2010 1:15 PM
> To: r-help at r-project.org
> Cc: carslaw
> Subject: Re: [R] difference in sort order linux/Windows (R.2.11.0)
> 
> On 28-May-10 14:37:39, Duncan Murdoch wrote:
> > On 28/05/2010 9:24 AM, (Ted Harding) wrote:
> >> An experiment:
> >>
> >>   sort(c("AACD","A CD"))
> >>   #  [1] "AACD" "A CD"
> >>
> >>   sort(c("ABCD","A CD"))
> >>   #  [1] "ABCD" "A CD"
> >>
> >>   sort(c("ACCD","A CD"))
> >>   #  [1] "ACCD" "A CD"
> >>
> >>   sort(c("ADCD","A CD"))
> >>   #  [1] "A CD" "ADCD"
> >>
> >>   sort(c("AECD","A CD"))
> >>   #  [1] "A CD" "AECD"
> >>   ## (with results for "AFCD", ... "AZCD" similar to the last two).
> >>
> >>   LC_COLLATE=en_GB.UTF-8
> >>
> >> (R version 2.11.0 (2010-04-22) on Linux).
> >>
> >> So this behaves, in en_GB.UTF-8, as though " " (SPACE) is between
> >> "C" and "D".
> >>
> >> This is nuts!!!
> >>
> >> Curable if I set (e.g.) LC_COLLATE="C" on startup. But what else
> >> might break if I do so?
> >>
> >
> > You have to realize that to a large extent this is not under our
> > control. Your system will have linked to some library (outside of R)
> > to do string collation, and the problem lies in that library. You
> > should determine which system library is handling your collations.
> >
> > I'd like to tell you how to do that, but I don't know for your build.
> > You can find out if you're using the recommended ICU library by
> > running example(icuSetCollate); that gives a number of warnings like
> >
> > In icuSetCollate(locale = "da_DK", case_first = "default") :
> >   ICU is not supported on this build
> >
> > in Windows.  If you don't see those, then you want to talk to the ICU
> > people.  If you do, then you'll need to look deeper to find out what
> > you're actually using.
> >
> > Duncan Murdoch
> 
> Thanks for the further guidance, Duncan. I indeed get 4 such warnings
> from example(icuSetCollate), indicating that ICU is not being used.
> 
> I have now thrown the above experiment straight at Linux, entering
> command-line commands as follows (with the results shown on the
> lines starting with "#"):
> 
> sort << EOT
> "AACD"
> "A CD"
> EOT
> # "AACD"
> # "A CD"
> 
> sort << EOT
> "ABCD"
> "A CD"
> EOT
> # "ABCD"
> # "A CD"
> 
> sort << EOT
> "ACCD"
> "A CD"
> EOT
> # "ACCD"
> # "A CD"
> 
> sort << EOT
> "ADCD"
> "A CD"
> EOT
> # "A CD"
> # "ADCD"
> 
> This clearly shows that the Linux collating order sees " " (SPACE)
> as coming between "C" and "D", as when I tried it in R.
> 
> I am now spamming my Linux contacts about it!
> 
> The result of the "locale" command in Linux includes:
>   LC_COLLATE="en_GB.UTF-8"
> 
> This happens consistently on a Debian Lenny and a Debian Etch system.
> 
> Thanks,
> Ted.
> 
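
One reading that fits every result Ted lists is that the en_GB.UTF-8
collation skips the space on its first comparison pass, so that "A CD"
is effectively compared as "ACD". This is an assumption on my part and
I have not checked it against the system's locale definitions. A minimal
shell sketch of the idea, using plain byte-order comparisons
(predicted_first is a hypothetical helper name):

```shell
# Hypothesis only (unverified): the locale drops the space on its first
# pass, so "A CD" is compared as if it were "ACD". Emulate that with
# byte-order (C locale) comparisons and see which string would come first.
predicted_first() {
  # $1 is the other string; "ACD" stands in for "A CD" without its space.
  first=$(printf '%s\n' "$1" "ACD" | LC_ALL=C sort | head -n 1)
  if [ "$first" = "ACD" ]; then
    echo "A CD"
  else
    echo "$1"
  fi
}

for other in AACD ABCD ACCD ADCD AECD; do
  echo "$other vs A CD: $(predicted_first "$other") first"
done
# AACD vs A CD: AACD first
# ABCD vs A CD: ABCD first
# ACCD vs A CD: ACCD first
# ADCD vs A CD: A CD first
# AECD vs A CD: A CD first
```

That matches the switch between "ACCD" and "ADCD" seen above, without
SPACE itself having to sit between "C" and "D".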

Maybe asking on R-sig-Debian could be of some help.

https://stat.ethz.ch/mailman/listinfo/r-sig-debian
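
In the meantime, the byte-order "C" collation can be requested for a
single command rather than for the whole session, which may sidestep the
worry about what else might break. A small sketch, assuming a POSIX
shell and the standard sort utility (sorted_c is just an illustrative
variable name):

```shell
# LC_ALL=C applies to this one sort invocation only; the rest of the
# session keeps its usual locale (e.g. en_GB.UTF-8).
sorted_c=$(printf '%s\n' '"ADCD"' '"A CD"' '"ACCD"' | LC_ALL=C sort)
printf '%s\n' "$sorted_c"
# "A CD"
# "ACCD"
# "ADCD"
```

In the C locale SPACE (0x20) sorts before all letters. Within R,
Sys.setlocale("LC_COLLATE", "C") should have the corresponding effect on
sort() for the current session.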

Hope this is helpful,

Dan

Daniel Nordlund
Bothell, WA USA


