[R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7

Søren Højsgaard Soren.Hojsgaard at agrsci.dk
Wed Feb 2 13:56:28 CET 2011


Not sure if I qualify as being knowledgeable, but... 

You write

> I recall that 'aa' used to sort at the end of the alphabet in Danish 
> telephone books, so it seems the sort used on Windows thinks so too. See 
> ?Comparison for some further details.  What I don't understand is that 
> someone resident in Denmark finds this strange ....

Yes, I can confirm that "aa" sorts at the end of the Danish alphabet (it is an old way of writing the letter that is written "å" in modern Danish).
But what should one do if one wants "aa" to mean "an a followed by another a", and not "aa" (= "å"), when calling sort?
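
One possible approach, as a minimal sketch only: ?Comparison says that string comparison depends on the collation locale, so temporarily switching LC_COLLATE to the "C" locale should give a plain character-by-character ordering. The variable names (old_collate, x, sorted_c) are made up for illustration, and it is an assumption that the locale string returned by Sys.getlocale() can be set again afterwards.

```r
# Remember the current collation locale so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# In the "C" locale strings are compared character by character, so "aa"
# is treated as two letters "a" and sorts before "ff".
invisible(Sys.setlocale("LC_COLLATE", "C"))

x <- c("ff", "aa")
sorted_c <- sort(x)
print(sorted_c)

# Switch back to the original (for example Danish) collation.
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

Note that the "C" locale also sorts all upper-case ASCII letters before lower-case ones, which may or may not be wanted.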

Regards
Søren Højsgaard

-----Original message-----
From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] 
Sent: 2 February 2011 13:21
To: Søren Højsgaard
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7

'Strange' to have no response on this.  Can a knowledgeable Danish 
writer please confirm that this is how the OSes are supposed to handle 
Danish collation?

On Mon, 24 Jan 2011, Prof Brian Ripley wrote:

> On Mon, 24 Jan 2011, Søren Højsgaard wrote:
>
>> Dear list,
>> 
>> Please consider the following call of sort
>> 
>>> sort(c("a","f"))
>> [1] "a" "f"
>>> sort(c("f","a"))
>> [1] "a" "f"
>>> 
>>> sort(c("aa","ff"))
>> [1] "ff" "aa"
>>> sort(c("ff","aa"))
>> [1] "ff" "aa"
>> The last two results look strange to me. Is that a bug???
>
> It seems that you and your OS disagree about Danish, and I'm in no position 
> to know which is correct.  But this is not an R issue: the sorting is done by 
> OS services.
>
>> The result seems to come from calls to order:
>> 
>>> order(c("a","f"))
>> [1] 1 2
>>> order(c("f","a"))
>> [1] 2 1
>>> 
>>> order(c("aa","ff"))
>> [1] 2 1
>>> order(c("ff","aa"))
>> [1] 1 2
>
>> I get the same results on R.2.12.1, R.2.11.1 and R.2.13.0 on Windows 7. 
>> However on Linux, I get the "right answer" (the answer I expected). From 
>> the help pages I get the impression that there might be an issue about 
>> locale, but I didn't understand the details.
>> 
>> Can anyone tell me what goes on here, please?
>
> I recall that 'aa' used to sort at the end of the alphabet in Danish 
> telephone books, so it seems the sort used on Windows thinks so too. See 
> ?Comparison for some further details.  What I don't understand is that 
> someone resident in Denmark finds this strange ....
>
> I get exactly the same in a Danish locale on Mac OS X, for example:
>
>> sort(c("aa","ff"))
> [1] "ff" "aa"
>
> and also on my Linux box (Fedora 14 with LC_COLLATE=da_DK.utf8)
>
>> sort(c("aa","ff"))
> [1] "ff" "aa"
>
> en_DK is not a Danish locale (it is English in Denmark).  If you want an 
> English sort, try an English locale for LC_COLLATE (there may well be 
> several, hence 'an').
>
>> 
>> Regards
>> Søren
>> 
>>> sessionInfo()
>> R version 2.12.1 Patched (2010-12-27 r53883)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>> locale:
>> [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252
>> [3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
>> [5] LC_TIME=Danish_Denmark.1252
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> other attached packages:
>> [1] SHDtools_1.0
>> 
>> 
>>> sessionInfo()
>> R version 2.12.1 (2010-12-16)
>> Platform: i686-pc-linux-gnu (32-bit)
>> locale:
>> [1] LC_CTYPE=en_DK.utf8       LC_NUMERIC=C
>> [3] LC_TIME=en_DK.utf8        LC_COLLATE=en_DK.utf8
>> [5] LC_MONETARY=C             LC_MESSAGES=en_DK.utf8
>> [7] LC_PAPER=en_DK.utf8       LC_NAME=C
>> [9] LC_ADDRESS=C              LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_DK.utf8 LC_IDENTIFICATION=C
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595



