[R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7

Wed Feb 2 14:16:17 CET 2011

On Wed, 2 Feb 2011, Søren Højsgaard wrote:

> Not sure if I qualify as being knowledgeable, but...
>
> You write
>
>> I recall that 'aa' used to sort at the end of the alphabet in Danish
>> telephone books, so it seems the sort used on Windows thinks so too. See
>> ?Comparison for some further details.  What I don't understand is that
>> someone resident in Denmark finds this strange ....
>
> Yes, I can confirm that "aa" resides at the end of the Danish 
> alphabet (as an old way of writing the letter which in modern 
> writing is "å"). But what should one then do if one wants "aa" to 
> mean "an a followed by another a" and not "aa" (="å") when calling 
> sort??

Set the collation locale with Sys.setlocale("LC_COLLATE", <locale>), 
choosing <locale> appropriately (sorry, the names are very OS-specific, 
but 'C' and 'en' or 'English' probably work on Windows). 
On platforms using ICU (most, but not Windows), see also 
?icuSetCollate for further ways to tweak collation: that has "aarhus" 
in its examples.
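
As a minimal sketch of the first suggestion, assuming the "C" locale is 
acceptable for the task (it compares strings byte by byte, so upper-case 
letters sort before lower-case ones and non-ASCII letters such as "å" are 
not placed by Danish rules), one could switch the collation locale only 
for the duration of the sort and then restore it. The variable names x, 
old and res are illustrative only:

```r
x <- c("ff", "aa")

# Remember the current collation locale so it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")

# "C" collation treats "aa" as two separate letters 'a', not as "å".
invisible(Sys.setlocale("LC_COLLATE", "C"))
res <- sort(x)
print(res)
# [1] "aa" "ff"

# Put the previous collation locale back.
invisible(Sys.setlocale("LC_COLLATE", old))
```

Restoring the old value matters because the collation locale also 
affects order() and comparisons such as "aa" < "ff" for the rest of the 
session.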

>
> Regards
> Søren Højsgaard
>
> -----Original message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: 2 February 2011 13:21
> To: Søren Højsgaard
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7
>
> 'Strange' to have no response on this.  Can a knowledgeable Danish
> writer please confirm that this is how the OSes are supposed to handle
> Danish collation?
>
> On Mon, 24 Jan 2011, Prof Brian Ripley wrote:
>
>> On Mon, 24 Jan 2011, Søren Højsgaard wrote:
>>
>>> Dear list,
>>>
>>> Please consider the following call of sort
>>>
>>>> sort(c("a","f"))
>>> [1] "a" "f"
>>>> sort(c("f","a"))
>>> [1] "a" "f"
>>>>
>>>> sort(c("aa","ff"))
>>> [1] "ff" "aa"
>>>> sort(c("ff","aa"))
>>> [1] "ff" "aa"
>>> The last two results look strange to me. Is that a bug???
>>
>> It seems that you and your OS disagree about Danish, and I'm in no position
>> to know which is correct.  But this is not an R issue: the sorting is done by
>> OS services.
>>
>>> The result seems to come from calls to order:
>>>
>>>> order(c("a","f"))
>>> [1] 1 2
>>>> order(c("f","a"))
>>> [1] 2 1
>>>>
>>>> order(c("aa","ff"))
>>> [1] 2 1
>>>> order(c("ff","aa"))
>>> [1] 1 2
>>
>>> I get the same results on R.2.12.1, R.2.11.1 and R.2.13.0 on Windows 7.
>>> However on Linux, I get the "right answer" (the answer I expected). From
>>> the help pages I get the impression that there might be an issue about
>>> locale, but I didn't understand the details.
>>>
>>> Can anyone tell me what is going on here, please?
>>
>> I recall that 'aa' used to sort at the end of the alphabet in Danish
>> telephone books, so it seems the sort used on Windows thinks so too. See
>> ?Comparison for some further details.  What I don't understand is that
>> someone resident in Denmark finds this strange ....
>>
>> I get exactly the same in a Danish locale on Mac OS X, for example:
>>
>>> sort(c("aa","ff"))
>> [1] "ff" "aa"
>>
>> and also on my Linux box (Fedora 14 with LC_COLLATE=da_DK.utf8)
>>
>>> sort(c("aa","ff"))
>> [1] "ff" "aa"
>>
>> en_DK is not a Danish locale (it is English in Denmark).  If you want an
>> English sort, try an English locale for LC_COLLATE (there may well be
>> several, hence 'an').
>>
>>>
>>> Regards
>>> Søren
>>>
>>>> sessionInfo()
>>> R version 2.12.1 Patched (2010-12-27 r53883)
>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>> locale:
>>> [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252
>>> [3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
>>> [5] LC_TIME=Danish_Denmark.1252
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> other attached packages:
>>> [1] SHDtools_1.0
>>>
>>>
>>>> sessionInfo()
>>> R version 2.12.1 (2010-12-16)
>>> Platform: i686-pc-linux-gnu (32-bit)
>>> locale:
>>> [1] LC_CTYPE=en_DK.utf8       LC_NUMERIC=C
>>> [3] LC_TIME=en_DK.utf8        LC_COLLATE=en_DK.utf8
>>> [5] LC_MONETARY=C             LC_MESSAGES=en_DK.utf8
>>> [7] LC_PAPER=en_DK.utf8       LC_NAME=C
>>> [9] LC_ADDRESS=C              LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_DK.utf8 LC_IDENTIFICATION=C
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>>
>>
>> --
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595