[R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Feb 2 14:16:17 CET 2011
On Wed, 2 Feb 2011, Søren Højsgaard wrote:
> Not sure if I qualify as being knowledgeable, but...
>
> You write
>
>> I recall that 'aa' used to sort at the end of the alphabet in Danish
>> telephone books, so it seems the sort used on Windows thinks so too. See
>> ?Comparison for some further details. What I don't understand is that
>> someone resident in Denmark finds this strange ....
>
> Yes, I can confirm that "aa" resides at the end of the Danish
> alphabet (as an old way of writing the letter which in modern
> writing is "å"). But what should one then do if one wants "aa" to
> mean "an a followed by another a" and not "aa" (="å") when calling
> sort??
Set the collation locale appropriately with
Sys.setlocale("LC_COLLATE", <locale>) (sorry, the locale name is very
OS-specific, but 'C' and 'en' or 'English' probably work on Windows).
On platforms using ICU (most, but not Windows), see also
?icuSetCollate for further ways to tweak collation: that has "aarhus"
in its examples.
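
As a minimal sketch of the first suggestion (not part of the original
exchange, and assuming only that the "C" locale is accepted, which it
should be on every platform), one can switch the collation locale
temporarily, sort, and then restore the previous setting. The variable
names x, old_collate and sorted_c are purely illustrative.

```r
x <- c("ff", "aa")

# Remember the current collation locale so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# In the "C" locale strings are compared character by character,
# so "aa" is treated as an 'a' followed by another 'a'.
Sys.setlocale("LC_COLLATE", "C")
sorted_c <- sort(x)
print(sorted_c)
# [1] "aa" "ff"

# Put the original collation locale back.
Sys.setlocale("LC_COLLATE", old_collate)
```

Restoring the old value matters because the collation locale also
affects order() and comparisons such as "aa" < "ff" for the rest of
the session.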
>
> Regards
> Søren Højsgaard
>
> -----Original message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: 2 February 2011 13:21
> To: Søren Højsgaard
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7
>
> 'Strange' to have no response on this. Can a knowledgeable Danish
> writer please confirm that this is how the OSes are supposed to handle
> Danish collation?
>
> On Mon, 24 Jan 2011, Prof Brian Ripley wrote:
>
>> On Mon, 24 Jan 2011, Søren Højsgaard wrote:
>>
>>> Dear list,
>>>
>>> Please consider the following call of sort
>>>
>>>> sort(c("a","f"))
>>> [1] "a" "f"
>>>> sort(c("f","a"))
>>> [1] "a" "f"
>>>>
>>>> sort(c("aa","ff"))
>>> [1] "ff" "aa"
>>>> sort(c("ff","aa"))
>>> [1] "ff" "aa"
>>> The last two results look strange to me. Is that a bug???
>>
>> It seems that you and your OS disagree about Danish, and I'm in no position
>> to know which is correct. But this is not an R issue: the sorting is done by
>> OS services.
>>
>>> The result seems to come from calls to order:
>>>
>>>> order(c("a","f"))
>>> [1] 1 2
>>>> order(c("f","a"))
>>> [1] 2 1
>>>>
>>>> order(c("aa","ff"))
>>> [1] 2 1
>>>> order(c("ff","aa"))
>>> [1] 1 2
>>
>>> I get the same results on R.2.12.1, R.2.11.1 and R.2.13.0 on Windows 7.
>>> However on Linux, I get the "right answer" (the answer I expected). From
>>> the help pages I get the impression that there might be an issue about
>>> locale, but I didn't understand the details.
>>>
>>> Can anyone tell me what goes on here, please
>>
>> I recall that 'aa' used to sort at the end of the alphabet in Danish
>> telephone books, so it seems the sort used on Windows thinks so too. See
>> ?Comparison for some further details. What I don't understand is that
>> someone resident in Denmark finds this strange ....
>>
>> I get exactly the same in a Danish locale on Mac OS X, for example:
>>
>>> sort(c("aa","ff"))
>> [1] "ff" "aa"
>>
>> and also on my Linux box (Fedora 14 with LC_COLLATE=da_DK.utf8)
>>
>>> sort(c("aa","ff"))
>> [1] "ff" "aa"
>>
>> en_DK is not a Danish locale (it is English in Denmark). If you want an
>> English sort, try an English locale for LC_COLLATE (there may well be
>> several, hence 'an').
>>
>>>
>>> Regards
>>> Søren
>>>
>>>> sessionInfo()
>>> R version 2.12.1 Patched (2010-12-27 r53883)
>>> Platform: i386-pc-mingw32/i386 (32-bit)
>>> locale:
>>> [1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252
>>> [3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
>>> [5] LC_TIME=Danish_Denmark.1252
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>> other attached packages:
>>> [1] SHDtools_1.0
>>>
>>>
>>>> sessionInfo()
>>> R version 2.12.1 (2010-12-16)
>>> Platform: i686-pc-linux-gnu (32-bit)
>>> locale:
>>> [1] LC_CTYPE=en_DK.utf8 LC_NUMERIC=C
>>> [3] LC_TIME=en_DK.utf8 LC_COLLATE=en_DK.utf8
>>> [5] LC_MONETARY=C LC_MESSAGES=en_DK.utf8
>>> [7] LC_PAPER=en_DK.utf8 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_DK.utf8 LC_IDENTIFICATION=C
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> --
>> Brian D. Ripley, ripley at stats.ox.ac.uk
>> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford, Tel: +44 1865 272861 (self)
>> 1 South Parks Road, +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK Fax: +44 1865 272595
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595