[R] Strange result from sort: sort(c("aa", "ff")) gives "ff" "aa" with R.2.12.1 on windows 7

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Jan 24 23:18:20 CET 2011


On Mon, 24 Jan 2011, Søren Højsgaard wrote:

> Dear list,
>
> Please consider the following call of sort
>
>> sort(c("a","f"))
> [1] "a" "f"
>> sort(c("f","a"))
> [1] "a" "f"
>>
>> sort(c("aa","ff"))
> [1] "ff" "aa"
>> sort(c("ff","aa"))
> [1] "ff" "aa"
> The last two results look strange to me. Is that a bug???

It seems that you and your OS disagree about Danish, and I'm in no 
position to know which is correct.  But this is not an R issue: the 
sorting is done by OS services.
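
Which collation locale the OS services are using can be checked from within R. A minimal sketch using only base R (the variable name `collate` is just illustrative, and the value printed depends on the platform and session):

```r
# Query the collation locale that sort() and order() use for character vectors.
collate <- Sys.getlocale("LC_COLLATE")
print(collate)

# The comparison operators follow the same collation, so the result of this
# test depends on the locale reported above.
print("aa" < "ff")
```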

> The result seems to come from calls to order:
>
>> order(c("a","f"))
> [1] 1 2
>> order(c("f","a"))
> [1] 2 1
>>
>> order(c("aa","ff"))
> [1] 2 1
>> order(c("ff","aa"))
> [1] 1 2

> I get the same results on R.2.12.1, R.2.11.1 and R.2.13.0 on Windows 
> 7. However on Linux, I get the "right answer" (the answer I 
> expected). From the help pages I get the impression that there might 
> be an issue about locale, but I didn't understand the details.
>
> Can anyone tell me what goes on here, please

I recall that 'aa' used to sort at the end of the alphabet in Danish 
telephone books, so it seems the sort used on Windows thinks so too. 
See ?Comparison for some further details.  What I don't understand is 
that someone resident in Denmark finds this strange ....

I get exactly the same in a Danish locale on Mac OS X, for example:

> sort(c("aa","ff"))
[1] "ff" "aa"

and also on my Linux box (Fedora 14 with LC_COLLATE=da_DK.utf8)

> sort(c("aa","ff"))
[1] "ff" "aa"

en_DK is not a Danish locale (it is English in Denmark).  If you want 
an English sort, try an English locale for LC_COLLATE (there may well 
be several, hence 'an').
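
As a rough sketch of changing the collation for a session: names of English locales differ between operating systems, so the example below assumes that a plain byte-order sort is acceptable and uses the "C" locale, which is accepted on all platforms. The variable names are illustrative only.

```r
x <- c("ff", "aa")

# Remember the current collation locale so that it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")

# "C" collation compares strings by character code rather than by
# language-specific rules, so 'aa' is not treated as a single Danish letter.
Sys.setlocale("LC_COLLATE", "C")
sorted_c <- sort(x)
print(sorted_c)   # [1] "aa" "ff"

# Restore the original collation locale.
Sys.setlocale("LC_COLLATE", old)

# Assumption: R >= 3.3.0, which is later than the versions in this thread.
# There, method = "radix" sorts character vectors in C-locale order
# without changing the session locale.
sorted_radix <- sort(x, method = "radix")
print(sorted_radix)
```

Note that "C" collation places all upper-case ASCII letters before lower-case ones, which may or may not be what is wanted.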

>
> Regards
> Søren
>
>> sessionInfo()
> R version 2.12.1 Patched (2010-12-27 r53883)
> Platform: i386-pc-mingw32/i386 (32-bit)
> locale:
> [1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252
> [3] LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C
> [5] LC_TIME=Danish_Denmark.1252
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] SHDtools_1.0
>
>
>> sessionInfo()
> R version 2.12.1 (2010-12-16)
> Platform: i686-pc-linux-gnu (32-bit)
> locale:
> [1] LC_CTYPE=en_DK.utf8       LC_NUMERIC=C
> [3] LC_TIME=en_DK.utf8        LC_COLLATE=en_DK.utf8
> [5] LC_MONETARY=C             LC_MESSAGES=en_DK.utf8
> [7] LC_PAPER=en_DK.utf8       LC_NAME=C
> [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_DK.utf8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

