[R] a question of alphabetical order
[Ricardo Rodriguez] Your XEN ICT Team
webmaster at xen.net
Tue Apr 15 23:31:00 CEST 2008
Tricky question, this order issue :-(
Thank you so much for the detailed explanation.
So must I conclude that I will have to live with this ASCII order
while working in Mac OS X 10.5.2 until Apple fixes this bug?
You mentioned es_ES.ISO8859-15 on the Mac. Will it do the trick? Yes, as
far as I understand. But as I am using R.app, the locale is set by the
system preferences. Truly, I am rather confused by this issue.
Could I force es_ES.ISO8859-15 as the locale on the Mac?
Sorry if I add another question here: why does Excel order the list
correctly? I guess it doesn't rely on the Mac settings.
As an R newbie I must admit that this behaviour, and others, are
really hard to deal with. But I've seen, and even done, so many
wonderful things with R that it is worth all the effort. Thanks for your help.
All the best,
Ricardo
Prof Brian Ripley wrote:
> This is a known Mac OS X bug, nothing to do with R which uses the
> system functions (strcoll/wcscoll) for such things.
>
> If you look at the help for sort, it refers you to ?Comparison. Which
> says
>
> Comparison of strings in character vectors is lexicographic within
> the strings using the collating sequence of the locale in use: see
> 'locales'. The collating sequence of locales such as 'en_US' is
> normally different from 'C' (which should use ASCII) and can be
> surprising. Beware of making _any_ assumptions about the
> collation order: e.g. in Estonian 'Z' comes between 'S' and 'T',
> and collation is not necessarily character-by-character - in
> Danish 'aa' sorts as a single letter, after 'z'. Some platforms
> may not respect the locale and always sort in ASCII. (String
> comparison is always for the part of the string up to the first
> nul if there are embedded nuls.)
>
> Mac OS X (more specifically, 10.5.2 on i386) is one of those
> disrespectful platforms.
>
> > x <- intToUtf8(c(32:127, 160:255), multiple=TRUE)
> > order(x)
> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
> [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
> [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
> [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
> [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
> [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
> [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
> [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
> [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
> [181] 181 182 183 184 185 186 187 188 189 190 191 192
>
> which is quite different from Linux or Solaris. This may not come
> out, but paste(sort(x), collapse="") includes
>
> aAªáÁàÀâÂåÅäÄãÃæÆbBcCçÇdDeEéÉèÈêÊëË
>
> on Linux in es_ES.utf8 .
>
> Platforms are a lot worse at sorting in UTF-8 than 8-bit encodings.
> Mac OS X has es_ES.ISO8859-15, and that does do a reasonable job
> including aáàâåäãæ .
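
For anyone reading this in the archive, here is a minimal sketch of how
the collation locale might be forced from inside R rather than through
the system preferences. It assumes the es_ES.ISO8859-15 locale mentioned
above is installed; the sample vector is made up for illustration, and
whether the resulting order is the desired one still depends on the
platform's strcoll.

```r
# Remember the current collation setting so it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")

# Try to switch only the collation category. Sys.setlocale() returns ""
# (with a warning) when the requested locale is not available.
res <- suppressWarnings(Sys.setlocale("LC_COLLATE", "es_ES.ISO8859-15"))

# Hypothetical sample data; "\u00e1" is a-acute, written as an escape so
# the script does not depend on the encoding of the source file.
x <- c("b", "\u00e1", "a", "B", "A")

if (nzchar(res)) {
  print(sort(x))   # sorted using the newly selected collating sequence
} else {
  message("es_ES.ISO8859-15 is not available here; collation unchanged")
}

# Put the original collation setting back.
invisible(Sys.setlocale("LC_COLLATE", old))
```

Only LC_COLLATE is changed here, so the rest of the session (character
type, messages, number formats) keeps whatever the system set.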
--
Ricardo Rodríguez
Your XEN ICT Team