[BioC] WARNING: difference in sorting order depending on computer platform?!?
Wolfgang Huber
whuber at embl.de
Fri Jan 29 21:35:07 CET 2010
Hi Seth and Jenny,
Not quite so. Have a look at the "Details" section of the manual page
"Comparison" in the base package (type ?Comparison):
Comparison of strings in character vectors is lexicographic within
the strings using the collating sequence of the locale in use: see
'locales'. The collating sequence of locales such as 'en_US' is
normally different from 'C' (which should use ASCII) and can be
surprising. Beware of making _any_ assumptions about the
collation order: e.g. in Estonian 'Z' comes between 'S' and 'T',
and collation is not necessarily character-by-character - in
Danish 'aa' sorts as a single letter, after 'z'. In Welsh 'ng'
may or may not be a single sorting unit: if it is it follows 'g'.
Some platforms may not respect the locale and always sort in
numerical order of the bytes in an 8-bit locale, or in Unicode
point order for a UTF-8 locale (and may not sort in the same order
for the same language in different character sets). Collation of
non-letters (spaces, punctuation signs, hyphens, fractions and so
on) is even more problematic.
In Jenny's case, it is probably best not to rely on any sorting
behaviour at all, and instead to access the features by their names.
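A minimal sketch of what I mean by name-based access. The object names
(exprs1, exprs2) and the numeric values are made up for illustration; only
the two probe IDs are taken from Seth's example below.

```r
# Hypothetical data: the same two probes, stored in the order that sort()
# produced under each locale in Seth's example.
exprs1 <- c("177_at" = 5.1, "1773_at" = 7.3)   # order from the English locale
exprs2 <- c("1773_at" = 7.2, "177_at" = 5.0)   # order from the "C" locale

# Positional matching would silently pair different probes.
positional_ok <- identical(names(exprs1), names(exprs2))

# Name-based matching pairs each probe with itself, whatever the order.
ids <- names(exprs1)
aligned <- exprs2[ids]
stopifnot(identical(names(aligned), ids))

# match() does the same job when the IDs live in a column rather than in names.
idx <- match(ids, names(exprs2))
```

With this approach the result does not depend on the collation order on either machine, because nothing is ever assumed about position.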
Best wishes
Wolfgang
Seth Falcon wrote:
> Hi Jenny,
>
> On 1/28/10 12:16 PM, Jenny Drnevich wrote:
>> I just found a problem/discrepancy in running R on PC vs. Unix/Linux
>> server. Maybe it's widely known, but I didn't know about it and it
>> caused me big problems.
>
> Ouch, that's not a fun problem to run into. The issue here is not so
> much platform as what's called locale. Locale settings determine such
> things as how numbers should be displayed ("," vs "."), time format, and
> indeed sorting of strings.
>
> You can read up on locale on Wikipedia:
> http://en.wikipedia.org/wiki/Locale
>
> Different locale settings impose different orderings of strings. Once
> you know this, the good news is that you can control the locale setting
> that R uses and should be able to obtain stable sorting across platforms.
>
> Here's an example run on a Windows system:
>
>>> strsplit(Sys.getlocale(), ";")
>> [[1]]
>> [1] "LC_COLLATE=English_United States.1252"
>> [2] "LC_CTYPE=English_United States.1252"
>> [3] "LC_MONETARY=English_United States.1252"
>> [4] "LC_NUMERIC=C"
>> [5] "LC_TIME=English_United States.1252"
>>
>>> v = c("177_at", "1773_at")
>>> sort(v)
>> [1] "177_at" "1773_at"
>>> Sys.setlocale(locale="C")
>> [1] "C"
>>> sort(v)
>> [1] "1773_at" "177_at"
>
> Note that not all locales are available on all systems. The "C"
> locale is the basic common denominator, but it supports only ASCII,
> not extended character sets.
>
> In summary, I think you can continue to use your two different systems
> if you do Sys.setlocale(locale="C") at the start of your script.
>
> + seth
>
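If sorting really is needed, a narrower variant of Seth's suggestion is to
change only the collation category and restore it afterwards. The sketch
below assumes nothing beyond base R; the variable names are mine. The
second part relies on the "radix" method of sort(), which to my knowledge
exists only in R 3.3.0 and later and is documented to sort character
vectors in C-locale order.

```r
v <- c("177_at", "1773_at")

# Change only LC_COLLATE, leaving the other locale categories untouched,
# and put the previous value back once the sort is done.
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
sorted_c <- sort(v)          # "1773_at" "177_at", since "3" precedes "_" in ASCII
invisible(Sys.setlocale("LC_COLLATE", old))

# Assumption: R >= 3.3.0. The radix method ignores the session locale
# for character vectors and always uses C-locale order.
sorted_radix <- sort(v, method = "radix")
```

Either way the order is the same on every platform, but it is byte order, not an order a human reader of a given language would expect.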
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact