[BioC] WARNING: difference in sorting order depending on computer platform?!?

Fri Jan 29 21:35:07 CET 2010

Hi Seth and Jenny

Not quite so... Have a look at the "Details" section of the manual page 
"Comparison" in the base package (type ?Comparison at the R prompt):

   Comparison of strings in character vectors is lexicographic within
   the strings using the collating sequence of the locale in use: see
   'locales'.  The collating sequence of locales such as 'en_US' is
   normally different from 'C' (which should use ASCII) and can be
   surprising.  Beware of making _any_ assumptions about the
   collation order: e.g. in Estonian 'Z' comes between 'S' and 'T',
   and collation is not necessarily character-by-character - in
   Danish 'aa' sorts as a single letter, after 'z'.  In Welsh 'ng'
   may or may not be a single sorting unit: if it is it follows 'g'.
   Some platforms may not respect the locale and always sort in
   numerical order of the bytes in an 8-bit locale, or in Unicode
   point order for a UTF-8 locale (and may not sort in the same order
   for the same language in different character sets).  Collation of
   non-letters (spaces, punctuation signs, hyphens, fractions and so
   on) is even more problematic.
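To see the effect on your own machine, something along the following lines might help. It is only a sketch: the helper name sort_in_c is made up for illustration, the two ids are taken from Seth's example quoted below, and I am assuming that your platform honours the "C" collation (which, per the text above, is not strictly guaranteed everywhere).

```r
# Hypothetical helper (not part of base R): sort a character vector under
# the "C" collation, then restore whatever collation was in use before.
sort_in_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

v <- c("177_at", "1773_at")
sort(v)        # result depends on the collation currently in use
sort_in_c(v)   # byte order, '3' sorts before '_': "1773_at" "177_at"
```

Changing only LC_COLLATE, and restoring it afterwards, leaves the other locale categories (character type, time format and so on) untouched.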

In Jenny's case, it is probably best not to rely on any sorting 
behaviour, and access the features based on their names.
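A minimal sketch of what I mean by accessing features by name, using made-up probe-set ids and values (none of these numbers come from Jenny's data): instead of assuming that two vectors are in the same sorted order, align one to the other with match() on the names.

```r
# Made-up values for the same three features, stored in different orders,
# as might happen if they had been sorted under different locales.
x_a <- c("177_at" = 5.1, "1773_at" = 7.3, "1007_s_at" = 6.0)
x_b <- c("1773_at" = 7.2, "1007_s_at" = 6.1, "177_at" = 5.0)

# Align x_b to the order of x_a by name, not by position or sort order.
idx <- match(names(x_a), names(x_b))
stopifnot(!any(is.na(idx)))          # every feature must be present in x_b
x_b_aligned <- x_b[idx]

# Both vectors now refer to the same feature at each position.
stopifnot(identical(names(x_a), names(x_b_aligned)))
```

The same idea carries over to matrices and data frames by indexing with their row names.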

	Best wishes
	Wolfgang

Seth Falcon wrote:
> Hi Jenny,
> 
> On 1/28/10 12:16 PM, Jenny Drnevich wrote:
>> I just found a problem/discrepancy in running R on PC vs. Unix/Linux
>> server. Maybe it's widely known, but I didn't know about it and it
>> caused me big problems.
> 
> Ouch, that's not a fun problem to run into.  The issue here is not so 
> much platform as what's called locale.  Locale settings determine such 
> things as how numbers should be displayed ("," vs "."), time format, and 
> indeed sorting of strings.
> 
> You can read up on locale on Wikipedia:
> http://en.wikipedia.org/wiki/Locale
> 
> Different locale settings impose different orderings of strings.  Once 
> you know this, the good news is that you can control the locale setting 
> that R uses and should be able to obtain stable sorting across platforms.
> 
> Here's an example run on a Windows system:
> 
> > strsplit(Sys.getlocale(), ";")
> [[1]]
> [1] "LC_COLLATE=English_United States.1252"
> [2] "LC_CTYPE=English_United States.1252"
> [3] "LC_MONETARY=English_United States.1252"
> [4] "LC_NUMERIC=C"
> [5] "LC_TIME=English_United States.1252"
>
> > v = c("177_at", "1773_at")
> > sort(v)
> [1] "177_at"  "1773_at"
> > Sys.setlocale(locale="C")
> [1] "C"
> > sort(v)
> [1] "1773_at" "177_at"
> 
> Note that not all locales are available on all systems. The "C" 
> locale is the basic common denominator, but it supports only ASCII, not 
> extended character sets.
> 
> In summary, I think you can continue to use your two different systems 
> if you do Sys.setlocale(locale="C") at the start of your script.
> 
> + seth
> 

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact