[Rd] R string comparisons may vary with platform (plain text)

Sun Nov 23 17:15:27 CET 2014

For many scientific applications one is really dealing with ASCII characters and
LC_COLLATE="C" ordering, even if the user is running in a non-C locale. What robust
approaches (if any) are available for writing code that sorts in a
locale-independent way? The Note in ?Sys.setlocale is not overly optimistic
about setting the locale within a session.
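
One possible approach, sketched here under assumptions that are not part of
this thread: in R >= 3.3.0 (later than the versions discussed here),
sort() and order() accept method = "radix", which is documented to collate
character vectors as in the C locale whatever LC_COLLATE is set to. The
helper name sort_c is hypothetical and used only for illustration.

```r
# Hypothetical helper (not from the thread): sort a character vector in
# C-locale order, independent of the session's LC_COLLATE setting.
# Assumes R >= 3.3.0, where sort() gained method = "radix".
sort_c <- function(x) {
  # The radix method uses C-locale collation for character vectors.
  sort(x, method = "radix")
}

x <- c("b", "A", "a", "B")
sort_c(x)  # "A" "B" "a" "b" (upper-case ASCII sorts before lower-case)
```

This avoids changing the locale within the session, which is what the Note
in ?Sys.setlocale cautions about; whether it is suitable for non-ASCII input
would need checking separately.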

Martin Morgan

On 11/23/2014 03:44 AM, Prof Brian Ripley wrote:
> On 23/11/2014 09:39, peter dalgaard wrote:
>>
>>> On 23 Nov 2014, at 01:05 , Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
>>>
>>> On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch
>>> <murdoch.duncan at gmail.com> wrote:
>>>> On 22/11/2014, 2:59 PM, Stuart Ambler wrote:
>>>>> A colleague's R program behaved differently when I ran it, and we thought
>>>>> we traced it probably to different results from string comparisons as
>>>>> below, with different R versions.  However the platforms also differed.  A
>>>>> friend ran it on a few machines and found that the comparison behavior
>>>>> didn't correlate with R version, but rather with platform.
>>>>>
>>>>> I wonder if you've seen this.  If it's not some setting I'm unaware of,
>>>>> maybe someone should look into it.  Sorry I haven't taken the time to read
>>>>> the source code myself.
>>>>
>>>> Looks like a collation order issue.  See ?Comparison.
>>>
>>> With the oddity that both platforms use what look like similar locales:
>>>
>>> LC_COLLATE=en_US.UTF-8
>>> LC_COLLATE=en_US.utf8
>>
>> It's the sort of thing that I've tried to wrap my mind around multiple times
>> and failed, but have a look at
>>
>> http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu
>>
>>
>> which seems to be essentially the same issue, just for Postgres. If you have
>> the stamina, also look into the python question that it links to.
>>
>> As I understand it, there are two potential reasons: Either the two platforms
>> are not using the same collation table for en_US, or at least one of them is
>> not fully implementing the Unicode Collation Algorithm.
>
> And I have seen both with R.  At the very least, check if ICU is being used
> (capabilities("ICU") in current R, maybe not in some of the obsolete versions
> seen in this thread).
>
> As a further possibility, there are choices in the UCA (in R, see
> ?icuSetCollate) and ICU can be compiled with different default choices.  It is
> not clear to me what (if any) difference ICU versions make, but in R-devel
> extSoftVersion() reports that.
>
>
>> In general, collation is a minefield: Some languages have the same letters in
>> different order (e.g. Estonian with Z between S and T); accented characters
>> sort with the unaccented counterpart in some languages but as separate
>> characters in others; some locales sort ABab, others AaBb, yet others aAbB;
>> sometimes punctuation is ignored, sometimes not; sometimes multiple characters
>> count as one, etc.
>>
> As ?Comparison has long said.
>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793