[Rd] R string comparisons may vary with platform (plain text)

peter dalgaard pdalgd at gmail.com
Sun Nov 23 10:39:09 CET 2014


> On 23 Nov 2014, at 01:05 , Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
> 
> On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
>> On 22/11/2014, 2:59 PM, Stuart Ambler wrote:
>>> A colleague's R program behaved differently when I ran it, and we thought
>>> we traced it probably to different results from string comparisons as
>>> below, with different R versions.  However the platforms also differed.  A
>>> friend ran it on a few machines and found that the comparison behavior
>>> didn't correlate with R version, but rather with platform.
>>> 
>>> I wonder if you've seen this.  If it's not some setting I'm unaware of,
>>> maybe someone should look into it.  Sorry I haven't taken the time to read
>>> the source code myself.
>> 
>> Looks like a collation order issue.  See ?Comparison.
> 
> With the oddity that both platforms use what look like similar locales:
> 
> LC_COLLATE=en_US.UTF-8
> LC_COLLATE=en_US.utf8

It's the sort of thing that I've tried to wrap my mind around multiple times and failed, but have a look at

http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu

which seems to be essentially the same issue, just for Postgres. If you have the stamina, also look into the Python question that it links to.

As I understand it, there are two potential reasons: Either the two platforms are not using the same collation table for en_US, or at least one of them is not fully implementing the Unicode Collation Algorithm.
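
One way to probe this on a given machine is to sort the same strings under the locale in question and under the "C" locale, which compares by character code and should behave the same everywhere. The following is a minimal sketch in Python rather than R; the helper name sort_in_locale and the sample strings are made up for illustration, and the en_US.UTF-8 result is deliberately not shown because it is exactly the part that may differ between platforms.

```python
import locale


def sort_in_locale(words, name):
    """Sort words with the collation rules of locale `name`.

    The previous LC_COLLATE setting is restored afterwards.
    Raises locale.Error if the locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, no change
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        # strxfrm maps each string to a key whose ordering matches strcoll
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


words = ["b", "A", "a", "B"]

# "C" collation is plain character-code order: upper case before lower case.
c_sorted = sort_in_locale(words, "C")
print("C:", c_sorted)

# The locale name is an assumption; it may be missing or spelled differently
# (e.g. en_US.utf8) on a given system.
try:
    print("en_US.UTF-8:", sort_in_locale(words, "en_US.UTF-8"))
except locale.Error:
    print("en_US.UTF-8 is not available on this system")
```

If the second line differs between two machines while the first does not, the difference comes from the platforms' collation data or implementation rather than from the program doing the sorting.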

In general, collation is a minefield: Some languages have the same letters in different order (e.g. Estonian with Z between S and T); accented characters sort with the unaccented counterpart in some languages but as separate characters in others; some locales sort ABab, others AaBb, yet others aAbB; sometimes punctuation is ignored, sometimes not; sometimes multiple characters count as one, etc.
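
To make the ABab / AaBb / aAbB distinction concrete, here is a small Python sketch that imitates the three orderings with hand-written sort keys. These keys are illustrative stand-ins only; they are not how any real locale or the Unicode Collation Algorithm is implemented, and the variable names are invented for the example.

```python
words = ["b", "A", "a", "B"]

# ABab: plain character-code order, all upper case before all lower case.
code_order = sorted(words)

# AaBb: compare letters ignoring case first, then put upper case first
# among strings that differ only in case.
upper_first = sorted(words, key=lambda s: (s.casefold(), s))

# aAbB: same primary comparison, but lower case first on ties.
# swapcase() flips the case so that lower-case letters get the smaller codes.
lower_first = sorted(words, key=lambda s: (s.casefold(), s.swapcase()))

print(code_order)   # ['A', 'B', 'a', 'b']
print(upper_first)  # ['A', 'a', 'B', 'b']
print(lower_first)  # ['a', 'A', 'b', 'B']
```

The same four strings come out in three different orders, which is the kind of discrepancy that can make a program relying on string comparison behave differently from one platform to the next.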

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
