[Rd] R string comparisons may vary with platform (plain text)
peter dalgaard
pdalgd at gmail.com
Sun Nov 23 10:39:09 CET 2014
> On 23 Nov 2014, at 01:05 , Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
>
> On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch
> <murdoch.duncan at gmail.com> wrote:
>> On 22/11/2014, 2:59 PM, Stuart Ambler wrote:
>>> A colleague's R program behaved differently when I ran it, and we thought
>>> we traced it probably to different results from string comparisons as
>>> below, with different R versions. However the platforms also differed. A
>>> friend ran it on a few machines and found that the comparison behavior
>>> didn't correlate with R version, but rather with platform.
>>>
>>> I wonder if you've seen this. If it's not some setting I'm unaware of,
>>> maybe someone should look into it. Sorry I haven't taken the time to read
>>> the source code myself.
>>
>> Looks like a collation order issue. See ?Comparison.
>
> With the oddity that both platforms use what look like similar locales:
>
> LC_COLLATE=en_US.UTF-8
> LC_COLLATE=en_US.utf8
It's the sort of thing that I've tried to wrap my mind around multiple times and failed, but have a look at
http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu
which seems to be essentially the same issue, just for Postgres. If you have the stamina, also look into the Python question that it links to.
As I understand it, there are two potential reasons: either the two platforms are not using the same collation table for en_US, or at least one of them is not fully implementing the Unicode Collation Algorithm.
In general, collation is a minefield: Some languages have the same letters in different order (e.g. Estonian with Z between S and T); accented characters sort with the unaccented counterpart in some languages but as separate characters in others; some locales sort ABab, others AaBb, yet others aAbB; sometimes punctuation is ignored, sometimes not; sometimes multiple characters count as one, etc.
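To make the three case orderings mentioned above concrete, here is a minimal Python sketch. The sort keys are hand-written stand-ins chosen for illustration; they are not what any real locale or the Unicode Collation Algorithm does, and the variable names are hypothetical.

```python
# Illustrative sketch only: hand-written sort keys, not a real locale
# implementation and not the Unicode Collation Algorithm.
words = ["b", "A", "a", "B"]

# "ABab": plain code point order, so every uppercase ASCII letter
# sorts before every lowercase one (roughly what the C locale does).
codepoint_order = sorted(words)

# "AaBb": compare letters ignoring case first, then break ties by
# code point, which puts the uppercase form first.
upper_first = sorted(words, key=lambda s: (s.lower(), s))

# "aAbB": same primary comparison, but the tie-break uses the
# case-swapped string, which puts the lowercase form first.
lower_first = sorted(words, key=lambda s: (s.lower(), s.swapcase()))

print(codepoint_order)
print(upper_first)
print(lower_first)
```

The same four strings come out in three different orders depending only on the comparison rule, which is the kind of difference two platforms can show even when both report an en_US UTF-8 locale.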
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com