[Rd] Bug in rank with utf8?
peter dalgaard
pdalgd at gmail.com
Thu Aug 13 16:19:15 CEST 2015
Yes, collation is a strange thing, and?
Collation order will depend on locale settings, and there are quite a few cases where the collation order of two items is not defined.
To add to the confusion, on OS X Mavericks, I see:
> x <- "\u0663"
> y <- 3
>
> x == y
[1] FALSE
> rank(c(x, y))
[1] 2 1
> x
[1] "٣"
> x == y
[1] FALSE
> x > y
[1] TRUE
> x < y
[1] FALSE
> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
> Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"
Notice the differences from en_US.UTF8 (sans hyphen) on your system: here the two strings are strictly ordered (x > y), so rank() gives 2 1 instead of the tie 1.5 1.5.
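
For anyone who wants an ordering that does not depend on the locale, here is a minimal sketch of two options. The first is the LC_COLLATE="C" switch already tried further down in the quoted message; the second, ranking by Unicode code point, is not something proposed in this thread. It assumes every string is exactly one character, so that utf8ToInt() returns a single code point. The names old, r_c, cp and r_cp are just for illustration.

```r
x <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y <- "3"
xy <- c(x, y)

# Option 1: temporarily switch collation to the C locale, then restore it.
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
r_c <- rank(xy)
invisible(Sys.setlocale("LC_COLLATE", old))

# Option 2: rank by Unicode code point, ignoring LC_COLLATE altogether.
# Assumption: each element of xy is a single character.
cp <- vapply(xy, utf8ToInt, integer(1), USE.NAMES = FALSE)
r_cp <- rank(cp)
```

The code points are 1635 for "\u0663" and 51 for "3", so r_cp orders "3" first regardless of the collation locale. Whether that is the ordering you actually want is a separate question.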
-pd
On 13 Aug 2015, at 16:01 , John McKown <john.archie.mckown at gmail.com> wrote:
> 2015-08-13 8:39 GMT-05:00 Hadley Wickham <h.wickham at gmail.com>:
>
>> x <- "\u0663"
>> y <- 3
>>
>> x == y
>> # FALSE
>> rank(c(x, y))
>> # c(1.5, 1.5)
>>
>
> also interesting, and confusing to me:
>
>> x == y
> [1] FALSE
>> x > y
> [1] FALSE
>> x < y
> [1] FALSE
>>
>
> With some slight changes:
>
>> x <- "\u0663"
>> y <- "3"
>> xy <- c(x,y)
>> rank(xy);
> [1] 1.5 1.5
>> Sys.getlocale();
> [1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
>> Sys.setlocale(category="LC_COLLATE", locale="C");
> [1] "C"
>> rank(xy);
> [1] 2 1
>>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com