[Rd] bug in rank(), order(), is.unsorted() on character vector
peter dalgaard
pdalgd at gmail.com
Wed Dec 7 19:30:10 CET 2011
On Dec 7, 2011, at 15:48 , Joris Meys wrote:
> @Barry: regardless of whether '_' comes before or after '1', it
> should be consistent. Adding an 'a' shouldn't shift '_' from before
> '1' to between '1' and '2'; that's clearly an error. The help files
> say nothing about that. The only thing I can imagine is
> that '_' gets ignored (in which case 19a would rank before 1a).
As far as I remember, that is exactly the case. In some locales (and not even consistently across different OS versions of the "same" locale), certain characters are ignored for collation. With that in mind, what we see is really no stranger than "a" < "ab" but "ac" > "abc".
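To make the "ignored character" reading concrete, here is a rough sketch in R. It is only an emulation under a stated assumption: '_' is simply deleted before comparing, and the remaining characters are compared in byte order. Real locale definitions use multi-level weights, so this is not what the OS collation routine actually does. rank_bytes() is a hypothetical helper name, not part of R.

```r
# Hypothetical helper: rank strings in plain byte order (C locale),
# restoring the caller's LC_COLLATE setting on exit.
rank_bytes <- function(v) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old))
  Sys.setlocale("LC_COLLATE", "C")
  rank(v)
}

x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# Assumption: emulate "'_' is ignored" by deleting it before comparing.
key_x  <- gsub("_", "", x,  fixed = TRUE)   # "1"  "19"  "29"
key_xa <- gsub("_", "", xa, fixed = TRUE)   # "1a" "19a" "29a"

r_x  <- rank_bytes(key_x)    # 1 2 3
r_xa <- rank_bytes(key_xa)   # 2 1 3, because "19a" < "1a" ('9' sorts before 'a')
```

Under that assumption the ranks come out as 1 2 3 and 2 1 3, the same pattern reported from the en_CA.UTF-8 session quoted below.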
R just uses what the OS supplies, so if you want to use words like "inconsistent" or "error", please direct them at those who define the locales. (And be prepared to realize that you may have kicked a hornet's nest...)
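If an ordering that does not depend on the OS locale is what is wanted, one option is to avoid locale collation altogether. The sketch below assumes a newer R than the 2.14.0 sessions in this thread: the method = "radix" argument of order() and sort() arrived in R 3.3.0 and compares character vectors as in the C locale. On older versions, Sys.setlocale("LC_COLLATE", "C") before calling rank() or order() has a similar effect.

```r
x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# method = "radix" sorts character vectors in C-locale (byte) order,
# whatever the session's LC_COLLATE happens to be.
o_x  <- order(x,  method = "radix")   # 2 3 1
o_xa <- order(xa, method = "radix")   # 2 3 1
s_xa <- sort(xa,  method = "radix")   # "1_9a" "2_9a" "_1_a"

# Show which collation the session itself is using.
Sys.getlocale("LC_COLLATE")
```

In byte order '_' sorts after the digits, so it lands last in both x and xa, and appending "a" does not change the relative order.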
>
> This said, I can't reproduce.
>
>> x <- c("_1_", "1_9", "2_9")
>> xa <- paste(x,'a',sep='')
>> rank(x)
> [1] 1 2 3
>> rank(xa)
> [1] 1 2 3
>
>> sessionInfo()
> R version 2.14.0 Patched (2006-00-00 r00000)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C LC_TIME=English_United States.1252
>
> attached base packages:
> [1] grDevices datasets splines graphics stats tcltk utils methods base
>
> other attached packages:
> [1] svSocket_0.9-51 TinnR_1.0.3 R2HTML_2.2 Hmisc_3.8-3 survival_2.36-9
>
> loaded via a namespace (and not attached):
> [1] cluster_1.14.1 grid_2.14.0 lattice_0.19-33 svMisc_0.9-63 tools_2.14.0
>
>
> 2011/12/7 Hervé Pagès <hpages at fhcrc.org>:
>> Hi,
>>
>> This looks OK:
>>
>>> x <- c("_1_", "1_9", "2_9")
>>> rank(x)
>> [1] 1 2 3
>>
>> But this does not:
>>
>>> xa <- paste(x, "a", sep="")
>>> xa
>> [1] "_1_a" "1_9a" "2_9a"
>>> rank(xa)
>> [1] 2 1 3
>>
>> Cheers,
>> H.
>>
>>> sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
>> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.14.0
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
>
> --
> Joris Meys
> Statistical consultant
>
> Ghent University
> Faculty of Bioscience Engineering
> Department of Mathematical Modelling, Statistics and Bio-Informatics
>
> tel : +32 9 264 59 87
> Joris.Meys at Ugent.be
> -------------------------------
> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com