[Rd] bug in rank(), order(), is.unsorted() on character vector

Thu Dec 8 10:57:02 CET 2011

Hi Paul,

On 11-12-07 10:29 AM, Roebuck,Paul L wrote:
> Do this first and try again.
>
> R>  Sys.setlocale("LC_COLLATE", "C")

OK I see it now (in ?Sys.setlocale):

   Sys.setlocale("LC_COLLATE", "C")   # turn off locale-specific sorting,
                                      #  usually

Thanks all for the answers!

I never really realized how counter-intuitive a collating sequence
could be. For example, the fact that LC_COLLATE=en_CA.UTF-8 doesn't
preserve the order of the strings when a common suffix is added to
them is scary. Also, LC_COLLATE=en_CA.UTF-8 cannot simply ignore the
'_' (underscores) and the '.' (dots): that can only be the first pass,
after which it needs to break ties in a way that defines a total
order. So it looks like the exact definition of this collating
sequence is counter-intuitive and complicated.

Maybe that's just how things are, and the developers who want
portability and reproducibility of their code are already putting
a Sys.setlocale("LC_COLLATE", "C") statement somewhere in their package
to force all their users onto the same collating sequence.
It sounds a little bit drastic though, and it might introduce some
conflicts with other packages.

So maybe a better approach is to alter LC_COLLATE only temporarily,
inside the functions where it matters, i.e. where the returned value
actually depends on the collating sequence? If I don't do this, then
there is no way I can write a test for my function, because the
test would work for me but fail for someone else.
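
A minimal sketch of that idea follows. The helper name with_c_collate()
is hypothetical (it is not part of base R); it only relies on
Sys.getlocale(), Sys.setlocale() and on.exit(), and it restores the
caller's collation when it returns.

```r
# Hypothetical helper (not part of base R): evaluate `expr` with
# LC_COLLATE temporarily set to "C", then restore the previous setting.
with_c_collate <- function(expr) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the caller's collation on exit, even if `expr` fails.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  # `expr` is a promise; it is first evaluated here, under "C" collation.
  force(expr)
}

x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")
with_c_collate(rank(x))    # 3 1 2
with_c_collate(rank(xa))   # 3 1 2
```

Under "C" collation the strings are compared byte by byte, so '_' sorts
after the digits and adding a common suffix leaves the ranks unchanged.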

Actually this is the situation I was facing when I did my first post:
I have a function that downloads a list of sequences from the Ensembl
FTP server, sorts them by name, and returns them to the user. I have
a test for that function and the test was working for me when I was
doing

   tools::testInstalledPackage("MyPackage", types="tests")

but it was failing when I was doing 'R CMD check'. It seems that
the latter alters LC_COLLATE before running the tests (maybe to
LC_COLLATE=C) but not the former. I fixed this by enforcing
LC_COLLATE=C inside my function.
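
A locale-independent variant of such a function and its test might look
like the sketch below. It swaps in a different technique from the one
described above: it assumes R >= 3.3.0, where order() accepts
method = "radix", which is documented to sort character vectors as in
the C locale whatever LC_COLLATE is. The function name sort_by_name()
and the sample data are made up for illustration; this is not the
actual function discussed here.

```r
# Hypothetical stand-in for a function that returns items sorted by name.
# Assumes R >= 3.3.0 for method = "radix".
sort_by_name <- function(seqs) {
  # Radix ordering compares character vectors as in the C locale,
  # so the result does not depend on the user's LC_COLLATE.
  seqs[order(names(seqs), method = "radix")]
}

# Made-up example data (the names reuse the strings from the report).
seqs <- c("_1_a" = "ACGT", "2_9a" = "GGCC", "1_9a" = "TTAA")
sorted <- sort_by_name(seqs)

# A test that should pass under any LC_COLLATE setting.
stopifnot(identical(names(sorted), c("1_9a", "2_9a", "_1_a")))
```

Because the expected order is fixed by byte values rather than by the
session's locale, the same assertion should hold both under
'R CMD check' and under tools::testInstalledPackage().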

A naive question: wouldn't everything be simpler if LC_COLLATE=C
was the default for everybody?

Thanks,
H.

>
>
> On 12/7/11 3:41 AM, "Hervé Pagès"<hpages at fhcrc.org>  wrote:
>
>> Hi,
>>
>> This looks OK:
>>
>>> x<- c("_1_", "1_9", "2_9")
>>> rank(x)
>> [1] 1 2 3
>>
>> But this does not:
>>
>>> xa<- paste(x, "a", sep="")
>>> xa
>> [1] "_1_a" "1_9a" "2_9a"
>>> rank(xa)
>> [1] 2 1 3
>>
>> Cheers,
>> H.
>>
>>> sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>    [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>    [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>    [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>>    [7] LC_PAPER=C                 LC_NAME=C
>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.14.0
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319