[Rd] bug in rank(), order(), is.unsorted() on character vector
Hervé Pagès
hpages at fhcrc.org
Thu Dec 8 10:57:02 CET 2011
Hi Paul,
On 11-12-07 10:29 AM, Roebuck,Paul L wrote:
> Do this first and try again.
>
> R> Sys.setlocale("LC_COLLATE", "C")
OK I see it now (in ?Sys.setlocale):
Sys.setlocale("LC_COLLATE", "C")  # turn off locale-specific sorting, usually
Thanks all for the answers!
I never really realized how counter-intuitive some collating sequences
can be. For example, the fact that LC_COLLATE=en_CA.UTF-8 doesn't preserve
the order of the strings when a common suffix is added to them is scary.
Also, LC_COLLATE=en_CA.UTF-8 cannot just be ignoring the '_' (underscores)
and the '.' (dots): that can only be the first pass, after which it needs
to break ties in a way that defines a total order. So it looks like the
exact definition of this collating sequence is counter-intuitive and
complicated.
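For what it's worth, the three strings from my first post can be checked
under the C collating sequence with something like the sketch below. It
only covers these equal-length strings; it is not a claim that adding a
common suffix preserves the order of arbitrary strings in the C locale
(it doesn't when one string is a prefix of another). The variable names
'old' and 'same' are just for illustration.

```r
# Compare ranks with and without a common suffix under the C collating
# sequence, then restore the previous LC_COLLATE setting.
old <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))

x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep="")

# In the C locale strings are compared byte by byte, so appending "a"
# to these equal-length strings should leave their ranks unchanged.
same <- identical(rank(x), rank(xa))

invisible(Sys.setlocale("LC_COLLATE", old))
same
```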
Maybe that's just how things are and the developers that want
portability and reproducibility of their code are already putting
a Sys.setlocale("LC_COLLATE", "C") statement somewhere in their package
to force all their users to be on the same collating sequence.
It sounds a little bit drastic though and it might introduce some
conflicts with other packages.
So maybe a better approach is to only alter LC_COLLATE temporarily
inside the functions where it matters i.e. where the returned value
actually depends on the collating sequence? If I don't do this, then
there is no way I can write a test for my function because the
test would work for me but fail for someone else.
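A minimal sketch of that temporary approach, assuming a hypothetical
helper sortNamesInCLocale() (not the actual function from my package).
It saves the current LC_COLLATE, switches to "C" for the sort, and relies
on on.exit() to put the caller's setting back even if sort() fails:

```r
# Hypothetical helper: sort character names under the C collating
# sequence without permanently changing the caller's locale.
sortNamesInCLocale <- function(x) {
    old <- Sys.getlocale("LC_COLLATE")
    # Restore the previous collation when the function exits,
    # whether normally or through an error.
    on.exit(Sys.setlocale("LC_COLLATE", old))
    Sys.setlocale("LC_COLLATE", "C")
    sort(x)
}

xa <- c("_1_a", "1_9a", "2_9a")
sorted <- sortNamesInCLocale(xa)
sorted
```

With this, a test can compare the result against a fixed expected vector
regardless of the LC_COLLATE the test happens to run under. Note that in
the C locale '_' sorts after the digits, so "_1_a" should come last here.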
Actually this is the situation I was facing when I did my first post:
I have a function that downloads a list of sequences from the Ensembl
FTP server, sorts them by name, and returns them to the user. I have
a test for that function and the test was working for me when I was
doing
tools::testInstalledPackage("MyPackage", types="tests")
but it was failing when I was doing 'R CMD check'. It seems that
the latter alters LC_COLLATE before running the tests (maybe to
LC_COLLATE=C) but not the former. I fixed this by enforcing
LC_COLLATE=C inside my function.
A naive question: wouldn't everything be simpler if LC_COLLATE=C
was the default for everybody?
Thanks,
H.
>
>
> On 12/7/11 3:41 AM, "Hervé Pagès"<hpages at fhcrc.org> wrote:
>
>> Hi,
>>
>> This looks OK:
>>
>>> x<- c("_1_", "1_9", "2_9")
>>> rank(x)
>> [1] 1 2 3
>>
>> But this does not:
>>
>>> xa<- paste(x, "a", sep="")
>>> xa
>> [1] "_1_a" "1_9a" "2_9a"
>>> rank(xa)
>> [1] 2 1 3
>>
>> Cheers,
>> H.
>>
>>> sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
>> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.14.0
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319