[Rd] bug in rank(), order(), is.unsorted() on character vector

Roebuck,Paul L proebuck at mdanderson.org
Thu Dec 8 17:31:22 CET 2011


On 12/8/11 3:57 AM, "Hervé Pagès" <hpages at fhcrc.org> wrote:

> On 11-12-07 10:29 AM, Roebuck,Paul L wrote:
>> Do this first and try again.
>> 
>> R>  Sys.setlocale("LC_COLLATE", "C")
> 
> OK I see it now (in ?Sys.setlocale):
> 
>    Sys.setlocale("LC_COLLATE", "C")   # turn off locale-specific sorting,
>                                       #  usually
> 
> Thanks all for the answers!
> 
> I never really realized how counter-intuitive a collating sequence
> could be. For example, the fact that LC_COLLATE=en_CA.UTF-8 doesn't
> preserve the order of the strings when a common suffix is added to
> them is scary. Also, it isn't simply that LC_COLLATE=en_CA.UTF-8
> ignores the '_' (underscores) and the '.' (dots): that can only be
> the first pass, after which it needs to break ties in a way that
> defines a total order. So it looks like the exact definition of this
> collating sequence is counter-intuitive and complicated.
> 
> Maybe that's just how things are and the developers that want
> portability and reproducibility of their code are already putting
> a Sys.setlocale("LC_COLLATE", "C") statement somewhere in their package
> to force all their users to be on the same collating sequence.
> It sounds a little bit drastic though and it might introduce some
> conflicts with other packages.
> 
> So maybe a better approach is to only alter LC_COLLATE temporarily
> inside the functions where it matters i.e. where the returned value
> actually depends on the collating sequence? If I don't do this, then
> there is no way I can write a test for my function because the
> test would work for me but fail for someone else.
> 
> Actually this is the situation I was facing when I did my first post:
> I have a function that downloads a list of sequences from the Ensembl
> FTP server, sorts them by name, and returns them to the user. I have
> a test for that function and the test was working for me when I was
> doing
> 
>    tools::testInstalledPackage("MyPackage", types="tests")
> 
> but it was failing when I was doing 'R CMD check'. It seems that
> the latter alters LC_COLLATE before running the tests (maybe to
> LC_COLLATE=C) but not the former. I fixed this by enforcing
> LC_COLLATE=C inside my function.

Another developer here ran into the same problem just two weeks ago:
data processed on different machines (Linux, Windows) gave different
results because of sorting. From my standpoint, I'm very hesitant to
make changes that affect behavior globally, so we changed it at the
function level in the package: do the sort, then reset LC_COLLATE to
its original value using on.exit().
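
A minimal sketch of that pattern, in case it helps. The helper name
sort_c() is made up for illustration and is not the code from our
package; it assumes the platform accepts "C" as a value for LC_COLLATE.

```r
# Hypothetical helper: sort a character vector under the C collating
# sequence without leaving the session's locale changed.
sort_c <- function(x) {
    # Remember the caller's collation setting before touching it.
    old <- Sys.getlocale("LC_COLLATE")
    # Restore it when the function exits, even if sort() fails.
    on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
    Sys.setlocale("LC_COLLATE", "C")
    sort(x)
}

sort_c(c("_1_a", "1_9a", "2_9a"))
# [1] "1_9a" "2_9a" "_1_a"
```

In the C locale the digits sort before the underscore, so the result
no longer depends on the locale of the machine running the code.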

As for analysis reports, I believe we may also need to set LC_COLLATE
to the POSIX locale in ALL our standard Sweave templates to ensure
reproducibility, which is a BIG deal here.
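
For the templates, something along these lines near the top of the
report might be enough. This is only a sketch: where it goes in a
template is an assumption, and x and xa are simply the vectors from
the original report in this thread.

```r
# Assumed to run once at the top of a report, before anything is sorted.
invisible(Sys.setlocale("LC_COLLATE", "C"))

x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# These strings already differ in their first byte, so under byte-wise
# C collation the appended suffix never gets compared.
rx  <- rank(x)
rxa <- rank(xa)
stopifnot(identical(rx, rxa))
```

Unlike a function-level change, this alters the setting for the rest
of the R session that builds the report.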

> 
> A naive question: wouldn't everything be simpler if LC_COLLATE=C
> was the default for everybody?

Sure, but where's the fun in that? :)

>> 
>> 
>> On 12/7/11 3:41 AM, "Hervé Pagès"<hpages at fhcrc.org>  wrote:
>> 
>>> This looks OK:
>>> 
>>>> x<- c("_1_", "1_9", "2_9")
>>>> rank(x)
>>> [1] 1 2 3
>>> 
>>> But this does not:
>>> 
>>>> xa<- paste(x, "a", sep="")
>>>> xa
>>> [1] "_1_a" "1_9a" "2_9a"
>>>> rank(xa)
>>> [1] 2 1 3
>>> 
>>> Cheers,
>>> H.
>>> 
>>>> sessionInfo()
>>> R version 2.14.0 (2011-10-31)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>> 
>>> locale:
>>>    [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>>    [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>>    [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>> 
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.14.0
>>> 


