[Rd] invalid regular expression '[a-Z]'

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Mar 6 08:09:27 CET 2008

On Wed, 5 Mar 2008, Henrik Bengtsson wrote:

> On Wed, Mar 5, 2008 at 6:18 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
>> On 05/03/2008 8:56 PM, Henrik Bengtsson wrote:
>> > Hi,
>> >
>> > just curious, but does anyone know the source/reason of observing the
>> > following error on OSX but not on WinXP and Linux?
>>  Presumably in the locale you're using on OSX, "a" < "Z" is false.  This
>>  is the ascii sort order used in the C locale.  On my Windows box, "a" <
>>  "Z" is true, because it uses the English_Canada.1252 collation order.
> That's it indeed.  The person who first reported the error had
> sessionInfo() locale
> 'en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8' and I
> missed that 'C' in the middle, which I guess his system falls back to
> if none of the previous ones exist?!?

No.  Those are settings for various categories, just as you showed for 
Windows.  The first setting appears to be LC_COLLATE, but what they mean is 
not documented on the system man page for setlocale.

It's just that MacOS uses C collation order in English locales, even 
though almost everyone else uses aAbB or AaBb (the latter being what the 
English actually use, as do almost all book indices in dialects of 
English).  But then there is no surprise that MacOS has to be different 
... its implementation of locales is idiosyncratic (to be generous).
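
To see the effect of C collation on any platform, something along these 
lines can be tried.  It is only a minimal sketch: the variable names 
(old, a_before_Z, c_order) are illustrative, and it assumes the session 
is allowed to change LC_COLLATE.

```r
# Query the collation category on its own rather than reading the
# combined locale string reported by sessionInfo().
old <- Sys.getlocale("LC_COLLATE")

# Switch to C collation, where characters compare by their codes.
Sys.setlocale("LC_COLLATE", "C")

# In ASCII, 'Z' (90) comes before 'a' (97), so this comparison is FALSE
# and a range such as [a-Z] has its end points in the wrong order.
a_before_Z <- "a" < "Z"

# All upper-case letters sort before all lower-case ones: "ABab"
c_order <- paste(sort(c("a", "B", "A", "b")), collapse = "")

print(a_before_Z)
print(c_order)

# Restore the previous collation setting.
Sys.setlocale("LC_COLLATE", old)
```

Comparing Sys.getlocale("LC_COLLATE") between two machines is usually 
quicker than decoding the slash-separated string by eye.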

Note that even [A-Za-z] is unsafe -- as I recall Z is in the middle of the 
alphabet in Estonian locales.  If you want alphabetic characters, use 
[[:alpha:]].  If you want ASCII alphabetic characters, write out the 
ranges as [AB...Zab...z].

E.g. (F8 Linux)

> Sys.setlocale("LC_COLLATE", "et_EE.utf8")
[1] "et_EE.utf8"
> paste(sort(c(letters,LETTERS)), collapse="")
[1] "AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsZzTtUuVvWwXxYy"
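
As a sketch of the two alternatives suggested above (the names 
ascii_alpha and x are made up for the illustration), the explicit ASCII 
class can be built from the LETTERS and letters constants instead of 
being typed by hand, so that no range is ever interpreted through the 
collation order:

```r
# Build the bracket expression by listing every ASCII letter, with no
# ranges involved.
ascii_alpha <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")

x <- c("abc", "Zed", "123", "_x")

# Does the string start with an ASCII letter?
starts_ascii <- grepl(paste0("^", ascii_alpha), x)

# Does the string start with a character that is alphabetic in the
# current locale?  This may also match non-ASCII letters.
starts_alpha <- grepl("^[[:alpha:]]", x)

print(starts_ascii)
print(starts_alpha)
```

For these four ASCII-only strings both tests agree; they would differ 
only for input containing non-ASCII letters.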


