[Rd] invalid regular expression '[a-Z]'

Henrik Bengtsson hb at stat.berkeley.edu
Thu Mar 6 03:42:51 CET 2008


On Wed, Mar 5, 2008 at 6:40 PM, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> On Wed, Mar 5, 2008 at 6:18 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
>  > On 05/03/2008 8:56 PM, Henrik Bengtsson wrote:
>  >  > Hi,
>  >  >
>  >  > just curious, but does anyone know the source/reason of observing the
>  >  > following error on OSX but not on WinXP and Linux?
>  >
>  >  Presumably in the locale you're using on OSX, "a" < "Z" is false.  This
>  >  is the ascii sort order used in the C locale.  On my Windows box, "a" <
>  >  "Z" is true, because it uses the English_Canada.1252 collation order.
>
>  That's it indeed.  The person who first reported the error had
>  sessionInfo() locale
>  'en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8' and I
>  missed that 'C' in the middle, which I guess his system falls back to
>  if none of the previous ones exist?!?
>
>  Now I can reproduce it on both Windows and Linux:
>
>  > Sys.setlocale("LC_ALL", "C")
>  [1] "C"
>  > regexpr("[a-Z]", "foo")
>  Error in regexpr("[a-Z]", "foo") : invalid regular expression '[a-Z]'
>  > Sys.setlocale("LC_ALL", "en")
>  [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>  > regexpr("[a-Z]", "foo")
>  [1] 1
>  attr(,"match.length")
>  [1] 1
>
>  Case almost closed, but then the question is why don't you get an
>  error in one of the two cases '[a-Z]' and '[A-z]' then with the other
>  locale(s)?
>
>  > Sys.setlocale("LC_ALL", "en")
>  [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
>  > regexpr("[a-Z]", "foo")
>  [1] 1
>  attr(,"match.length")
>  [1] 1
>  > regexpr("[A-z]", "foo")
>  [1] 1
>  attr(,"match.length")
>  [1] 1
>  > "a" < "Z"
>  [1] TRUE
>  > "a" > "Z"
>  [1] FALSE

My bad...

> "A" < "z"
[1] TRUE
>  > regexpr("[A-z]", "foo")
>  [1] 1
>  attr(,"match.length")
>  [1] 1
> "z" < "A"
[1] FALSE
> regexpr("[z-A]", "foo")
Error in regexpr("[z-A]", "foo") : invalid regular expression '[z-A]'
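
In the transcripts above, a bracket range is rejected exactly when its first endpoint compares greater than its last. That can be checked without aborting a script. What follows is only a minimal sketch: is_valid_regex() is a hypothetical helper, not something in base R, and it assumes that regexpr() signals an ordinary R error for a pattern it cannot compile, as in the sessions above. The result for '[a-Z]' is deliberately left unstated, since this thread shows it varies with the setup.

```r
# Hypothetical helper (not part of base R): TRUE if `pattern` compiles as a
# regular expression in the current session, FALSE if regexpr() raises an error.
is_valid_regex <- function(pattern) {
  tryCatch({
    # Matching against an empty string is enough to force pattern compilation.
    suppressWarnings(regexpr(pattern, ""))
    TRUE
  }, error = function(e) FALSE)
}

# Where "a" sorts relative to "Z" in this session, and under which collation.
print(Sys.getlocale("LC_COLLATE"))
print("a" < "Z")

# Explicit ranges and the POSIX class do not rely on how "a" and "Z" compare.
print(is_valid_regex("[a-zA-Z]"))
print(is_valid_regex("[[:alpha:]]"))

# A reversed range such as the one in the transcript above is rejected.
print(is_valid_regex("[z-A]"))

# Setup-dependent, per this thread; no particular result is claimed here.
print(is_valid_regex("[a-Z]"))
```

If the last call returns FALSE on some machine, replacing the pattern with '[a-zA-Z]' or '[[:alpha:]]' avoids the dependence on collation order.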

Case closed

/Henrik

>
>  Thanks
>
>  /Henrik
>
>
>
>  >
>  >  Duncan Murdoch
>  >
>  >
>  >   I've tried with a
>  >  > few different versions of R (v2.5.1, v2.6.1, v2.6.2, v2.7.0devel).
>  >  > The locale does not seem to affect the error, i.e. I've tested a few
>  >  > different and it is still only OSX that gives the error but not the
>  >  > other two.
>  >  >
>  >  >> regexpr("[a-Z]", "foo")
>  >  > Error in regexpr(pattern, text, extended, fixed, useBytes) :
>  >  >         invalid regular expression '[a-Z]'
>  >  >> regexpr("[a-zA-Z]", "foo")
>  >  > [1] 1
>  >  > attr(,"match.length")
>  >  > [1] 1
>  >  >> regexpr("[A-z]", "foo")
>  >  > [1] 1
>  >  > attr(,"match.length")
>  >  > [1] 1
>  >  >
>  >  > At least now I know that the safest is to use '[a-zA-Z]' (or
>  >  > possibly '[[:alpha:]]').
>  >  >
>  >  > /Henrik
>  >  >
>  >
>  >
>


