[R] unexpected behaviour of sub() / usage of regexp

Fri Dec 9 15:37:27 CET 2011

But I do get the incorrect result on R 2.14.0 on linux:
> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"

And also:

> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"
> sub('[[:digit:]]{1,2}', '', 'ewww9')
[1] "ww9"
> sub('\\d{1,2}', '', 'ewww9')
[1] "ww9"

But:
> sub('\\d', '', 'ewww9')
[1] "ewww"
> sub('\\d*', '', '9ewww')
[1] "ewww"

So it seems to be something about the way the curly braces are
handled, but only with character classes such as [[:digit:]] and \\d,
not with literal characters:

> sub('e{1,2}', '', '9ewww')
[1] "9www"
> sub('9{1,2}', '', '9ewww')
[1] "ewww"
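
To check which patterns are affected on a given installation, the cases
above can be run through both the default engine and perl=TRUE and the
results compared. The following is only a rough sketch: compare_engines()
is a hypothetical helper name, not something from base R, and the list of
cases is simply the set of calls tried in this message.

```r
# Hypothetical helper (not part of base R): run sub() with the default
# regex engine and with perl = TRUE, and record whether the results agree.
compare_engines <- function(pattern, x, replacement = "") {
  default_result <- sub(pattern, replacement, x)
  perl_result <- sub(pattern, replacement, x, perl = TRUE)
  data.frame(pattern = pattern,
             input = x,
             default = default_result,
             perl = perl_result,
             agree = identical(default_result, perl_result),
             stringsAsFactors = FALSE)
}

# The pattern/input pairs tried in this message.
cases <- list(
  c("[[:digit:]]{1,2}", "9ewww"),
  c("[[:digit:]]{1,2}", "ewww9"),
  c("\\d{1,2}", "ewww9"),
  c("\\d", "ewww9"),
  c("\\d*", "9ewww"),
  c("e{1,2}", "9ewww"),
  c("9{1,2}", "9ewww")
)

# One row per case; rows with agree == FALSE point at a discrepancy
# between the two engines on this installation.
report <- do.call(rbind, lapply(cases, function(p) compare_engines(p[1], p[2])))
print(report)
```

On an unaffected build every row should show agree as TRUE; on the setup
described here I would expect the first three rows to disagree.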

But, as Prof. Ripley's email suggests, perl=TRUE solves the problem.
(I was trying out various combinations when it appeared in my inbox.)
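
For reference, a minimal sketch of that workaround; the input vector is
made up for illustration (the first two strings are the ones used above,
the third is an extra assumed example).

```r
# Use the PCRE engine via perl = TRUE so that {1,2} is applied only
# to the digit class.
x <- c("9ewww", "ewww9", "12abc")
cleaned <- sub("[[:digit:]]{1,2}", "", x, perl = TRUE)
print(cleaned)
# [1] "ewww" "ewww" "abc"
```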

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

On Fri, Dec 9, 2011 at 9:25 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 09/12/2011 9:20 AM, Jannis wrote:
>>
>> Dear R users,
>>
>>
>> As I understand the documentation of sub() and regexp, the following
>> code:
>>
>>
>>
>> sub('[[:digit:]]{1,2}', '', '9ewww')
>>
>>
>>
>> ... should yield:
>>
>> 'ewww'
>>
>>
>> It returns, however:
>>
>> 'www'
>>
>>
>> Why is this the case? My code should just substitute 1 (minimum) or up to
>> 2 (maximum) digits, i.e. numbers and not the 'e' in the string. Do I
>> misinterpret something here?
>
>
> I get your expected output of "ewww" running 2.14.0 or 2.14.0-patched on
> Windows.   So it's not a universal problem...
>
> Duncan Murdoch
>
>>
>> Thanks for any ideas
>> Jannis
>>
>>
>> >  sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: i686-pc-linux-gnu (32-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=C                 LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>

-- 
Sarah Goslee
http://www.functionaldiversity.org