[R] unexpected behaviour of sub() / usage of regexp
Sarah Goslee
sarah.goslee at gmail.com
Fri Dec 9 15:37:27 CET 2011
But I do get the incorrect result with R 2.14.0 on Linux:
> sub('[[:digit:]]{1,2}', '', '9ewww')
[1] "www"
And also:
> sub('[[:digit:]]{1,2}', '', 'ewww9')
[1] "ww9"
> sub('\\d{1,2}', '', 'ewww9')
[1] "ww9"
But:
> sub('\\d', '', 'ewww9')
[1] "ewww"
> sub('\\d*', '', '9ewww')
[1] "ewww"
So it seems to be something about the way the curly braces are
handled, but only when they follow a character class such as
[[:digit:]] or \d; with a literal character they work as expected:
> sub('e{1,2}', '', '9ewww')
[1] "9www"
> sub('9{1,2}', '', '9ewww')
[1] "ewww"
But, as Prof. Ripley's email suggests, perl=TRUE solves the problem.
(I was trying out various combinations when it appeared in my inbox.)
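For reference, a minimal sketch of that workaround, assuming the only change needed is adding perl = TRUE to the calls above (the variable names are just for illustration):

```r
# The two test strings used above: digit at the start and digit at the end.
x <- c('9ewww', 'ewww9')

# Same patterns as before, but matched with the PCRE engine instead of
# the default one.
posix_class <- sub('[[:digit:]]{1,2}', '', x, perl = TRUE)
perl_class  <- sub('\\d{1,2}', '', x, perl = TRUE)

print(posix_class)
print(perl_class)
```

With perl = TRUE both calls should remove only the digit and leave "ewww" for each string.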
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
On Fri, Dec 9, 2011 at 9:25 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 09/12/2011 9:20 AM, Jannis wrote:
>>
>> Dear R users,
>>
>>
>> the way I understand the documentation of sub() and regexp the following
>> code:
>>
>>
>>
>> sub('[[:digit:]]{1,2}', '', '9ewww')
>>
>>
>>
>> ... should yield:
>>
>> 'ewww'
>>
>>
>> It returns, however:
>>
>> 'www'
>>
>>
>> Why is this the case? My code should just substitute 1 (minimum) or up to
>> 2 (maximum) digits, i.e. numbers and not the 'e' in the string. Do I
>> misinterpret something here?
>
>
> I get your expected output of "ewww" running 2.14.0 or 2.14.0-patched on
> Windows. So it's not a universal problem...
>
> Duncan Murdoch
>
>>
>> Thanks for any ideas
>> Jannis
>>
>>
>> > sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: i686-pc-linux-gnu (32-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
--
Sarah Goslee
http://www.functionaldiversity.org