[Rd] Change in grep behavior from 1.9.0 to R-patched
Roger D. Peng
rpeng at jhsph.edu
Fri Jun 11 17:54:21 CEST 2004
I have the following two environment variables set:
LANGVAR=en_US.UTF-8
LANG=C
I don't know exactly what either of these means, but I always
deliberately set LANG=C in my .tcshrc file, since that is necessary to
get Acrobat Reader working on my Red Hat system. My guess is that both
were set this way at build time.
When I run Brian's two alternatives, I *always* get 84, no matter how
many times I repeat it. However, when I use \w+, I sometimes get 13
and sometimes get 84 (say, when repeated 1000 times).
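
For anyone who wants to try the same repetition check, here is a rough sketch. The helper name `count_matches` and the small stand-in vector are made up for illustration; substitute the real `x` read from names.R. It only tabulates which counts occur and prints the locale settings, so it makes no claim about what a given build will return.

```r
# Hypothetical helper: count the elements of x that match a PCRE pattern.
count_matches <- function(pattern, x) {
    length(grep(pattern, x, perl = TRUE, value = TRUE))
}

# Small stand-in vector (an assumption); replace with the real `x` from names.R.
x <- c("l1tmean", "l2tmean", "tmean", "lag.tmean")

# The original pattern and the two alternatives from the quoted reply.
patterns <- c("^l\\w+tmean", "^l[A-Za-z0-9]+tmean", "^l[[:alnum:]_]+tmean")

# Repeat each pattern many times and list the distinct counts seen;
# more than one distinct value means the result is not stable.
for (p in patterns) {
    counts <- replicate(1000, count_matches(p, x))
    cat(p, ":", paste(unique(counts), collapse = ", "), "\n")
}

# Report the locale settings in effect for this session.
print(Sys.getlocale())
print(Sys.getenv(c("LANG", "LC_ALL", "LC_CTYPE")))
```

Running this in sessions started with different LANG settings should show whether the instability follows the locale or the build.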
-roger
Prof Brian Ripley wrote:
> This is actually PCRE. Something is wrong with your build of R-patched
> (1.9.1 alpha, I assume): I get 84 everywhere. You are asking for a first
> character l, then one or more characters of `word' then tmean. In your
> example this is the same as (in a suitable locale, including C)
>
> length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
> length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
>
> which each give 84.
>
> One issue: PCRE is locale-dependent. Did you use the same locale for
> each? What happens if you force LANG=C?
>
> (I've just checked an R-devel Solaris system. This gave 13 on a build
> from Weds, and 84 when remade today. The result with 13 seems truncated,
> as they are the first 13. Might be coincidental, of course.)
>
> On Fri, 11 Jun 2004, Roger D. Peng wrote:
>
>
>>I've noticed a change in the way grep() behaves between the 1.9.0
>>release and a recent R-patched. On 1.9.0 I get the following output:
>>
>> > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>> > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
>>[1] 84
>>
>>And on R-patched (2004-06-11) I get
>>
>> > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>> > length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
>>[1] 13
>>
>>I can't come up with a simpler example which is why I've posted my
>>actual character vector on the web (please let me know if there are
>>problems downloading it).
>>
>>I didn't find anything in the NEWS file that would indicate a change
>
>
> No change is intended and the underlying C code is unchanged.
>
>
>>and another problem is that I'm not sure which behavior is correct.
>>My knowledge of regular expressions is limited.
>
>