[Rd] Change in grep behavior from 1.9.0 to R-patched
Marc Schwartz
MSchwartz at MedAnalytics.com
Fri Jun 11 17:46:37 CEST 2004
On Fri, 2004-06-11 at 10:28, Prof Brian Ripley wrote:
> This is actually PCRE. Something is wrong with your build of R-patched
> (1.9.1 alpha, I assume): I get 84 everywhere. You are asking for a first
> character l, then one or more characters of `word' then tmean. In your
> example this is the same as (in a suitable locale, including C)
>
> length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
> length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
>
> which each give 84.
>
> One issue: PCRE is locale-dependent. Did you use the same locale for
> each? What happens if you force LANG=C?
>
> (I've just checked an R-devel Solaris system. This gave 13 on a build
> from Weds, and 84 when remade today. The result with 13 seems truncated,
> as they are the first 13. Might be coincidental, of course.)
The above is confirmed using Version 1.9.1 alpha (2004-06-10) on FC2:
> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
[1] 84
> length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
[1] 84
Also, to demonstrate Roger's follow-up example:
> d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))
> summary(d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   13.00   13.00   14.14   13.00   84.00
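For anyone who wants to try the same repeat-and-tabulate check without fetching names.R, here is a minimal sketch. The vector x below is a made-up stand-in, not Roger's data, and the names pat, counts and consistent are only illustrative. On a correctly working build every repetition should return the same count, so table() should show a single value.

```r
# Hypothetical stand-in for the names in names.R (not the real data).
x <- c(paste0("l", 1:5, "tmean"), "l_tmean", "ltmean", "tmean", "other")

pat <- "^l\\w+tmean"

# Repeat the identical match many times, as in the replicate() call above.
d <- replicate(1000, length(grep(pat, x, perl = TRUE, value = TRUE)))

# Tabulate the counts; more than one distinct value means the matcher
# is returning different answers for the same input.
counts <- table(d)
print(counts)

# TRUE when every repetition returned the same number of matches.
consistent <- length(unique(d)) == 1
print(consistent)

# Report the character-type locale, since PCRE classes can depend on it.
print(Sys.getlocale("LC_CTYPE"))
```

Using table() rather than summary() shows how many of the 1000 runs gave each distinct count, which is easier to read when only a handful of runs differ.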
BTW: pcre-4.5-2
HTH,
Marc Schwartz