[Rd] Change in grep behavior from 1.9.0 to R-patched
Prof Brian Ripley
ripley at stats.ox.ac.uk
Fri Jun 11 17:55:59 CEST 2004
So the consensus is
- it happens equally in 1.9.0 and 1.9.1 alpha current
- it happens in the C locale
- it is random and bursty, as in
> d
  [1] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84
 [25] 84 84 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 [49] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 [73] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
 [97] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
[121] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
[145] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 84 84 84 84 84 84
[169] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84
[193] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 13 13 84 84 84 13 13 13
[217] 84 84 84 13 13 13 84 84 84 13 13 13 84 84 84 13 13 13 13 13 13 13 13 13
...
So it looks like a problem in the PCRE compiled code.
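One way to summarise how bursty the results are is to collapse them into runs. The following is a minimal sketch, not part of the original exchange: it rebuilds only the first 213 values of d as printed above (the rest of the vector is elided in the output) and uses base R's rle(); the name `runs` is purely illustrative.

```r
# First 213 values of d as printed above, written out as consecutive runs
d <- c(rep(84, 26), rep(13, 136), rep(84, 46), rep(13, 2), rep(84, 3))

# rle() collapses consecutive repeats into (value, run length) pairs
r <- rle(d)

# One row per run: the count grep() returned and how many calls in a row gave it
runs <- data.frame(value = r$values, run_length = r$lengths)
print(runs)
```

Long runs of the same answer, rather than an even mix of 13s and 84s, are what "bursty" means here.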
On Fri, 11 Jun 2004, Marc Schwartz wrote:
> On Fri, 2004-06-11 at 10:28, Prof Brian Ripley wrote:
> > This is actually PCRE. Something is wrong with your build of R-patched
> > (1.9.1 alpha, I assume): I get 84 everywhere. You are asking for a first
> > character l, then one or more characters of `word' then tmean. In your
> > example this is the same as (in a suitable locale, including C)
> >
> > length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
I omitted _ there, not that it mattered.
> > length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
> >
> > which each give 84.
> >
> > One issue: PCRE is locale-dependent. Did you use the same locale for
> > each? What happens if you force LANG=C?
> >
> > (I've just checked an R-devel Solaris system. This gave 13 on a build
> > from Weds, and 84 when remade today. The result with 13 seems truncated,
> > as they are the first 13. Might be coincidental, of course.)
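To illustrate how the three patterns discussed above relate, here is a minimal sketch on a made-up vector. The names in `x` are hypothetical and are not taken from names.R. With PCRE in the C locale, \w and [[:alnum:]_] should select the same elements, while [A-Za-z0-9] differs only when an underscore sits between the leading l and tmean.

```r
# Hypothetical names, for illustration only (not the contents of names.R)
x <- c("l1pm10tmean", "l2pm25tmean", "l_x_tmean", "pm10tmean", "ltmean")

# \w: word characters, i.e. letters, digits and underscore
m_word  <- grep("^l\\w+tmean", x, perl = TRUE)
# Explicit class including the underscore
m_alnum <- grep("^l[[:alnum:]_]+tmean", x, perl = TRUE)
# Explicit class without the underscore
m_ascii <- grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE)

print(x[m_word])
print(x[m_alnum])
print(x[m_ascii])
```

A correct PCRE build should give the same answer for a given pattern on every call, so differing counts across repeated identical calls point at the build rather than at the pattern.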
>
>
> The above is confirmed using Version 1.9.1 alpha (2004-06-10) on FC2:
>
> > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> > length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
> [1] 84
> > length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
> [1] 84
>
>
> Also, to demonstrate Roger's follow up example:
>
> > d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))
> > summary(d)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   13.00   13.00   13.00   14.14   13.00   84.00
table(d) is more informative.
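A minimal sketch of the difference, using a hypothetical stand-in for d. The real vector is not reproduced here; 984 values of 13 and 16 values of 84 are assumed only because that split gives a mean matching the 14.14 quoted above.

```r
# Hypothetical stand-in for d, chosen only to match the quoted mean of 14.14
d <- c(rep(13, 984), rep(84, 16))

# The quartiles are all 13, so the 84s surface only in Mean and Max
print(summary(d))

# table() counts each distinct result, showing the split directly
counts <- table(d)
print(counts)
```

With only two distinct outcomes, the counts show how often each occurs, which the five-number summary hides.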
> BTW: pcre-4.5-2
Did you use --with-pcre, though?
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595