[Rd] Change in grep behavior from 1.9.0 to R-patched

Fri Jun 11 18:21:06 CEST 2004

I think I have a solution, which I am just about to commit.  It looks as 
if the PCRE documentation I read is wrong about when it is safe to free 
the locale-specific tables, so I have deferred freeing them until much later.

Incidentally, I cannot make this misbehave on Windows.
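
For anyone wanting to check their own build, here is a minimal sketch of 
the repeated-match test from the quoted thread.  The vector x is a small 
made-up stand-in for the names in names.R (an assumption for illustration, 
not the original data), and the variable names are likewise hypothetical.  
On a correctly working build every repetition should give the same count, 
and the \w pattern should agree with the explicit character classes.

```r
# Hypothetical stand-in for the names read from names.R in the quoted test;
# these values are invented for illustration only.
x <- c("l1tmean", "l2tmean", "l_3tmean", "tmean", "l1pm10", "ltmean")

pat_word  <- "^l\\w+tmean"            # pattern from the original report
pat_class <- "^l[[:alnum:]_]+tmean"   # same thing via a POSIX class plus _
pat_ascii <- "^l[A-Za-z0-9_]+tmean"   # same thing spelled out for ASCII

# Repeat the match many times; a sound build gives one count throughout.
d <- replicate(1000, length(grep(pat_word, x, perl = TRUE, value = TRUE)))

# table() lists each distinct count and how often it occurred, which
# exposes bursts that summary() would blur into a mean.
print(table(d))

n_class <- length(grep(pat_class, x, perl = TRUE, value = TRUE))
n_ascii <- length(grep(pat_ascii, x, perl = TRUE, value = TRUE))
```

If table(d) shows more than one distinct value, the build is showing the 
random behaviour described below.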

On Fri, 11 Jun 2004, Prof Brian Ripley wrote:

> So the consensus is
> 
> - it happens equally in 1.9.0 and 1.9.1 alpha current
> - it happens in the C locale
> - it is random and bursty, as in
> 
> > d
>    [1] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84
>   [25] 84 84 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
>   [49] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
>   [73] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
>   [97] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
>  [121] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
>  [145] 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 84 84 84 84 84 84
>  [169] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84
>  [193] 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 84 13 13 84 84 84 13 13 13
>  [217] 84 84 84 13 13 13 84 84 84 13 13 13 84 84 84 13 13 13 13 13 13 13 13 13
> ...
> 
> So looks like a problem in the PCRE compiled code.
> 
> On Fri, 11 Jun 2004, Marc Schwartz wrote:
> 
> > On Fri, 2004-06-11 at 10:28, Prof Brian Ripley wrote:
> > > This is actually PCRE.  Something is wrong with your build of R-patched
> > > (1.9.1 alpha, I assume): I get 84 everywhere.  You are asking for a first
> > > character l, then one or more characters of `word' then tmean.  In your
> > > example this is the same as (in a suitable locale, including C)
> > > 
> > > length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
> 
> I omitted _ there, not that it mattered.
> 
> > > length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
> > > 
> > > which each give 84.
> > > 
> > > One issue: PCRE is locale-dependent.  Did you use the same locale for 
> > > each?  What happens if you force LANG=C?
> > > 
> > > (I've just checked an R-devel Solaris system.  This gave 13 on a build 
> > > from Weds, and 84 when remade today.  The result with 13 seems truncated, 
> > > as they are the first 13.  Might be coincidental, of course.)
> > 
> > 
> > The above is confirmed using Version 1.9.1 alpha (2004-06-10) on FC2:
> > 
> > > x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> > > length(grep("^l[A-Za-z0-9]+tmean", x, perl = TRUE, value = TRUE))
> > [1] 84
> > > length(grep("^l[[:alnum:]_]+tmean", x, perl = TRUE, value = TRUE))
> > [1] 84
> > 
> > 
> > Also, to demonstrate Roger's follow up example:
> > 
> > > d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))
> > > summary(d)
> >    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
> >   13.00   13.00   13.00   14.14   13.00   84.00 
> 
> table(d) is more informative.
> 
> > BTW: pcre-4.5-2
> 
> Did you use --with-pcre, though?
> 
> 

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595