[Rd] Change in grep behavior from 1.9.0 to R-patched

Fri Jun 11 17:36:32 CEST 2004

To make matters a little more interesting, I get some weird behavior 
on R 1.9.0 also.  For example, when I run

x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))

and then run

d <- replicate(1000, length(grep("^l\\w+tmean", x,
                                 perl = TRUE, value = TRUE)))

 > summary(d)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   13.00   13.00   13.00   30.47   13.00   84.00

I see similar behavior on both R 1.9.0 and today's R-patched (I'm running on 
Linux): identical calls mostly return 13 matches, but sometimes as many as 84.  
To me this smells like a memory issue in PCRE.
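
In case the URL above goes away, here is a minimal sketch of the same 
experiment.  The vector is a stand-in rebuilt from the 84 names in Martin's 
1.9.0 output quoted below, plus two made-up non-matching names ("date" and 
"tmean"), so it is an assumption and not the real names.R.  On a build where 
grep() behaves, I would expect every replicate to return 84:

```r
## Stand-in for names.R (assumption): 14 lag prefixes x 6 pollutants = 84 names
lags <- c(paste("l", 1:7, sep = ""), paste("lm", 1:7, sep = ""))
pollutants <- c("pm10", "pm25", "co", "no2", "so2", "o3")
x <- paste(rep(lags, each = length(pollutants)), pollutants, "tmean", sep = "")

## Two hypothetical names that should NOT match the pattern
x <- c(x, "date", "tmean")

## Repeat the identical call many times and tabulate the match counts
d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))
table(d)
```

If table(d) shows more than one distinct value, the result is changing between 
identical calls, which would point at the regex engine and not at the pattern.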

-roger

Martin Maechler wrote:
>>>>>>"Roger" == Roger D Peng <rpeng at jhsph.edu>
>>>>>>    on Fri, 11 Jun 2004 10:43:57 -0400 writes:
> 
> 
>     Roger> I've noticed a change in the way grep() behaves between the 1.9.0 
>     Roger> release and a recent R-patched.  On 1.9.0 I get the following output:
> 
>     >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>     >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
>     Roger> [1] 84
> 
>     Roger> And on R-patched (2004-06-11) I get
> 
>     >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>     >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
>     Roger> [1] 13
> 
> I can reproduce this exactly.
> 
>     <....>
> 
>     Roger> I didn't find anything in the NEWS file that would indicate a change 
> 
> yes: The src/extras/pcre/ (Perl Compatible Regular Expressions)
>      library was upgraded, and since we assumed that wouldn't
>      have any effect --- as we now see, too optimistically ---
>      it wasn't documented in NEWS
> 
>     Roger> and another problem is that I'm not sure which behavior is correct. 
>     Roger> My knowledge of regular expressions is limited.
> 
> The first one is correct I think: '\w' means word constituents
> (see below) and for 1.9.0, 
> you get
> 
>  > grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
>   [1] "l1pm10tmean"  "l1pm25tmean"  "l1cotmean"    "l1no2tmean"   "l1so2tmean"  
>   [6] "l1o3tmean"    "l2pm10tmean"  "l2pm25tmean"  "l2cotmean"    "l2no2tmean"  
>  [11] "l2so2tmean"   "l2o3tmean"    "l3pm10tmean"  "l3pm25tmean"  "l3cotmean"   
>  [16] "l3no2tmean"   "l3so2tmean"   "l3o3tmean"    "l4pm10tmean"  "l4pm25tmean" 
>  [21] "l4cotmean"    "l4no2tmean"   "l4so2tmean"   "l4o3tmean"    "l5pm10tmean" 
>  [26] "l5pm25tmean"  "l5cotmean"    "l5no2tmean"   "l5so2tmean"   "l5o3tmean"   
>  [31] "l6pm10tmean"  "l6pm25tmean"  "l6cotmean"    "l6no2tmean"   "l6so2tmean"  
>  [36] "l6o3tmean"    "l7pm10tmean"  "l7pm25tmean"  "l7cotmean"    "l7no2tmean"  
>  [41] "l7so2tmean"   "l7o3tmean"    "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"  
>  [46] "lm1no2tmean"  "lm1so2tmean"  "lm1o3tmean"   "lm2pm10tmean" "lm2pm25tmean"
>  [51] "lm2cotmean"   "lm2no2tmean"  "lm2so2tmean"  "lm2o3tmean"   "lm3pm10tmean"
>  [56] "lm3pm25tmean" "lm3cotmean"   "lm3no2tmean"  "lm3so2tmean"  "lm3o3tmean"  
>  [61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean"   "lm4no2tmean"  "lm4so2tmean" 
>  [66] "lm4o3tmean"   "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean"   "lm5no2tmean" 
>  [71] "lm5so2tmean"  "lm5o3tmean"   "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"  
>  [76] "lm6no2tmean"  "lm6so2tmean"  "lm6o3tmean"   "lm7pm10tmean" "lm7pm25tmean"
>  [81] "lm7cotmean"   "lm7no2tmean"  "lm7so2tmean"  "lm7o3tmean"  
>  > 
> 
> which is correct AFAICS and shouldn't be shortened to only these 13 elements
> 
> 
>  > grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
> 
>  [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean"   "l1no2tmean"  "l1so2tmean" 
>  [6] "l1o3tmean"   "l2pm10tmean" "l2pm25tmean" "l2cotmean"   "l2no2tmean" 
> [11] "l2so2tmean"  "l2o3tmean"   "l3pm10tmean"
> 
> in R-patched.
> 
> ------------
> 
> For me,  'man perlre' contains
> 
> 
>>>        \w  Match a "word" character (alphanumeric plus "_")
> 
> 
>          <......>
> 
>>>    A "\w" matches a single alphanumeric character or "_", not a whole
>>>    word.  Use "\w+" to match a string of Perl-identifier characters (which
>>>    isn't the same as matching an English word).  If "use locale" is in
>>>    effect, the list of alphabetic characters generated by "\w" is taken
>>>    from the current locale.  See the perllocale manpage. .......
> 
> 
> so it may well be connected to locale problems.  But I don't
> think any locale should  have   
>  "l2pm25tmean" matched by  '^l\w+tmean'   but not match
>  "lm5pm25tmean"
> 
> [If making a difference between these two, it should rather be
>  the other way round].
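
A quick way to check this point in isolation is to run the pattern on just 
those two names, once with \w and once with the character class spelled out.  
This is only a sketch: the explicit class [A-Za-z0-9_] is my assumption of 
what \w should mean here, following the perlre text above, and it sidesteps 
any locale question.  I would expect both calls to return both names:

```r
## The two names singled out above
nm <- c("l2pm25tmean", "lm5pm25tmean")

## Original pattern, using the \w shorthand
with_w <- grep("^l\\w+tmean", nm, perl = TRUE, value = TRUE)

## Same pattern with the word class written out explicitly (assumed equivalent)
with_class <- grep("^l[A-Za-z0-9_]+tmean", nm, perl = TRUE, value = TRUE)

with_w
with_class
```

If the two results differ, the problem is in how \w is being handled.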
> 
> Martin Maechler