[Rd] Change in grep behavior from 1.9.0 to R-patched
Roger D. Peng
rpeng at jhsph.edu
Fri Jun 11 17:36:32 CEST 2004
To make matters a little more interesting, I get some weird behavior
on R 1.9.0 also. For example, when I run
x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
and then repeat the same grep() call 1000 times, the number of matches
is not the same on every call:
d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value
= TRUE)))
> summary(d)
Min. 1st Qu. Median Mean 3rd Qu. Max.
13.00 13.00 13.00 30.47 13.00 84.00
Similar behavior on both R 1.9.0 and today's R-patched (I'm running on
Linux). To me this smells like a memory issue in PCRE.
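
A minimal sketch for checking whether the count really varies between
calls, for anyone without access to names.R. The vector x below is a
stand-in rebuilt from the 84 names shown in the quoted 1.9.0 output; that
is an assumption, since the real file contains other names as well. On a
build where grep() behaves deterministically, table(d) should show a
single count.

```r
# Stand-in for the contents of names.R (assumption: only the 84 matching
# names from the quoted output are reconstructed here).
lags <- c(1:7, paste("m", 1:7, sep = ""))
pollutants <- c("pm10", "pm25", "co", "no2", "so2", "o3")
x <- paste("l", rep(lags, each = length(pollutants)), pollutants, "tmean",
           sep = "")

# Repeat the identical call and tabulate the distinct match counts.
d <- replicate(100, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))
table(d)
```

More than one row in table(d) would mean that identical calls gave
different answers, which is hard to explain by the pattern or the locale
alone.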
-roger
Martin Maechler wrote:
>>>>>>"Roger" == Roger D Peng <rpeng at jhsph.edu>
>>>>>> on Fri, 11 Jun 2004 10:43:57 -0400 writes:
>
>
> Roger> I've noticed a change in the way grep() behaves between the 1.9.0
> Roger> release and a recent R-patched. On 1.9.0 I get the following output:
>
> >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
> Roger> [1] 84
>
> Roger> And on R-patched (2004-06-11) I get
>
> >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
> >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
> Roger> [1] 13
>
> I can reproduce this exactly.
>
> <....>
>
> Roger> I didn't find anything in the NEWS file that would indicate a change
>
> yes: The src/extras/pcre/ (Perl Compatible Regular Expressions)
> library was upgraded, and since we assumed that wouldn't
> have any effect --- as we now see, too optimistically ---
> it wasn't documented in NEWS
>
> Roger> and another problem is that I'm not sure which behavior is correct.
> Roger> My knowledge of regular expressions is limited.
>
> The first one is correct I think: '\w' means word constituents
> (see below) and for 1.9.0,
> you get
>
> > grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
> [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean" "l1no2tmean" "l1so2tmean"
> [6] "l1o3tmean" "l2pm10tmean" "l2pm25tmean" "l2cotmean" "l2no2tmean"
> [11] "l2so2tmean" "l2o3tmean" "l3pm10tmean" "l3pm25tmean" "l3cotmean"
> [16] "l3no2tmean" "l3so2tmean" "l3o3tmean" "l4pm10tmean" "l4pm25tmean"
> [21] "l4cotmean" "l4no2tmean" "l4so2tmean" "l4o3tmean" "l5pm10tmean"
> [26] "l5pm25tmean" "l5cotmean" "l5no2tmean" "l5so2tmean" "l5o3tmean"
> [31] "l6pm10tmean" "l6pm25tmean" "l6cotmean" "l6no2tmean" "l6so2tmean"
> [36] "l6o3tmean" "l7pm10tmean" "l7pm25tmean" "l7cotmean" "l7no2tmean"
> [41] "l7so2tmean" "l7o3tmean" "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"
> [46] "lm1no2tmean" "lm1so2tmean" "lm1o3tmean" "lm2pm10tmean" "lm2pm25tmean"
> [51] "lm2cotmean" "lm2no2tmean" "lm2so2tmean" "lm2o3tmean" "lm3pm10tmean"
> [56] "lm3pm25tmean" "lm3cotmean" "lm3no2tmean" "lm3so2tmean" "lm3o3tmean"
> [61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean" "lm4no2tmean" "lm4so2tmean"
> [66] "lm4o3tmean" "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean" "lm5no2tmean"
> [71] "lm5so2tmean" "lm5o3tmean" "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"
> [76] "lm6no2tmean" "lm6so2tmean" "lm6o3tmean" "lm7pm10tmean" "lm7pm25tmean"
> [81] "lm7cotmean" "lm7no2tmean" "lm7so2tmean" "lm7o3tmean"
> >
>
> which is correct AFAICS and shouldn't be shortened to only the 13 elements
>
>
> > grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
>
> [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean" "l1no2tmean" "l1so2tmean"
> [6] "l1o3tmean" "l2pm10tmean" "l2pm25tmean" "l2cotmean" "l2no2tmean"
> [11] "l2so2tmean" "l2o3tmean" "l3pm10tmean"
>
> in R-patched.
>
> ------------
>
> For me, 'man perlre' contains
>
>
>>> \w Match a "word" character (alphanumeric plus "_")
>
>
> <......>
>
>>> A "\w" matches a single alphanumeric character or "_", not a whole
>>> word. Use "\w+" to match a string of Perl-identifier characters (which
>>> isn't the same as matching an English word). If "use locale" is in
>>> effect, the list of alphabetic characters generated by "\w" is taken
>>> from the current locale. See the perllocale manpage. .......
>
>
> so it may well be connected to locale problems. But I don't
> think any locale should have
> "l2pm25tmean" matched by '^l\w+tmean' but not match
> "lm5pm25tmean"
>
> [If making a difference between these two, it should rather be
> the other way round].
>
> Martin Maechler
>
>
>
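
One way to probe the locale theory, as a sketch only: compare \w against
an explicit ASCII class, which should not depend on the locale's idea of a
letter. As far as I know, [A-Za-z0-9_] is equivalent to \w for plain ASCII
names like these. Here x is a stand-in built from the 84 names listed in
the quoted output, not the actual contents of names.R.

```r
# Stand-in for the matching names in names.R (assumption, see above).
lags <- c(1:7, paste("m", 1:7, sep = ""))
pollutants <- c("pm10", "pm25", "co", "no2", "so2", "o3")
x <- paste("l", rep(lags, each = length(pollutants)), pollutants, "tmean",
           sep = "")

# Same pattern twice: once with \w, once with an explicit ASCII class.
with_w     <- grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
with_class <- grep("^l[A-Za-z0-9_]+tmean", x, perl = TRUE, value = TRUE)

identical(with_w, with_class)  # do the two patterns agree?
setdiff(with_class, with_w)    # names that only the explicit class matches
```

If the two results differ, the handling of \w is the likely suspect; if
both come back short, the problem probably sits elsewhere in the matching
code.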