[Rd] Change in grep behavior from 1.9.0 to R-patched
Martin Maechler
maechler at stat.math.ethz.ch
Fri Jun 11 17:21:43 CEST 2004
>>>>> "Roger" == Roger D Peng <rpeng at jhsph.edu>
>>>>> on Fri, 11 Jun 2004 10:43:57 -0400 writes:
Roger> I've noticed a change in the way grep() behaves between the 1.9.0
Roger> release and a recent R-patched. On 1.9.0 I get the following output:
>> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
Roger> [1] 84
Roger> And on R-patched (2004-06-11) I get
>> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
Roger> [1] 13
I can reproduce this exactly.
<....>
Roger> I didn't find anything in the NEWS file that would indicate a change
Yes: the src/extras/pcre/ (Perl Compatible Regular Expressions)
library was upgraded. We assumed that wouldn't have any effect
(too optimistically, as we now see), so it wasn't documented
in NEWS.
Roger> and another problem is that I'm not sure which behavior is correct.
Roger> My knowledge of regular expressions is limited.
The first one is correct, I think: '\w' means word constituents
(see below), and for 1.9.0 you get
> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
[1] "l1pm10tmean" "l1pm25tmean" "l1cotmean" "l1no2tmean" "l1so2tmean"
[6] "l1o3tmean" "l2pm10tmean" "l2pm25tmean" "l2cotmean" "l2no2tmean"
[11] "l2so2tmean" "l2o3tmean" "l3pm10tmean" "l3pm25tmean" "l3cotmean"
[16] "l3no2tmean" "l3so2tmean" "l3o3tmean" "l4pm10tmean" "l4pm25tmean"
[21] "l4cotmean" "l4no2tmean" "l4so2tmean" "l4o3tmean" "l5pm10tmean"
[26] "l5pm25tmean" "l5cotmean" "l5no2tmean" "l5so2tmean" "l5o3tmean"
[31] "l6pm10tmean" "l6pm25tmean" "l6cotmean" "l6no2tmean" "l6so2tmean"
[36] "l6o3tmean" "l7pm10tmean" "l7pm25tmean" "l7cotmean" "l7no2tmean"
[41] "l7so2tmean" "l7o3tmean" "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"
[46] "lm1no2tmean" "lm1so2tmean" "lm1o3tmean" "lm2pm10tmean" "lm2pm25tmean"
[51] "lm2cotmean" "lm2no2tmean" "lm2so2tmean" "lm2o3tmean" "lm3pm10tmean"
[56] "lm3pm25tmean" "lm3cotmean" "lm3no2tmean" "lm3so2tmean" "lm3o3tmean"
[61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean" "lm4no2tmean" "lm4so2tmean"
[66] "lm4o3tmean" "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean" "lm5no2tmean"
[71] "lm5so2tmean" "lm5o3tmean" "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"
[76] "lm6no2tmean" "lm6so2tmean" "lm6o3tmean" "lm7pm10tmean" "lm7pm25tmean"
[81] "lm7cotmean" "lm7no2tmean" "lm7so2tmean" "lm7o3tmean"
>
which is correct AFAICS and shouldn't be shortened to only these 13 elements
> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
[1] "l1pm10tmean" "l1pm25tmean" "l1cotmean" "l1no2tmean" "l1so2tmean"
[6] "l1o3tmean" "l2pm10tmean" "l2pm25tmean" "l2cotmean" "l2no2tmean"
[11] "l2so2tmean" "l2o3tmean" "l3pm10tmean"
in R-patched.
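
To see which elements go missing, one can compare the perl = TRUE
result against an explicit character class matched by the default
engine. This is only a sketch: 'ids' is a hypothetical hand-picked
subset of the names shown above (plus two strings that should not
match), standing in for the full vector x, and '[[:alnum:]_]' is
assumed to be an adequate substitute for '\w' on plain ASCII names.

```r
## Hypothetical stand-in for x: a few names from the output above,
## plus two strings that do not start with "l".
ids <- c("l1pm10tmean", "l3pm10tmean", "l3pm25tmean",
         "lm1pm10tmean", "lm7o3tmean", "tmean", "pm10tmean")

## PCRE engine, using '\w' as in the original report
pcre <- grep("^l\\w+tmean", ids, perl = TRUE, value = TRUE)

## Default engine, with the word class spelled out explicitly
ere <- grep("^l[[:alnum:]_]+tmean", ids, value = TRUE)

## Elements matched by the explicit class but dropped by '\w';
## on a build where '\w' behaves as documented this is empty.
dropped <- setdiff(ere, pcre)
print(dropped)
```

A non-empty 'dropped' would point at the PCRE side, since the
explicit class does not go through '\w' at all.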
------------
For me, 'man perlre' contains
>> \w Match a "word" character (alphanumeric plus "_")
<......>
>> A "\w" matches a single alphanumeric character or "_", not a whole
>> word. Use "\w+" to match a string of Perl-identifier characters (which
>> isn't the same as matching an English word). If "use locale" is in
>> effect, the list of alphabetic characters generated by "\w" is taken
>> from the current locale. See the perllocale manpage. .......
so it may well be connected to locale problems. But I don't
think any locale should have
"l2pm25tmean" matched by '^l\w+tmean' but not match
"lm5pm25tmean"
[If it made a difference between these two, it should rather be
the other way round].
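
A small check of that expectation, as a sketch only: it assumes a
build where '\w' follows the perlre description quoted above, and
the vector names 'pair' and 'chars' are made up for illustration.

```r
## The two names discussed above
pair <- c("l2pm25tmean", "lm5pm25tmean")

## Both should match the original pattern
both_match <- grepl("^l\\w+tmean", pair, perl = TRUE)
print(both_match)

## Both consist of word characters only
all_word <- grepl("^\\w+$", pair, perl = TRUE)
print(all_word)

## '\w' taken one character at a time: letter, digit and "_" are
## word characters; "-" and " " are not.
chars <- c("m", "5", "_", "-", " ")
is_word <- grepl("^\\w$", chars, perl = TRUE)
print(is_word)
```

If 'both_match' came out as TRUE FALSE, that would reproduce the
R-patched behaviour on just these two strings.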
Martin Maechler