[Rd] gsub('(.).(.)(.)', '\\3\\2\\1', 'gsub') (PR#13617)
waku at idi.ntnu.no
Sun Mar 22 01:40:08 CET 2009
Full_Name: Wacek Kusnierczyk
Version: 2.10.0 r48181
OS: Ubuntu 8.04 Linux 32bit
Submission from: (NULL) (129.241.199.135)
there seems to be something wrong with r's regexing. consider the following
example:
gregexpr('a*|b', 'ab')
# positions: 1 2
# lengths: 1 1
gsub('a*|b', '.', 'ab')
# ..
where the pattern matches any number of 'a's or one 'b', and each match is replaced
with a dot, globally. the answer is correct (assuming a dfa engine). however,
gregexpr('a*|b', 'ab', perl=TRUE)
# positions: 1 2
# lengths: 1 0
gsub('a*|b', '.', 'ab', perl=TRUE)
# .b.
where the pattern is identical, but the result is wrong. perl uses an nfa (if
it used a dfa, the result would still be wrong), and in the above example it
should find *four* matches, collectively including *all* letters in the input,
thus producing *four* dots (and *only* dots) in the output:
perl -le '
$input = qq|ab|;
print qq|match: "$_"| foreach $input =~ /a*|b/g;
$input =~ s/a*|b/./g;
print qq|output: "$input"|;'
# match: "a"
# match: ""
# match: "b"
# match: ""
# output: "...."
since with perl=TRUE both gregexpr and gsub seem to use pcre, i've checked the
example with pcretest, and also with a trivial c program (available on demand)
using the pcre api; there were four matches, exactly as in the perl bit above.
the results above are surprising, and suggest a bug in r's use of pcre rather
than in pcre itself. possibly, the issue is that after an empty string is matched
(with a*, for example), the next attempt does not try to match a non-empty
string at the same position, but instead matches an empty string again at the
next position. for example,
gsub('a|b|c', '.', 'abc', perl=TRUE)
# "...", correct
gsub('a*|b|c', '.', 'abc', perl=TRUE)
# ".b.c.", wrong
gsub('a|b*|c', '.', 'abc', perl=TRUE)
# "..c.", wrong (but now only 'c' remains)
gsub('a|b*|c', '.', 'aba', perl=TRUE)
# "...", incidentally correct
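to make the hypothesis above concrete, here is a minimal python sketch of a global-matching loop. it is python only because that is easy to run; it is not r's or pcre's actual code. it assumes the convention that perl and pcretest appear to follow: after an empty match, retry at the same position for a non-empty match, and only step forward one character if that retry fails. the names global_matches, substitute and retry_nonempty are made up for illustration, the (?<!^)-on-a-slice trick is only a stand-in for pcre's anchored, not-empty retry, and positions are 0-based (r reports them 1-based).

```python
import re


def global_matches(pattern, text, retry_nonempty=True):
    """Return (start, length) for each match of pattern in text, left to right.

    retry_nonempty=True: after an empty match, retry at the same position
    for a non-empty match before stepping forward one character.
    retry_nonempty=False: step forward immediately (the suspected faulty logic).
    """
    regex = re.compile(pattern)
    # Applied to a slice, (?<!^) rejects a match that ends where the slice
    # begins, i.e. an empty match. Slicing hides earlier text from any
    # lookbehind in the pattern, which is acceptable for this toy.
    nonempty = re.compile('(?:%s)(?<!^)' % pattern)
    matches, pos, after_empty = [], 0, False
    while pos <= len(text):
        if after_empty:
            # The previous match was empty and ended at pos.
            m = nonempty.match(text[pos:]) if retry_nonempty else None
            after_empty = False
            if m is None:
                pos += 1  # nothing non-empty starts here: move on one character
                continue
            start, end = pos, pos + m.end()
        else:
            m = regex.search(text, pos)
            if m is None:
                break
            start, end = m.span()
        matches.append((start, end - start))
        after_empty = (end == start)
        pos = end
    return matches


def substitute(pattern, repl, text):
    """Replace every match found by global_matches (with the retry) by repl."""
    out, last = [], 0
    for start, length in global_matches(pattern, text):
        out.append(text[last:start])
        out.append(repl)
        last = start + length
    out.append(text[last:])
    return ''.join(out)


print(global_matches('a*|b', 'ab'))         # [(0, 1), (1, 0), (1, 1), (2, 0)]
print(substitute('a*|b', '.', 'ab'))        # ....
print(global_matches('a*|b', 'ab', False))  # [(0, 1), (1, 0), (2, 0)]
```

with the retry, the sketch finds the same four matches as the perl one-liner ("a", "", "b", "") and produces four dots. without it, 'b' is never matched, which is the kind of behaviour the hypothesis describes; the sketch does not claim to reproduce r's exact output.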
without detailed analysis of the code, i guess the bug is located somewhere in
src/main/pcre.c, and is distributed among the do_p* functions, so that multiple
fixes may be needed.