[R] Regular expressions: bug or misunderstanding?

Gabor Grothendieck ggrothendieck at gmail.com
Mon Jul 7 01:37:49 CEST 2008


Look at the discussion of zero-width lookahead and lookbehind assertions in ?regex.
Use perl = TRUE as previously indicated.
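
A minimal sketch of how that advice might be applied to the E/q version of
the problem quoted below. It assumes a negative lookbehind is acceptable in
place of the (^|[^E]) group; the helper name escape_quotes is hypothetical
and does not come from the thread. Because the lookbehind is zero-width, the
character before the run of escapes is checked but not consumed, so adjacent
quotes can each be matched.

```r
# Hypothetical helper; E is the escape character and q the quote,
# following the simplified notation in the quoted message.
escape_quotes <- function(x) {
  # (?<!E)     zero-width negative lookbehind: not preceded by an escape
  # ((?:EE)*)  an even number of escapes (possibly none), captured as \1
  # q          the unescaped quote
  gsub("(?<!E)((?:EE)*)q", "\\1Eq", x, perl = TRUE)
}

# The four test strings from the quoted message.
escape_quotes(c("Eq", "EEq", "qaq", "qq"))
# expected elements: "Eq", "EEEq", "EqaEq", "EqEq"
```

PCRE lookbehinds must have a fixed length, which is why the variable-length
run of escape pairs is kept in the consumed part of the pattern and only the
single-character "not E" test is moved into the assertion.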

On Sun, Jul 6, 2008 at 7:29 PM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
> On 06/07/2008 5:37 PM, (Ted Harding) wrote:
>>
>> On 06-Jul-08 21:17:04, Duncan Murdoch wrote:
>>>
>>> I'm trying to write a gsub() call that takes a string and escapes all the
>>> unescaped quote marks in it.  So the string
>>>
>>> \"
>>>
>>> would be left unchanged, but
>>>
>>> \\"
>>>
>>> would be changed to
>>>
>>> \\\"
>>>
>>> because the double backslash doesn't act as an escape for the quote,
>>> the first just escapes the second.  I have the usual problems of
>>> writing regular expressions involving backslashes which make
>>> everything I write completely unreadable, so I'm going to change
>>> the problem for this post:  I will define E to be the escape
>>> character, and q to be the quote; the gsub() call would leave
>>>
>>> Eq
>>>
>>> unchanged, but would change
>>>
>>> EEq
>>>
>>> to EEEq, etc.
>>>
>>> The expression I have come up with after this change is
>>>
>>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)
>>>
>>> i.e. "(start of line, or non-escape, followed by an even number of
>>> escapes), all of which we call expression 1, followed by a quote,
>>> is replaced by expression 1 followed by an escape and a quote".
>>>
>>> This works sometimes, but not always:
>>>
>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
>>> [1] "Eq"
>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
>>> [1] "EEEq"
>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
>>> [1] "EqaEq"
>>>  > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
>>> [1] "qEq"
>>>
>>> Notice that in the final example, the first quote doesn't get escaped.
>>> Why not????
>>
>> I think (without having done the "experimental diagnostics")
>> that it's because in "qq" the first q matches (^|[^E]) because
>> it matches [^E] (i.e. is a "non-escape"); since it is followed
>> by q, it is the second q which gets the escape. Possibly you
>> need to include "^q" as an additional alternative match at the
>> start of the line.
>
> Thanks, that sounds right, but now I can't see how to fix it.  Is there
> syntax to say:  match A only if it follows B, but don't match the B part?
>
> Duncan Murdoch