[R] Why do my regular expressions require a double escape \\ to get a literal??

Berend Hasselman bhh at xs4all.nl
Fri Mar 2 15:31:17 CET 2012


On 02-03-2012, at 14:13, Roey Angel wrote:

> Hi Bernard, thanks for the quick reply.
> Of course, I understand that an escape is needed because parentheses are reserved symbols in regular expressions.
> My problem is that if I just use \( I get the error:
> 
> Error: '\(' is an unrecognized escape in character string starting "\("
> 
> so in order to get a literal ( I need to use \\(
> which is odd because I've never encountered that in any other language, and none of the R manuals mention it.
> 

It is not odd as the previous poster has already mentioned.

I have encountered this elsewhere (e.g. in awk).

You need the \\ because the expression between your quotes is interpreted twice:
first as a character string (in which \( is illegal but \\ is legal), and then as a regular expression. After the string stage each \\ has become a single \, so the regular expression engine sees \( and \), which match a literal ( and ). These must be escaped in the regular expression because they are metacharacters.
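
The two stages can be looked at separately in a small sketch. The vector x below is made up for illustration (it is not the poster's tax.data); nchar() and cat() show what the string holds after the first stage, which is what the regular expression engine then receives.

```r
# Stage 1: the R parser turns the string literal into stored characters.
# '\\(' typed in source code is stored as two characters: \ and (
pattern <- '\\(.*\\)'
nchar(pattern)       # counts the stored characters, not the typed ones
cat(pattern, "\n")   # prints the string as the regex engine will see it

# Stage 2: the regex engine reads \( and \) as literal parentheses.
x <- c("Bacteria(100)", "Cyanobacteria(84)", "unclassified")  # illustrative data
result <- gsub(pattern, "", x)
print(result)
```

Here nchar(pattern) is 6, since each \\ in the source is a single stored backslash, and the parenthesised numbers are stripped from x.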

If you don't like doing that (the \\), use this instead:

as.data.frame(apply(tax.data, 2, function(x) gsub('[(].*[)]','',x)))

i.e. put the ( and ) in a character class.
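
As a rough check that the two spellings behave the same, the sketch below compares them on a made-up vector x (illustrative data, including a value with a decimal point). The last two lines are an assumption about a case the sample data does not contain: since .* is greedy, a string with two parenthesised groups would lose everything between the first ( and the last ), and a negated class [^)]* is one way to avoid that.

```r
x <- c("Bacteria(100)", "Proteobacteria(99.5)", "unclassified")  # illustrative data

escaped   <- gsub('\\(.*\\)', '', x)   # backslash-escaped parentheses
bracketed <- gsub('[(].*[)]', '', x)   # parentheses inside character classes
print(identical(escaped, bracketed))   # both patterns give the same result here

# Hypothetical input with two groups, not present in the original data:
greedy <- gsub('[(].*[)]', '', "A(1)B(2)")      # removes "(1)B(2)" in one match
tight  <- gsub('[(][^)]*[)]', '', "A(1)B(2)")   # removes "(1)" and "(2)" separately
print(c(greedy = greedy, tight = tight))
```

Inside a character class ( and ) are ordinary characters, so no backslash is needed at either the string stage or the regular expression stage.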

Berend



>> On 02-03-2012, at 09:36, Roey Angel wrote:
>> 
>>> Hi,
>>> I was recently unfortunate enough to have to use regular expressions to sort out some data in R.
>>> I'm working on a data file which contains taxonomical data of bacteria in hierarchical order.
>>> A sample of this file can be generated using:
>>> 
>>> tax.data<- read.table(header=F, con<- textConnection('
>>> G9SS7BA01D15EC  Bacteria(100)    Cyanobacteria(84)    unclassified
>>> G9SS7BA01C9UIR    Bacteria(100)    Proteobacteria(94)    Alphaproteobacteria(89)
>>> G9SS7BA01CM00D    Bacteria(100)    Proteobacteria(99)    Alphaproteobacteria(99)
>>> '))
>>> close(con)
>>> 
>>> What I am trying to do is remove the parentheses and the number inside them (which could contain a decimal point).
>>> I assumed that the following command would solve it, but instead I got an error.
>>> 
>>> tax.data<- as.data.frame(apply(tax.data, 2, function(x) gsub('\(.*\)','',x)))
>>> Error: '\(' is an unrecognized escape in character string starting "\("
>>> 
>>> And it doesn't matter if I use perl = TRUE or not.
>>> To solve it I need to use a double escape sign '\\' before the opening and closing parentheses:
>>> 
>>> tax.data<- as.data.frame(apply(tax.data, 2, function(x) gsub('\\(.*\\)','',x)))
>>> 
>>> This yields the desired result, but I wonder why it works that way.
>>> No other regular expression system I'm used to (e.g. Perl, Shell) works like that.
>>> 
>>> I'm using R 2.14 (but also R 2.10) and I get the same results on Ubuntu and Windows XP.
>>> 
>>> I'd appreciate any explanation.
>> Section "Character vectors" in the R Intro manual.
>> 
>> ?Quotes
>> 
>> The regular expression is provided as a string to gsub. In strings there are escape sequences.
>> To pass a single \ through to the regular expression parser, it has to be escaped with another \ at the string stage: \\
>> 
>> Berend
>> 
>> 


