[R] Regular Expressions

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Nov 5 08:09:13 CET 2010

On Thu, 4 Nov 2010, Noah Silverman wrote:

> Hi,
> I'm trying to figure out how to use capturing parentheses in regular 
> expressions in R.  (Doing this in Perl, Java, etc. is fairly trivial, but I 
> can't seem to find the functionality in R.)
> For example, given the string:    "10 Nov 13.00 (PFE1020K13)"
> I want to capture the first two digits and then the month abbreviation.
> In perl, this would be
> /^(\d\d)\s(\w\w\w)\s/
> Then I have the variables $1 and $2 assigned to the capturing parentheses.
> I've found the grep and sub commands in R, but the docs don't indicate any 
> way to capture things.
> Any suggestions?

Read the link to ?regexp.  It *does* 'indicate a way to capture things':

      The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
      previously matched by the Nth parenthesized subexpression of the
      regular expression.  (This is an extension for extended regular
      expressions: POSIX defines them only for basic ones.)

and there is an example on the help page for grep():

      ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
      gsub("([ab])", "\\1_\\1_", "abc and ABC")
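The quoted passage describes backreferences inside the pattern, while the grep() example uses them in the replacement. A minimal sketch of both uses follows; the input strings and variable names are made up for illustration and are not from the help pages.

```r
# Backreference inside the pattern: "(.)\\1" matches any character
# that is immediately followed by a copy of itself.
has_double <- grepl("(.)\\1", c("book", "bank"))

# Backreference in the replacement: swap two captured words.
swapped <- sub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2 \\1", "hello world")

print(has_double)
print(swapped)
```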

In your example:

x <- "10 Nov 13.00 (PFE1020K13)"
regex <- "(\\d\\d)\\s(\\w\\w\\w).*"
sub(regex, "\\1", x, perl = TRUE)  # first group: the two digits
sub(regex, "\\2", x, perl = TRUE)  # second group: the month abbreviation
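One thing to watch with this approach: sub() returns its input unchanged when the pattern does not match, so a non-matching string comes back whole rather than as a missing value. A possible guard is sketched below. The leading "^" anchor, the second test string and the names 'matched', 'day' and 'month' are additions for illustration only.

```r
x <- c("10 Nov 13.00 (PFE1020K13)", "no date here")
regex <- "^(\\d\\d)\\s(\\w\\w\\w).*"

# Check which elements match before extracting, and use NA otherwise.
matched <- grepl(regex, x, perl = TRUE)
day   <- ifelse(matched, sub(regex, "\\1", x, perl = TRUE), NA_character_)
month <- ifelse(matched, sub(regex, "\\2", x, perl = TRUE), NA_character_)

print(day)
print(month)
```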

A better way to do this would be something like

regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"

which is also a POSIX extended regexp.
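As a sketch of how that pattern might be used, the calls below drop perl = TRUE, since the character classes do not need it. The second half uses regexec() with regmatches() to get both groups in one step; regexec() was added to R after this message was written (R 2.14.0), so treat that part as an assumption about your R version. The names 'day', 'month' and 'm' are illustrative.

```r
x <- "10 Nov 13.00 (PFE1020K13)"
regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"

# One sub() call per group, using R's default regular expression engine.
day   <- sub(regex, "\\1", x)
month <- sub(regex, "\\2", x)

# Alternative: extract the full match and all groups at once.
# m[1] is the whole match, m[2] and m[3] are the two groups.
m <- regmatches(x, regexec(regex, x))[[1]]

print(day)
print(month)
print(m[2:3])
```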

