[R] Regular Expressions

Fri Nov 5 08:09:13 CET 2010

On Thu, 4 Nov 2010, Noah Silverman wrote:

> Hi,
>
> I'm trying to figure out how to use capturing parentheses in regular 
> expressions in R.  (Doing this in Perl, Java, etc. is fairly trivial, but I 
> can't seem to find the functionality in R.)
>
> For example, given the string:    "10 Nov 13.00 (PFE1020K13)"
>
> I want to capture the first two digits and then the month abbreviation.
>
> In perl, this would be
>
> /^(\d\d)\s(\w\w\w)\s/
>
> Then I have the variables $1 and $2 assigned to the capturing parentheses.
>
> I've found the grep and sub commands in R, but the docs don't indicate any 
> way to capture things.
>
> Any suggestions?

Read the link to ?regexp.  It *does* 'indicate the way to capture 
things'.

      The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
      previously matched by the Nth parenthesized subexpression of the
      regular expression.  (This is an extension for extended regular
      expressions: POSIX defines them only for basic ones.)

and there is an example on the help page for grep():

      ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
      gsub("([ab])", "\\1_\\1_", "abc and ABC")

In your example

x <- "10 Nov 13.00 (PFE1020K13)"
regex <- "(\\d\\d)\\s(\\w\\w\\w).*"
sub(regex, "\\1", x, perl = TRUE)
sub(regex, "\\2", x, perl = TRUE)
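
If both captured pieces are wanted from a single match, closer to Perl's 
$1 and $2, one possible sketch uses regexec() and regmatches().  This 
assumes a version of R recent enough to provide those two functions; the 
names pattern, parts, day and month are only illustrative.

```r
x <- "10 Nov 13.00 (PFE1020K13)"

# Anchored pattern with two capturing groups, written with POSIX classes
# so that perl = TRUE is not needed.
pattern <- "^([[:digit:]]{2})[[:space:]]([[:alpha:]]{3})[[:space:]]"

# regexec() records the positions of the full match and of each group;
# regmatches() turns those positions into substrings.
m <- regexec(pattern, x)
parts <- regmatches(x, m)[[1]]

# parts[1] is the whole match; the groups follow in order.
day   <- parts[2]
month <- parts[3]
```

Unlike the sub() approach, this returns only the matched pieces and does 
not rely on the pattern consuming the rest of the string.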

A better way to do this would be something like

regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"

which is also a POSIX extended regexp.
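
As a rough illustration, the POSIX form can be used without perl = TRUE, 
and sub() is vectorized over its input.  The second string below is a 
made-up example, not from the original question.

```r
# Second element is hypothetical, added only to show vectorization.
x <- c("10 Nov 13.00 (PFE1020K13)", "21 Dec 17.50 (PFE1020L17)")

regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"

# Each call replaces the whole matched string by one captured group.
day   <- sub(regex, "\\1", x)
month <- sub(regex, "\\2", x)
```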

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595