[R] Regular Expressions
Brian Diggs
diggsb at ohsu.edu
Fri Nov 5 18:53:57 CET 2010
On 11/5/2010 12:09 AM, Prof Brian Ripley wrote:
> On Thu, 4 Nov 2010, Noah Silverman wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to use capturing parentheses in regular
>> expressions in R. (Doing this in Perl, Java, etc. is fairly trivial,
>> but I can't seem to find the functionality in R.)
>>
>> For example, given the string: "10 Nov 13.00 (PFE1020K13)"
>>
>> I want to capture the first two digits and then the month abbreviation.
>>
>> In perl, this would be
>>
>> /^(\d\d)\s(\w\w\w)\s/
>>
>> Then I have the variables $1 and $2 assigned to the capturing
>> parentheses.
>>
>> I've found the grep and sub commands in R, but the docs don't indicate
>> any way to capture things.
>>
>> Any suggestions?
>
> Read the link to ?regexp. It *does* 'indicate the way to capture
> things'.
>
> The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
> previously matched by the Nth parenthesized subexpression of the
> regular expression. (This is an extension for extended regular
> expressions: POSIX defines them only for basic ones.)
>
> and there is an example on the help page for grep():
>
> ## Double all 'a' or 'b's; "\" must be escaped, i.e., 'doubled'
> gsub("([ab])", "\\1_\\1_", "abc and ABC")
>
> In your example
>
> x <- "10 Nov 13.00 (PFE1020K13)"
> regex <- "(\\d\\d)\\s(\\w\\w\\w).*"
> sub(regex, "\\1", x, perl = TRUE)
> sub(regex, "\\2", x, perl = TRUE)
>
> A better way to do this would be something like
>
> regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"
>
> which is also a POSIX extended regexp.
Is there a standard, built-in way to get both (all) backreferences at
the same time with just one call to sub (or the appropriate function)?
I can cobble something together specifically for 2 backreferences (not
extensively tested):

both_backrefs <- function(pattern, x) {
  # Join the two captured groups with a separator (\034) that is
  # unlikely to occur in the data, then split them apart again.
  s <- sub(pattern, "\\1\034\\2", x)
  matrix(unlist(strsplit(s, "\034")), ncol = 2, byrow = TRUE)
}
both_backrefs(regex, x)

However, putting the parts back together into a string (with a delimiter
that hopefully won't be in the string otherwise) just to use strsplit to
pull them apart seems inelegant (as does making multiple calls to
sub()). sub() (and siblings) surely already have the backreferences as
strings at some point in the processing, but I don't see a way to return
them as a vector or matrix, only to substitute using backreferences
(sub) or return indices pointing to where the matches start (regexpr)
or return the whole string matches (grep with value=TRUE).
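
One possibility, sketched here under assumptions rather than offered as the definitive answer: if the installed R provides regexec() and regmatches() in base (2.14.0 or later, as far as I know), the whole match and every parenthesized group can be pulled out in a single pass, without gluing the pieces together and splitting them again. The second input string and the variable names m and groups are made up for illustration.

```r
# Sketch: extract all capture groups at once.
# Assumption: base R provides regexec() and regmatches().
x <- c("10 Nov 13.00 (PFE1020K13)",
       "05 Dec 7.50 (XYZ0512A7)")   # second string is hypothetical test data
regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"

# regexec() records the positions of the whole match and of each group;
# regmatches() converts those positions to substrings, one vector per input.
m <- regmatches(x, regexec(regex, x))

# Drop the first element of each vector (the whole match) and stack the
# remaining groups into a matrix with one row per input string.
groups <- do.call(rbind, lapply(m, function(v) v[-1]))
groups
```

This assumes every element of x matches the pattern; a non-matching element yields a zero-length vector, so its row would be missing from the matrix and would need separate handling.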
--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University