[R] Regular Expressions
Noah Silverman
noah at smartmediacorp.com
Fri Nov 5 08:29:37 CET 2010
That's perfect!
Don't know how I missed that.
I want to start playing with some modeling of financial data and the
only format I can download is rather ugly. So my plan is to use a
series of regexes to extract what I want.
Noticed that you are a Prof. in applied stats. I'm at UCLA working on
an MS in stats. My department is fairly flexible, so I'm taking several
finance courses as part of my work. Currently debating if I want to
graduate with an MS in June, or roll everything into a PhD and be
finished in an extra 1-2 years.
Thanks!
-N
On 11/5/10 12:09 AM, Prof Brian Ripley wrote:
> On Thu, 4 Nov 2010, Noah Silverman wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to use capturing parentheses in regular
>> expressions in R. (Doing this in Perl, Java, etc. is fairly trivial,
>> but I can't seem to find the functionality in R.)
>>
>> For example, given the string: "10 Nov 13.00 (PFE1020K13)"
>>
>> I want to capture the first two digits and then the month abbreviation.
>>
>> In Perl, this would be
>>
>> /^(\d\d)\s(\w\w\w)\s/
>>
>> Then I have the variables $1 and $2 assigned to the capturing
>> parentheses.
>>
>> I've found the grep and sub commands in R, but the docs don't
>> indicate any way to capture things.
>>
>> Any suggestions?
>
> Read the link to ?regexp. It *does* 'indicate the way to capture
> things'.
>
> The backreference ‘\N’, where ‘N = 1 ... 9’, matches the substring
> previously matched by the Nth parenthesized subexpression of the
> regular expression. (This is an extension for extended regular
> expressions: POSIX defines them only for basic ones.)
>
> and there is an example on the help page for grep():
>
> ## Double all 'a' or 'b's; "\" must be escaped, i.e., 'doubled'
> gsub("([ab])", "\\1_\\1_", "abc and ABC")
>
> In your example
>
> x <- "10 Nov 13.00 (PFE1020K13)"
> regex <- "(\\d\\d)\\s(\\w\\w\\w).*"
> sub(regex, "\\1", x, perl = TRUE)
> sub(regex, "\\2", x, perl = TRUE)
>
> A better way to do this would be something like
>
> regex <- "([[:digit:]]{2})\\s([[:alpha:]]{3}).*"
>
> which is also a POSIX extended regexp.
>
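For anyone reading this in the archive: the sub() calls quoted above pull out one group per call. A possible alternative, not mentioned in the thread, is to get every group in a single pass with regexec() and regmatches(). This is only a sketch and assumes an R version that ships both functions in base (they are newer than this thread); the variable names day and month are just illustrative.

```r
x <- "10 Nov 13.00 (PFE1020K13)"

# POSIX extended regexp in the spirit of the one suggested above,
# anchored at the start of the string as in the Perl version.
regex <- "^([[:digit:]]{2})[[:space:]]([[:alpha:]]{3})[[:space:]]"

# regexec() records the position of the full match and of each
# parenthesized group; regmatches() turns those positions into strings.
m <- regmatches(x, regexec(regex, x))[[1]]

# Element 1 is the whole match, elements 2 and 3 are the two groups,
# roughly what Perl would expose as $1 and $2.
day   <- m[2]
month <- m[3]

print(day)
print(month)
```

If the pattern does not match, m comes back as an empty character vector, so m[2] and m[3] are NA rather than an error. That is worth checking before using the values on messy input.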