[R] extracting values from txt with regular expression

Nordlund, Dan (DSHS/RDA) NordlDJ at dshs.wa.gov
Thu Jun 7 20:47:26 CEST 2012


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of emorway
> Sent: Thursday, June 07, 2012 10:41 AM
> To: r-help at r-project.org
> Subject: [R] extracting values from txt with regular expression
> 
> Thanks for your suggestions.  Bert, in your response you raised my awareness
> to "regular expressions".  Are regular expressions the same across various
> languages?  Consider the following line of text:
> 
> txt_line<-" PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY = -0.05"
> 
> It seems Python uses the following line of code to extract the two values in
> "txt_line" and store them in a variable called "v":
> 
> v = re.findall("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?", line)
> #v[0]  0.01
> #v[1]  -0.05
> 
> I tried something similar in R (but it didn't work) by using the same
> regular expression, but got an error:
> 
> edm<-grep("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?",txt_line)
> #Error: '\d' is an unrecognized escape in character string starting "[+-]? *(?:\d"
> 
> I'm not even sure which function in R most efficiently extracts the values
> from "txt_line".  Basically, I want to peel out the values and think I can
> use the decimal point to construct the regular expression, but don't know
> where to go from here?
> 

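I am a regular expression novice, but the error you are getting comes from not doubling the backslashes in your pattern.  In an R string literal the backslash is itself an escape character, so a regex token such as \d has to be written as "\\d".  Note also that grep() only reports which elements of txt_line contain a match, not the matched text itself; gregexpr() together with regmatches() returns the matches.  So this will get you close to what you want (although not necessarily efficiently).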

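# Locate every match; each backslash is doubled inside the R string literal
ndx <- gregexpr("[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?", txt_line)
# Pull out the matched substrings (a list with one character vector per input string)
matched <- regmatches(txt_line, ndx)
matched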
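
If you want the values as numbers rather than strings, something along these lines might do it.  This is only a sketch: the sample line is retyped from your message (the exact spacing is my guess), the names num_pattern, hits and values are made up for illustration, and perl = TRUE is my own addition so that the (?: ) groups go through the PCRE engine.

```r
# Sample line retyped from the question; the exact whitespace is an assumption.
txt_line <- " PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY =          -0.05"

# Same pattern as in the reply, with every backslash doubled for the R literal.
# It only matches numbers that contain a decimal point.
num_pattern <- "[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?"

# cat() shows the pattern as the regex engine sees it (single backslashes).
cat(num_pattern, "\n")

# gregexpr() finds all matches; regmatches() returns them as a list with
# one character vector per element of txt_line.
hits <- regmatches(txt_line, gregexpr(num_pattern, txt_line, perl = TRUE))

# Convert the matches for the first (and only) input string to numeric.
values <- as.numeric(hits[[1]])
values
```

For the line above, values should hold 0.01 and -0.05.  If txt_line were a vector of several lines, hits would have one element per line, and something like lapply(hits, as.numeric) would convert each in turn.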


Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204



