[R] extracting values from txt with regular expression

Fri Jun 8 12:31:37 CEST 2012

On Thu, Jun 7, 2012 at 1:40 PM, emorway <emorway at usgs.gov> wrote:
> Thanks for your suggestions.  Bert, in your response you raised my awareness
> to "regular expressions".  Are regular expressions the same across various
> languages?  Consider the following line of text:
>
> txt_line<-" PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY =
> -0.05"
>
> It seems python uses the following line of code to extract the two values in
> "txt_line" and store them in a variable called "v":
>
> v = re.findall("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?", line)
> #v[0]  0.01
> #v[1]  -0.05
>
> I tried something similar in R (but it didn't work) by using the same
> regular expression, but got an error:
>
> edm<-grep("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?",txt_line)
> #Error: '\d' is an unrecognized escape in character string starting "[+-]?
> *(?:\d"
>
> I'm not even sure which function in R most efficiently extracts the values
> from "txt_line".  Basically, I want to peel out the values and think I can
> use the decimal point to construct the regular expression, but don't know
> where to go from here?

Try this.  strapply applies the function (3rd argument) to each match
of the regular expression (2nd argument) and returns the results.  The
regular expression used here matches a minus sign or digit followed by
one or more non-space characters.  That seems good enough for this
simple example but, of course, it can be changed.

> library(gsubfn)
> p <- "[-0-9]\\S+"
> txt_line <- " PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY = -0.05"
>
> strapply(txt_line, p, as.numeric)[[1]]
[1]  0.01 -0.05

Or use strapplyc, which is similar but always uses c as the function
and is optimized for speed:

> as.numeric(strapplyc(txt_line, p)[[1]])
[1]  0.01 -0.05
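
If installing gsubfn is not an option, roughly the same extraction can
be sketched in base R with gregexpr and regmatches.  This is offered as
an alternative to the strapply calls above, not as a description of what
gsubfn does internally; it reuses the same pattern and the same txt_line.

```r
# Base-R sketch (an alternative, not part of gsubfn): locate every match
# of the pattern, pull out the matched substrings, convert to numeric.
txt_line <- " PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY = -0.05"
p <- "[-0-9]\\S+"

m <- gregexpr(p, txt_line)                        # match positions
vals <- as.numeric(regmatches(txt_line, m)[[1]])  # matched text -> numbers
vals
```

Like the strapply version, this relies on the values being the only
tokens in the line that start with a digit or a minus sign.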

If we are only parsing a few lines then speed does not matter, but if
there is a large amount to parse then be sure to have the tcltk
package installed to get the best speed from the gsubfn functions.  (On
Windows and most but not all Linux systems tcltk is installed by
default; on a few you have to install it yourself.)  Without tcltk, the
gsubfn package falls back to R code, which is slower.  Also, as noted,
strapplyc is faster than strapply.  There are arguments and options
that can override the defaults.

The gsubfn home page is at http://gsubfn.googlecode.com

Regular expressions are largely the same but not 100% identical across
languages.  There are some links to regular expression info in
different languages at the bottom of the home page just listed.  Base R
can use either its default (POSIX-style extended) regular expressions
or Perl-compatible ones (perl = TRUE), and the gsubfn functions can, in
addition, use Tcl regular expressions.
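
As for the error in the original attempt: in an R string literal the
backslash is itself an escape character, so \d has to be written as \\d
for the regex engine to see \d.  Also, grep only reports which elements
of a vector contain a match; it does not return the matched text.  A
minimal sketch of the Python pattern carried over to base R, assuming
perl = TRUE so the Perl-style syntax is handled by the PCRE engine:

```r
txt_line <- " PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY = -0.05"

# Same pattern as the Python call, with every backslash doubled for R.
py_pat <- "[+-]? *(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?"

# gregexpr finds all matches; regmatches extracts the matched substrings.
hits <- regmatches(txt_line, gregexpr(py_pat, txt_line, perl = TRUE))[[1]]

# The " *" part of the pattern lets a match begin with spaces;
# as.numeric ignores leading whitespace when converting.
v <- as.numeric(hits)
v
```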

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com