[R] Search and extract string function
Gabor Grothendieck
ggrothendieck at gmail.com
Thu Jul 15 17:47:33 CEST 2010
On Thu, Jul 15, 2010 at 10:48 AM, AndrewPage <savejarvis at yahoo.com> wrote:
>
> Hi all,
>
> I'm trying to write a function that will search and extract from a long
> character string, but with a twist: I want to use the characters before and
> the characters after what I want to extract as reference points. For
> example, say I'm working with data entries that look like this:
>
> Drink=Coffee:Location=Office:Time=Morning:Market=Flat
>
> Drink=Water:Location=Office:Time=Afternoon:Market=Up
>
> Drink=Water:Location=Gym:Time=Evening:Market=Closed
>
> Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed
>
>
> ...
>
> For my function, I'd like to find what's located between "Location=", and
> ":Time=" in every instance, and extract it, to return something like
> "Office, Office, Gym, Restaurant".
>
> In a previous discussion I found
> (http://tolstoy.newcastle.edu.au/R/help/05/03/0344.html), someone wrote a
> function where you could find and substitute characters in a string, based
> on "pre" and "post" variables:
>
> interp <- function(x, e = parent.frame(), pre = "\\$", post = "" ) {
>   for(el in ls(e)) {
>     tag <- paste(pre, el, post, sep = "")
>     if (length(grep(tag, x))) x <- gsub(tag, eval(parse(text = el), e), x)
>   }
>   x
> }
>
> I'm not sure how to modify it, however, to do what I want it to do. Any
> suggestions?
The strapply function in the gsubfn package can do that. By default it
returns the back reference, i.e. the part of the string matched by the
parenthesized portion of the regular expression:
> s <- c("Drink=Coffee:Location=Office:Time=Morning:Market=Flat",
+ "Drink=Water:Location=Office:Time=Afternoon:Market=Up",
+ "Drink=Water:Location=Gym:Time=Evening:Market=Closed",
+ "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed")
>
> library(gsubfn)
> strapply(s, "Location=(.*):Time", simplify = TRUE)
[1] "Office" "Office" "Gym" "Restaurant"
>
> # since we know that the field we want is composed of
> # word characters and followed by a non-word character
> # we can even avoid specifying :Time by matching
> # word characters (\\w+) instead:
>
> strapply(s, "Location=(\\w+)", simplify = TRUE)
[1] "Office" "Office" "Gym" "Restaurant"
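
If you would rather not depend on an extra package, something along the
same lines can be sketched in base R with regexec and regmatches. This is
only a rough sketch: extract_between is a hypothetical helper name (it is
not part of gsubfn or base R), and it assumes that pre and post are valid
regular expressions and that only the first match per string is wanted.

```r
# extract_between: hypothetical helper, not from gsubfn or base R.
# Returns the text found between 'pre' and 'post' in each element of x,
# or NA where the pattern does not match.
extract_between <- function(x, pre, post) {
  # (.*?) is a lazy group, so it stops at the first occurrence of 'post'
  pattern <- paste0(pre, "(.*?)", post)
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # each hit is c(full match, captured group), or character(0) if no match
  vapply(hits,
         function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

s <- c("Drink=Coffee:Location=Office:Time=Morning:Market=Flat",
       "Drink=Water:Location=Office:Time=Afternoon:Market=Up",
       "Drink=Water:Location=Gym:Time=Evening:Market=Closed",
       "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed")

locs <- extract_between(s, "Location=", ":Time=")
locs
# "Office" "Office" "Gym" "Restaurant"

# collapse into a single comma-separated string
paste(locs, collapse = ", ")
# "Office, Office, Gym, Restaurant"
```

Because pre and post are pasted straight into the pattern, any regex
metacharacters in them (such as "$" or ".") would need to be escaped first.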
See http://gsubfn.googlecode.com for more.