[R] Working with string

Marc Schwartz marc_schwartz at me.com
Thu Jul 7 18:24:29 CEST 2011


On Jul 7, 2011, at 11:21 AM, Bogaso Christofer wrote:

> Hi there, I have to extract some relevant portion from a defined string,
> which is a mix of numeric and character. However this has following
> sequence:
> 
> 
> 
> Some String - Some numerical - "c/C" (or "p/P") - then again some set of
> numbers.
> 
> 
> 
> Examples of such string is "fdahsdfcha163517253c463278643" or
> "fdahsdfcha163517253C463278643" or "fdahsdfcha163517253P463278643",
> "fdahsdfcha163517253p463278643" etc.
> 
> 
> 
> I have tried using the latest stringr package to accomplish that. Here is my
> try:
> 
> 
> 
>> library(stringr)
> 
>> str_extract("fdahsdfcha163517253c463278643", "[c]")
> 
> [1] "c"
> 
> 
> 
> But it seems that, above code fetching "c" from "fdahsdfcha" only. My goal
> is to understand what is there between above 2 set of numbers, "C/c/P/p"?
> Can somebody help me how to do that? I would like to use stringr syntax
> because, I am already using lot of other functions from that. Therefore if I
> can do it using that package then it would be good in terms of consistency.
> 
> 
> 
> Thanks for your help.


I don't use 'stringr', but you can get the desired result using ?gsub:

x <- c("fdahsdfcha163517253c463278643", "fdahsdfcha163517253C463278643", 
       "fdahsdfcha163517253P463278643", "fdahsdfcha163517253p463278643")


> gsub(".+[0-9]+([cCpP])[0-9]+", "\\1", x)
[1] "c" "C" "P" "p"


The regex in the first argument tells gsub to find a sequence of any characters, followed by a sequence of digits, followed by a single 'c', 'C', 'p' or 'P', and finally followed by another sequence of digits.

Surrounding the [cCpP] in parens allows us to use a 'back reference' and return what is found within the parens using the "\\1" in the second argument.
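
As a minimal sketch of the same capture-group idea, the base R functions regexec() and regmatches() can return the captured letter directly instead of going through a substitution. The helper name extract_flag is hypothetical and only for illustration. One difference worth noting: gsub() returns a string unchanged when the pattern does not match, whereas this sketch returns NA in that case.

```r
x <- c("fdahsdfcha163517253c463278643", "fdahsdfcha163517253C463278643",
       "fdahsdfcha163517253P463278643", "fdahsdfcha163517253p463278643")

# Hypothetical helper: return the letter sitting between the two digit runs,
# or NA when the string does not have that shape.
extract_flag <- function(s) {
  # regexec() reports the position of the full match and of each capture group
  m <- regmatches(s, regexec("[0-9]+([cCpP])[0-9]+$", s))
  # Each element of m is c(full match, group 1), or character(0) if no match
  vapply(m, function(hit) if (length(hit) == 2) hit[2] else NA_character_,
         character(1))
}

extract_flag(x)                 # the letter from each string
extract_flag("no digits here")  # NA, because the pattern does not match
```

Anchoring the pattern with "$" ties the match to the trailing digits, so a 'c' or 'p' inside the leading text cannot be picked up by accident.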

From a brief review of the stringr manual, it looks like str_extract() supports the use of a regex for the pattern argument, but does not support the use of back references. It looks like str_replace_all() is a wrapper to gsub(), so you may want to look at that function and the examples for it. Thus, the syntax might be something like:

  str_replace_all(x, ".+[0-9]+([cCpP])[0-9]+", "\\1")

so I am not sure what you are really saving by using it rather than calling gsub() directly.
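
If staying within stringr matters, one possible alternative is str_match(), which returns the full match and each capture group as columns of a character matrix. This is a sketch that assumes a reasonably current stringr installation; the requireNamespace() guard is only there so the snippet does nothing when the package is missing.

```r
x <- c("fdahsdfcha163517253c463278643", "fdahsdfcha163517253C463278643",
       "fdahsdfcha163517253P463278643", "fdahsdfcha163517253p463278643")

# Run only if stringr is installed (assumption: a current release)
if (requireNamespace("stringr", quietly = TRUE)) {
  # Column 1 holds the full match, column 2 the first capture group
  m <- stringr::str_match(x, "[0-9]+([cCpP])[0-9]+")
  print(m[, 2])
}
```

Strings that do not match the pattern produce NA in every column, which makes malformed input easier to spot than with a replacement-based approach.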

HTH,

Marc Schwartz


