[R] regexp,grep: capturing more than one substring

Gabor Grothendieck ggrothendieck at myway.com
Wed Oct 27 16:02:17 CEST 2004


Marc Mamin <M.Mamin <at> intershop.de> writes:

: 
: Hello,
: 
: I would like to have a function that retrieves matching strings in the same 
: way as with java.util.regex (java 1.4.2).
: 
: Example:
: 
: f('^.*(xx?)\\.([0-9]*)$','abcxx.785')
: =>
: c('xx','785')
: 
: First of all: Is it possible to achieve this with grep(... 
: perl=TRUE,value=TRUE )?

Actually you don't even need perl= to do that.  The
function below pastes together a replacement string like
"\\1 \\2", with sep in place of the space, where n
determines how many backreferences there are.  Then it
uses gsub with the regexp in r to replace each string
by its captured groups.  Finally the result is split
at sep into individual strings.
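
To make the three steps concrete, here is a minimal sketch that
runs them one at a time on the example from the question.  The
variable names repl, x and parts are only for illustration, and
n = 2 is supplied by hand.

```r
r   <- "^.*(xx?)\\.([0-9]*)$"   # pattern from the question
s   <- "abcxx.785"
sep <- "\10"                    # backspace; assumed not to occur in s
n   <- 2                        # two capturing groups in r

# Step 1: build the replacement string "\\1<sep>\\2"
repl <- paste("\\", 1:n, sep = "", collapse = sep)

# Step 2: replace the whole match by its captured groups
x <- gsub(r, repl, s)

# Step 3: split at the delimiter, giving one string per group
parts <- strsplit(x, split = sep, fixed = TRUE)[[1]]
print(parts)
```

Note that the leading ".*" is greedy, so the first group will most
likely capture "x" rather than the "xx" expected in the question;
the second group is "785".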

The calculation of n, the number of backreferences, is
not foolproof: it simply counts the "(" characters in r.
You can specify your own n if your expression has
parentheses that are not capturing groups, e.g. an
escaped literal parenthesis.  Also specifying n might
speed it up a bit, e.g. n = 2 in the example.  The value
of sep= should be a delimiter that does not occur in
your strings.
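
As an illustration of when the default count is wrong, the sketch
below uses a made-up pattern containing an escaped literal
parenthesis.  The pattern and the input "foo(42)" are hypothetical
and not from the original question.

```r
# Hypothetical pattern: letters, then digits inside literal parentheses
r <- "^([a-z]+)\\(([0-9]+)\\)$"

# The default calculation counts every "(", escaped or not
n_guess <- nchar(gsub("[^(]", "", r))
print(n_guess)   # 3, although r has only 2 capturing groups

# Supplying the true number of groups by hand
n   <- 2
sep <- "\10"
x <- gsub(r, paste("\\", 1:n, sep = "", collapse = sep), "foo(42)")
parts <- strsplit(x, split = sep, fixed = TRUE)[[1]]
print(parts)     # "foo" "42"
```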

s can be a vector of strings.  It returns a list of
character vectors in any case, one element of the list
for each element of vector s.  If s is just a scalar
string then it will return a one-element list containing
the captured substrings as a vector.  You may wish to
call it like this f(...args...)[[1]] in that case, as
shown in the example.

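# r: regexp, s: character vector, n: number of capturing groups
# (default: count of "(" in r), sep: delimiter not occurring in s
f <- function(r, s, n = nchar(gsub("[^(]","",r)), sep = "\10" ) {
    # replace each match by "\\1<sep>\\2<sep>...", then split at sep
    x <- gsub(r, paste("\\", 1:n, sep = "", collapse = sep), s)
    strsplit(x, split = sep, fixed = TRUE)
}
f( '^.*(xx?)\\.([0-9]*)$', 'abcxx.785' )[[1]]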
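
For comparison, here is a sketch of a different technique, not the
one above: the base R functions regexec and regmatches, which were
added to R after this post was written.  It also shows the
list-per-element result for a vector s.  The inputs and the pattern
are hypothetical.

```r
s <- c("abcxx.785", "zzx.12")      # hypothetical input vector
r <- "^([a-z]+)\\.([0-9]+)$"       # hypothetical pattern, two groups

# regexec locates the full match and each group;
# regmatches extracts them, one character vector per element of s
m <- regmatches(s, regexec(r, s))

# Drop the first entry (the full match) to keep only the groups
groups <- lapply(m, function(v) v[-1])
print(groups)
```

This needs neither n nor a sep delimiter, and an element of s that
does not match yields an empty character vector.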



