[R] Regex exercise

Sat Aug 21 01:39:02 CEST 2010

> For regular expression aficionados, I'd like a cleverer solution to
> the following problem (my solution works just fine for my needs; I'm
> just trying to improve my regex skills):
> 
> Given the string (entered, say, at a readline prompt):
> 
> "1  2 -5, 3- 6 4  8 5-7 10"  ## only integers will be entered
> 
> parse it to produce the numeric vector:
> 
> c(1, 2, 3, 4, 5, 3, 4, 5, 6, 4, 8, 5, 6, 7, 10)
> 
> Note that "-" in the expression is used to indicate a range of values
> instead of ":"
> 
> Here's my UNclever solution:
> 
> First convert more than one space to a single space and then replace
> "<any spaces>-<any spaces>" by ":" by:
> 
> >  x1 <- gsub(" *- *",":",gsub(" +"," ",resp))  #giving
> > x1
> [1] "1 2:5, 3:6 4 8 5:7 10"    ## Note that the comma remains
> 
> Next convert the single string into a character vector via strsplit by
> splitting on anything but ":" or a digit:
> 
> > x2 <- strsplit(x1,split="[^:[:digit:]]+")[[1]]  #giving
> > x2
> [1] "1"    "2:5"  "3:6" "4"    "8"    "5:7"  "10"
> 
> Finally, parse() the vector, eval() each element, and unlist() the
> resulting list of numeric vectors:
> 
> >  unlist(lapply(parse(text=x2),eval)) #giving, as desired,
> [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10
> 
> 
> This seems far too clumsy and circumlocutory not to have a more
> elegant solution from a true regex expert.
> 
> (Special note to Thomas Lumley: This seems one of the few instances
> where eval(parse(...)) may actually be appropriate.)

Howdy.  I don't know that I can produce anything less circumlocutory, but I
note that your "x2" form has a simple-enough structure that it can be further
parsed with regular expressions, i.e., as opposed to using parse and eval.  I
don't know that this is an improvement -- just a variation on the theme.

I've appended an example.

-- Mike

#### Original vector
x <- "1  2 -5, 3- 6 4  8 5-7 10"; x

#### Convert ranges to standard R form
x1 <- gsub("[ ]*-[ ]*", ":", x); x1

#### Get rid of the comma
x2 <- gsub(",", " ", x1); x2

#### Remove extra spaces
x3 <- gsub("[ ]+", " ", x2); x3

#### Split off elements, now in standard form
x4 <- unlist(strsplit(x3, " ")); x4

#### Use regular expression for simple parse of elements
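x5 <- sapply(x4, function(a) {
          ## Anchored patterns with "+" so multi-digit bounds such as "10:12" stay intact
          n1 <- sub("^([[:digit:]]+):[[:digit:]]+$", "\\1", a)
          n2 <- sub("^[[:digit:]]+:([[:digit:]]+)$", "\\1", a)
          as.integer(n1):as.integer(n2)}, USE.NAMES=FALSE); x5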
x6 <- unlist(x5); x6
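
#### Possible one-function variant

The steps above could probably be folded into a single function. What follows is only a sketch under my own assumptions: the name parse_ranges is made up for illustration, and it pulls the tokens out with regmatches() and gregexpr() rather than splitting on spaces, so commas and stray blanks are simply skipped. It assumes the input holds only non-negative integers and "-" ranges, as in the original question.

```r
# Hypothetical helper (not from the original post): expand "a-b" ranges
# in a string of non-negative integers.
parse_ranges <- function(s) {
  # Collapse "<spaces>-<spaces>" to ":" so each range becomes a single token
  s <- gsub("[ ]*-[ ]*", ":", s)
  # Extract every "n" or "n:m" token; commas and spaces are ignored
  tokens <- regmatches(s, gregexpr("[[:digit:]]+(:[[:digit:]]+)?", s))[[1]]
  # Lower bound: drop everything from the colon onward (no-op for a lone number)
  from <- as.integer(sub(":.*$", "", tokens))
  # Upper bound: drop everything up to the colon (no-op for a lone number)
  to <- as.integer(sub("^.*:", "", tokens))
  # Expand each pair into a sequence and flatten
  unlist(Map(seq, from, to))
}

parse_ranges("1  2 -5, 3- 6 4  8 5-7 10")
```

Because the patterns use "+" on the digit class, a range such as "10-12" should expand to 10, 11, 12 as well.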

##########################################################

> #### Original vector
> x <- "1  2 -5, 3- 6 4  8 5-7 10"; x
[1] "1  2 -5, 3- 6 4  8 5-7 10"
> 
> #### Convert ranges to standard R form
> x1 <- gsub("[ ]*-[ ]*", ":", x); x1
[1] "1  2:5, 3:6 4  8 5:7 10"
> 
> #### Get rid of the comma
> x2 <- gsub(",", " ", x1); x2
[1] "1  2:5  3:6 4  8 5:7 10"
> 
> #### Remove extra spaces
> x3 <- gsub("[ ]+", " ", x2); x3
[1] "1 2:5 3:6 4 8 5:7 10"
> 
> #### Split off elements, now in standard form
> x4 <- unlist(strsplit(x3, " ")); x4
[1] "1"   "2:5" "3:6" "4"   "8"   "5:7" "10" 
> 
> #### Use regular expression for simple parse of elements
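> x5 <- sapply(x4, function(a) {
+           ## Anchored patterns with "+" so multi-digit bounds such as "10:12" stay intact
+           n1 <- sub("^([[:digit:]]+):[[:digit:]]+$", "\\1", a)
+           n2 <- sub("^[[:digit:]]+:([[:digit:]]+)$", "\\1", a)
+           as.integer(n1):as.integer(n2)}, USE.NAMES=FALSE); x5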
[[1]]
[1] 1

[[2]]
[1] 2 3 4 5

[[3]]
[1] 3 4 5 6

[[4]]
[1] 4

[[5]]
[1] 8

[[6]]
[1] 5 6 7

[[7]]
[1] 10

> x6 <- unlist(x5); x6
 [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10
>
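
#### Cross-check against parse/eval

As a sanity check, the parse()/eval() route from the original question and a route that never evaluates the text can be compared on the same input. This is a rough sketch, not part of the session above; the names tokens, via_eval and via_split are my own. The second route splits each token on ":" and builds the sequence with seq(), so nothing typed at the prompt is ever run as R code.

```r
# Input string from the thread
x <- "1  2 -5, 3- 6 4  8 5-7 10"

# Turn "-" ranges into ":" and split on anything that is not ":" or a digit
tokens <- strsplit(gsub(" *- *", ":", x), "[^:[:digit:]]+")[[1]]
tokens <- tokens[nzchar(tokens)]  # guard against an empty leading token

# Route 1: parse() and eval() each token, as in the original question
via_eval <- unlist(lapply(parse(text = tokens), eval))

# Route 2: split each token on ":" and build the sequence directly
via_split <- unlist(lapply(strsplit(tokens, ":", fixed = TRUE), function(p) {
  p <- as.integer(p)
  seq(p[1], p[length(p)])  # a lone number gives a length-1 sequence
}))

# eval() returns doubles for bare numbers, so compare after coercion
stopifnot(identical(as.integer(via_eval), via_split))
```

The coercion in the last line is needed because eval() of a bare "1" gives a double, while "2:5" gives integers.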