[R] Regex exercise

Bert Gunter gunter.berton at gene.com
Sat Aug 21 02:21:54 CEST 2010


Thanks Michael:

You are essentially doing the eval and parsing by hand instead of
letting eval(parse()) do the work. I prefer the latter.

However, your code did something that I did not expect and for which I
can find no documentation -- I would have thought it shouldn't work.

... and that is, the return of your sapply is n1:n2  where n1 and n2
are _character values_ (because that's what gsub returns, of course).
I would have thought this would give an error, but in fact it gives
the "correct" result. That is, to my complete surprise:

> "3":"5"
[1] 3 4 5
> seq(from= "3", to= "5")
[1] 3 4 5
> seq.int( "3", "5")
[1] 3 4 5
> "3":5
[1] 3 4 5

all work! Is this behavior documented anywhere and I've missed it, or
is it a secret "feature"? And to what extent does it work, noting
that:

seq(from="3.5",to="5.5",by="1")
Error in to - from : non-numeric argument to binary operator

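For comparison, here is a minimal sketch of the explicit-conversion
route, which does not rely on the implicit coercion at all. The names
lo, hi, rng and steps are just for illustration, and I am assuming the
strings hold valid numbers:

```r
# Illustrative values; in practice these would come from gsub() or strsplit().
lo <- "3"
hi <- "5"

# Convert explicitly, then build the integer range.
rng <- as.integer(lo):as.integer(hi)
print(rng)

# The non-integer case that failed above works once the arguments are numeric.
steps <- seq(from = as.numeric("3.5"), to = as.numeric("5.5"), by = 1)
print(steps)
```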

Cheers,
Bert

On Fri, Aug 20, 2010 at 4:39 PM, Michael Hannon <jm_hannon at yahoo.com> wrote:
>> For regular expression aficionados, I'd like a cleverer solution to
>> the following problem (my solution works just fine for my needs; I'm
>> just trying to improve my regex skills):
>>
>> Given the string (entered, say, at a readline prompt):
>>
>> "1  2 -5, 3- 6 4  8 5-7 10"  ## only integers will be entered
>>
>> parse it to produce the numeric vector:
>>
>> c(1, 2, 3, 4, 5, 3, 4, 5, 6, 4, 8, 5, 6, 7, 10)
>>
>> Note that "-" in the expression is used to indicate a range of values
>> instead of ":"
>>
>> Here's my UNclever solution:
>>
>> First convert more than one space to a single space and then replace
>> "<any spaces>-<any spaces>" by ":" by:
>>
>> >  x1 <- gsub(" *- *",":",gsub(" +"," ",resp))  #giving
>> > x1
>> [1] "1 2:5, 3:6 4 8 5:7 10"    ## Note that the comma remains
>>
>> Next convert the single string into a character vector via strsplit by
>> splitting on anything but ":" or a digit:
>>
>> > x2 <- strsplit(x1,split="[^:[:digit:]]+")[[1]]  #giving
>> > x2
>> [1] "1"    "2:5"  "3:6" "4"    "8"    "5:7"  "10"
>>
>> Finally, parse() the vector, eval() each element, and unlist() the
>> resulting list of numeric vectors:
>>
>> >  unlist(lapply(parse(text=x2),eval)) #giving, as desired,
>> [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10
>>
>>
>> This seems far too clumsy and circumlocutory not to have a more
>> elegant solution from a true regex expert.
>>
>> (Special note to Thomas Lumley: This seems one of the few instances
>> where eval(parse(...)) may actually be appropriate.)
>
> Howdy.  I don't know that I can produce anything less circumlocutory, but I
> note that your "x2" form has a simple-enough structure that it can be further
> parsed with regular expressions, i.e., as opposed to using parse and eval.  I
> don't know that this is an improvement -- just a variation on the theme.
>
> I've appended an example.
>
> -- Mike
>
> #### Original vector
> x <- "1  2 -5, 3- 6 4  8 5-7 10"; x
>
> #### Convert ranges to standard R form
> x1 <- gsub("[ ]*-[ ]*", ":", x); x1
>
> #### Get rid of the comma
> x2 <- gsub(",", " ", x1); x2
>
> #### Remove extra spaces
> x3 <- gsub("[ ]+", " ", x2); x3
>
> #### Split off elements, now in standard form
> x4 <- unlist(strsplit(x3, " ")); x4
>
> #### Use regular expression for simple parse of elements
> x5 <- sapply(x4, function(a) {
>          n1 <- gsub("([[:digit:]]):[[:digit:]]", "\\1", a)
>          n2 <- gsub("[[:digit:]]:([[:digit:]])", "\\1", a)
>          n1:n2}, USE.NAMES=FALSE); x5
> x6 <- unlist(x5); x6
>
> ##########################################################
>
>> #### Original vector
>> x <- "1  2 -5, 3- 6 4  8 5-7 10"; x
> [1] "1  2 -5, 3- 6 4  8 5-7 10"
>>
>> #### Convert ranges to standard R form
>> x1 <- gsub("[ ]*-[ ]*", ":", x); x1
> [1] "1  2:5, 3:6 4  8 5:7 10"
>>
>> #### Get rid of the comma
>> x2 <- gsub(",", " ", x1); x2
> [1] "1  2:5  3:6 4  8 5:7 10"
>>
>> #### Remove extra spaces
>> x3 <- gsub("[ ]+", " ", x2); x3
> [1] "1 2:5 3:6 4 8 5:7 10"
>>
>> #### Split off elements, now in standard form
>> x4 <- unlist(strsplit(x3, " ")); x4
> [1] "1"   "2:5" "3:6" "4"   "8"   "5:7" "10"
>>
>> #### Use regular expression for simple parse of elements
>> x5 <- sapply(x4, function(a) {
> +           n1 <- gsub("([[:digit:]]):[[:digit:]]", "\\1", a)
> +           n2 <- gsub("[[:digit:]]:([[:digit:]])", "\\1", a)
> +           n1:n2}, USE.NAMES=FALSE); x5
> [[1]]
> [1] 1
>
> [[2]]
> [1] 2 3 4 5
>
> [[3]]
> [1] 3 4 5 6
>
> [[4]]
> [1] 4
>
> [[5]]
> [1] 8
>
> [[6]]
> [1] 5 6 7
>
> [[7]]
> [1] 10
>
>> x6 <- unlist(x5); x6
>  [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10
>>
>
>
>
>
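
One caveat with the n1/n2 patterns above: each matches a single digit
on either side of the colon, so a range with multi-digit endpoints such
as "10:12" would not be split cleanly. A minimal sketch of one way
around that, splitting each element on ":" and converting explicitly.
The function name expand_ranges is hypothetical, not from the thread,
and the sketch assumes only non-negative integers and "-" ranges are
entered:

```r
# Hypothetical helper: expand a string like "1  2 -5, 3- 6 10-12" into integers.
expand_ranges <- function(x) {
  # Normalise "a - b" to "a:b", then split on anything that is not a digit or ":".
  x <- gsub(" *- *", ":", x)
  tokens <- strsplit(x, "[^:[:digit:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop empty strings left by leading separators

  pieces <- lapply(strsplit(tokens, ":", fixed = TRUE), function(ends) {
    ends <- as.integer(ends)          # explicit conversion, no reliance on ":" coercion
    seq(ends[1], ends[length(ends)])  # a lone number yields just itself
  })
  unlist(pieces)
}

result <- expand_ranges("1  2 -5, 3- 6 4  8 5-7 10")
print(result)
```

Because the endpoints are split on the colon instead of being matched
digit by digit, expand_ranges("10-12") should give 10 11 12.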


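Since the string is typed at a readline prompt, eval(parse()) will run
whatever is entered. If that is a concern, one option is to check each
element before evaluating it. This is only a sketch; the name
safe_eval_ranges is hypothetical, and it assumes the elements have
already been split as in x2:

```r
# Hypothetical wrapper: only evaluate tokens that look like "n" or "n:m".
safe_eval_ranges <- function(tokens) {
  ok <- grepl("^[[:digit:]]+(:[[:digit:]]+)?$", tokens)
  if (!all(ok)) {
    stop("unexpected input: ", paste(tokens[!ok], collapse = ", "))
  }
  # Same parse/eval/unlist step as in the original solution.
  unlist(lapply(parse(text = tokens), eval))
}

x2 <- c("1", "2:5", "3:6", "4", "8", "5:7", "10")
print(safe_eval_ranges(x2))
```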

-- 
Bert Gunter
Genentech Nonclinical Biostatistics
467-7374
http://devo.gene.com/groups/devo/depts/ncb/home.shtml


