[R] Regular Expression returning unexpected results

Lopez, Dan lopez235 at llnl.gov
Tue Oct 29 20:09:59 CET 2013


Hi Jeff,

I reviewed my old lecture notes and saw that the professor did use \1, so I think he was talking about regex in a platform-independent context. But \\1 is clearly the way to write it in an R string literal.
The examples you gave me to study really helped.

I was also going to ask how to identify empty strings AND blank character strings but you will be happy to know that I figured it out on my own:
grep("^ *$",x)
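
For anyone reading along, here is a minimal sketch of what that pattern picks up. The vector x below is made up for illustration and is not from the thread. Note that "^ *$" only allows spaces; the [[:space:]] variant is one possible way to also catch tabs and other whitespace.

```r
# Hypothetical sample data (not from the original thread)
x <- c("abc", "", "   ", " a ", "\t")

# Empty strings and strings made only of spaces
blank_idx <- grep("^ *$", x)
blank_idx          # 2 3

# Variant that also treats tabs and other whitespace as blank
ws_idx <- grep("^[[:space:]]*$", x)
ws_idx             # 2 3 5
```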

Thank you. 

Thank you Sarah, Bert and David too

Dan


-----Original Message-----
From: Jeff Newmiller [mailto:jdnewmil at dcn.davis.CA.us] 
Sent: Tuesday, October 29, 2013 11:08 AM
To: Lopez, Dan; R help (r-help at r-project.org)
Subject: Re: [R] Regular Expression returning unexpected results

Please read and follow the Posting Guide, in particular re plain text email.

Keep in mind that the characters of a string literal in R source have to make it into memory before the regex code can parse them. The regex engine needs a single backslash followed by 1 to recognize a back reference, but the R parser also treats backslashes in string literals as escape characters and removes them. So you need to escape the backslash with a second backslash in your R string literal.

nchar tells you how many characters are in the string. print renders the string as it would need to be entered as R source code. cat sends the string directly to the output (console). Study the output of the following commands at the R prompt.

?Quotes

nchar("^([a-z]+) +\1 +[a-z]+ [0-9]")
print("^([a-z]+) +\1 +[a-z]+ [0-9]")
cat("^([a-z]+) +\1 +[a-z]+ [0-9]")

On most systems, a raw character code 1 is also known as Control-A, but the effect it has on the terminal used as the console may vary according to your setup, and its effect on my system is not clear to me.
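
A small sketch that makes the parser's treatment visible without relying on how the console renders Control-A. The variable names s1 and s2 are only for illustration; utf8ToInt is base R and returns the code points of a string.

```r
s1 <- "\1"    # the R parser reads this as an octal escape: one character, code 1
s2 <- "\\1"   # the R parser reads this as a backslash followed by the digit 1

nchar(s1)       # 1
nchar(s2)       # 2
utf8ToInt(s1)   # 1      (Control-A)
utf8ToInt(s2)   # 92 49  (backslash, "1")
```

So with a single backslash the regex engine never sees a backslash at all, only a literal Control-A character that it tries to match in the text.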

nchar("^([a-z]+) +\\1 +[a-z]+ [0-9]")
print("^([a-z]+) +\\1 +[a-z]+ [0-9]")
cat("^([a-z]+) +\\1 +[a-z]+ [0-9]")
grep("^([a-z]+) +\\1 +[a-z]+ [0-9]",lines)
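
To try this without the clipboard, the four lines from the original post can be typed in directly. This is only a sketch of the comparison. The last call uses a raw string literal, which assumes R 4.0.0 or later and so would not have worked when this thread was written.

```r
# The four lines from the original post, entered by hand
lines <- c("going up and up and up",
           "night night at 8",
           "bye bye from up high",
           "heading, heading by 9")

# Single backslash: the pattern contains a literal Control-A, so nothing matches
single <- grep("^([a-z]+) +\1 +[a-z]+ [0-9]", lines)
single    # integer(0)

# Doubled backslash: the regex engine sees \1 and treats it as a back reference
doubled <- grep("^([a-z]+) +\\1 +[a-z]+ [0-9]", lines)
doubled   # 2

# Raw string (assumes R >= 4.0.0): backslashes are passed through unchanged
raw_pat <- grep(r"(^([a-z]+) +\1 +[a-z]+ [0-9])", lines)
raw_pat   # 2
```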

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

"Lopez, Dan" <lopez235 at llnl.gov> wrote:
>Hi,
>
>So I just took an intro to R programming class and one of the lectures 
>was on Regular Expressions. I've been playing around with various R 
>functions that use Regular Expressions.
>But this has me stumped. This was part of a quiz and I got it right 
>through understanding the syntax. But when I try to run the thing it 
>returns 'integer(0)'. Can you please tell me what I am doing wrong?
>
>#I copied and pasted this:
>going up and up and up
>night night at 8
>bye bye from up high
>heading, heading by 9
>
>#THEN
>lines<-readLines("clipboard")
>#This is what it looks like in R
>lines
>[1] "going up and up and up"
>[2] "night night at 8"
>[3] "bye bye from up high"
>[4] "heading, heading by 9"
>
>#THIS IS WHAT IS NOT WORKING THE WAY I THOUGHT. I was expecting it to 
>return 2.
># "night night at 8" follows the pattern: Begins with a word then has 
>at least one space then the same word then has at least one space then 
>a word then a space then a single digit number.
>grep("^([a-z]+) +\1 +[a-z]+ [0-9]",lines)
>integer(0)
>
>#But simple examples DO work
>grep("[Hh]",lines)
>[1] 2 3 4
>grep('[0-9]',lines)
>[1] 2 4
>
>	[[alternative HTML version deleted]]
>


