[R] Need some help with regular expression

David Winsemius dwinsemius at comcast.net
Thu Dec 15 21:29:47 CET 2016


> On Dec 15, 2016, at 8:46 AM, Steven Nagy <nstefi at gmail.com> wrote:
> 
> I tried to send this email, but it didn't go through. I guess pictures are
> not allowed in HTML-formatted emails?
> I'm resending it without the picture, with just a comment there instead as a
> placeholder.
> 
> Thanks,
> Steven
> 
> 
> From: Steven Nagy [mailto:nstefi at gmail.com] 
> Sent: Monday, December 12, 2016 10:50 PM
> To: 'Bert Gunter' <bgunter.4567 at gmail.com>
> Cc: 'R-help' <r-help at r-project.org>
> Subject: RE: [R] Need some help with regular expression
> 
> Hi Bert and all,
> 
> Sorry I was too busy at work and didn't have much time to continue this
> until now.
> So I studied "?regexp" and I can understand your regular expression now:
> sub(".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*","\\1",x)
> 
> But I also wanted to split up these results in 2 columns, so your previous
> command would give me this result:
> [1] "NMA -> STU" "STU -> REG" "-> STU"
> 
> and I wanted to further split them up to show this:
> From	To
> NMA	STU
> STU	REG
> 	STU

So one more step:

> strsplit( sub(".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*","\\1",x), split="-> ")

[[1]]
[1] "NMA " "STU" 

[[2]]
[1] "STU " "REG" 

[[3]]
[1] ""    "STU"

> 
Well, maybe 2:

> sapply(  strsplit( sub(".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*","\\1",x), split="-> "), "[",1 )
[1] "NMA " "STU " ""    
> sapply(  strsplit( sub(".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*","\\1",x), split="-> "), "[",2 )
[1] "STU" "REG" "STU"
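
To get the two-column From/To layout asked for above, the pieces can be trimmed and put into a data frame. This is only a sketch: the name `changes` is made up for illustration, and `x` is the example vector from earlier in the thread, retyped here so the block runs on its own.

```r
# Example vector from earlier in the thread (first element built with paste0
# only to keep the lines short).
x <- c(paste0("Name.MEMBER_TYPE: NMA -> STU ; CATEGORY:  -> 1 ; ",
              "CITY: MISSISSAUGA -> Mississauga ; ZIP: L5N1H9 -> L5N 1H9 ; ",
              "COUNTRY: CAN -> ; MEMBER_STATUS:  -> N"),
       "Name.MEMBER_TYPE: STU -> REG ; CATEGORY: 1 ->",
       "Name.MEMBER_TYPE: -> STU")

# Step 1: keep only the "<from> -> <to>" piece that involves STU.
pairs <- sub(".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*", "\\1", x)

# Step 2: split on the literal arrow and strip the leftover spaces.
parts <- strsplit(pairs, split = "->", fixed = TRUE)
changes <- data.frame(
  From = trimws(sapply(parts, "[", 1)),
  To   = trimws(sapply(parts, "[", 2)),
  stringsAsFactors = FALSE
)
changes
```

Splitting on the bare arrow and calling trimws() afterwards avoids the trailing space that split="-> " leaves on the left-hand values.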
> 


> I still don’t quite understand the backreferences, and how could I have 2
> backreferences, one for the left side of the “->” sign and one for the right
> side?
> 
> So it seems like I need to apply the “sub” function twice, similar to how I
> used the “strapply” function twice in my original post:
> strapply(strapply(a, "(\\w+ -> STU|STU -> \\w+)", c, backref = -1,
> perl = TRUE), "(\\w+) -> (\\w+)", c, backref = -2, perl = TRUE)
> 
> or maybe there would be a simpler way of using only 1 “sub” function and
> 2 backreferences?
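
On the two-backreference question: each parenthesized group in a pattern gets its own number, so "\\1" refers to the first group and "\\2" to the second. A minimal sketch, assuming the "from -> to" strings have already been extracted as shown earlier in the thread (the names `pairs` and `pat` are only illustrative):

```r
# The three extracted strings from earlier in the thread.
pairs <- c("NMA -> STU", "STU -> REG", "-> STU")

# Group 1 is the left side of the arrow, group 2 the right side.
# [[:alnum:]]* may match nothing, which covers the blank "from" case.
pat <- "^([[:alnum:]]*) *-> *([[:alnum:]]*)$"

from <- sub(pat, "\\1", pairs)   # keep only group 1
to   <- sub(pat, "\\2", pairs)   # keep only group 2

# Both backreferences can also be used in one replacement string.
both <- sub(pat, "\\1,\\2", pairs)
```

The pattern is written once and reused, so it is still two sub() calls but only one regular expression to maintain.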
> 
> Also I’m not sure what to do after I get the data. How could I represent
> the member type changes graphically? We need to analyze the behavior of
> switching from STU to another type or from another type to STU.
> Google Analytics has a nice chart under Behavior Flow, or Users Flow, and it
> looks like this:
> <here was my picture from Google Analytics - it's from Behavior Flow or
> Users Flow showing flows from one category to another one and further to
> another one>
> 
> 
> 
> Is there any graphical representation in R that is similar to this?
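
Before any plotting, the changes usually need to be counted per from/to pair. A sketch of that step in base R follows; the three rows are just the examples from this thread, and "(blank)" is a label invented here for the empty source type. An edge list of this shape (source, target, count) is what flow-diagram tools generally expect as input; the networkD3 package has a sankeyNetwork() function that may be worth a look, though it is not used or tested here.

```r
# Illustrative data: the three member type changes from this thread.
changes <- data.frame(
  From = c("NMA", "STU", ""),
  To   = c("STU", "REG", "STU"),
  stringsAsFactors = FALSE
)

# Give the empty source type a visible label (hypothetical choice).
changes$From[changes$From == ""] <- "(blank)"

# Cross-tabulate and keep only the transitions that actually occur.
flows <- as.data.frame(table(From = changes$From, To = changes$To),
                       stringsAsFactors = FALSE)
flows <- flows[flows$Freq > 0, ]
flows
```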
> 
> Thanks a lot,
> Steven
> 
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> Sent: Sunday, November 20, 2016 10:05 PM
> To: Aliz Csonka <lyzae.ro at gmail.com>
> Cc: R-help <r-help at r-project.org>
> Subject: Re: [R] Need some help with regular expression
> 
> Although others may respond, I think you will do much better studying
> ?regexp, which will answer all your questions. I believe the effort you will
> make figuring it out will pay dividends for your future R/regular expression
> usage that you cannot gain from my direct explanation.
> 
> Good luck.
> 
> Best,
> Bert
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> 
> On Sun, Nov 20, 2016 at 6:40 PM, Steven Nagy <nstefi at gmail.com> wrote:
>> Thanks a lot Bert. That's amazing. I am very new to both R and regular 
>> expressions. I don't really understand the regular expression that you 
>> used below.
>> And looks like I don't even need any special library, like the 
>> "gsubfn" for the strapply function.
>> I was trying to use the regexr.com website to analyze your regular 
>> expression, but it doesn't seem to match any text there.
>> Can you explain to me the regular expression that you used?
>> ".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*"
>> So the dot in the front means any character and the star after that 
>> means that it can repeat 0 or more times, right?
>> Then followed by a colon character ":" and a space, and what is the 
>> next star after that? It means that the sequence before that again can 
>> repeat 0 or more times?
>> And what are the double square brackets?
>> Is ":alnum:" specific to R? I don't think "regexr.com" understands 
>> that. Or maybe that site is for regular expressions in Javascript, and 
>> the syntax is different in R?
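
The pieces asked about here can be tried one at a time at the R prompt. A small sketch with made-up test strings: "[:alnum:]" is a POSIX character class (letters and digits) that must sit inside a bracket expression, hence the double brackets, and " *" means zero or more spaces. It is not specific to R, but JavaScript's regex syntax has no POSIX classes, which would explain a JavaScript-based tester not matching.

```r
# [[:alnum:]]* matches a run of letters/digits, possibly empty.
alnum_only <- grepl("^[[:alnum:]]*$", c("L5N1H9", "L5N 1H9", ""))

# ": *" is a colon followed by zero or more spaces.
spaces_gone <- sub(": *", ":", c("TYPE:   STU", "TYPE:STU"))

# \\w also accepts the underscore, [[:alnum:]] does not.
w_class     <- grepl("^\\w+$", "MEMBER_TYPE")
alnum_class <- grepl("^[[:alnum:]]+$", "MEMBER_TYPE")
```

Here alnum_only is TRUE, FALSE, TRUE (the space breaks the second string), both elements of spaces_gone become "TYPE:STU", and only w_class is TRUE for "MEMBER_TYPE".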
>> 
>> Thank you,
>> Steven
>> 
>> -----Original Message-----
>> From: Bert Gunter [mailto:bgunter.4567 at gmail.com]
>> Sent: Sunday, November 20, 2016 2:15 PM
>> To: Steven Nagy <nstefi at gmail.com>
>> Cc: R-help <r-help at r-project.org>
>> Subject: Re: [R] Need some help with regular expression
>> 
>> If I understand you correctly, I think you are making it more complex 
>> than necessary. Using your example (thanks!!), the following should 
>> get you
>> started:
>> 
>> 
>>> x<- c("Name.MEMBER_TYPE: NMA -> STU ; CATEGORY:  -> 1 ; CITY:
>>> MISSISSAUGA -> Mississauga ; ZIP: L5N1H9 -> L5N 1H9 ; COUNTRY: CAN -> 
>>> ; MEMBER_STATUS:  -> N", "Name.MEMBER_TYPE: STU -> REG ; CATEGORY: 1
>>> ->","Name.MEMBER_TYPE: -> STU")
>>> 
>>> x
>> [1] "Name.MEMBER_TYPE: NMA -> STU ; CATEGORY:  -> 1 ; CITY:
>> MISSISSAUGA -> Mississauga ; ZIP: L5N1H9 -> L5N 1H9 ; COUNTRY: CAN -> 
>> ;
>> MEMBER_STATUS:  -> N"
>> 
>> [2] "Name.MEMBER_TYPE: STU -> REG ; CATEGORY: 1 ->"
>> [3] "Name.MEMBER_TYPE: -> STU"
>>> 
>>> sub(".*: *([[:alnum:]]* *-> *STU|STU *-> *[[:alnum:]]*).*","\\1",x)
>> [1] "NMA -> STU" "STU -> REG" "-> STU"
>> 
>> 
>> I am sure that you can get things to the form you desire in one go 
>> with some fiddling of the above, but it was easier for me to write the 
>> regex to pick out the pieces you wanted and leave the rest to you.
>> Others may have slicker ways to do it, of course.
>> 
>> HTH
>> 
>> Cheers,
>> Bert
>> 
>> 
>> Bert Gunter
>> 
>> "The trouble with having an open mind is that people keep coming along 
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>> 
>> 
>> On Sat, Nov 19, 2016 at 8:06 PM, Steven Nagy <nstefi at gmail.com> wrote:
>>> I tried out a regular expression on this website:
>>> 
>>> http://regexr.com/3en1m
>>> 
>>> 
>>> 
>>> So the input text is:
>>> 
>>> "Name.MEMBER_TYPE:  -> STU"
>>> 
>>> 
>>> 
>>> The regular expression is: ((?:\w+|\s) -> STU|STU -> (?:\w+|\s))
>>> 
>>> And it returns:
>>> 
>>> "  -> STU"
>>> 
>>> 
>>> 
>>> but when I use it in R, it doesn't return the same result:
>>> 
>>> strapply(c, "((?:\\w+|\\s) -> STU|STU -> (?:\\w+|\\s))", c, backref = 
>>> -1, perl = TRUE)
>>> 
>>> returns:
>>> "Name.MEMBER_TYPE: -> STU"
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Here is what I was trying to do:
>>> 
>>> 
>>> 
>>> I need to extract some values from a log table, and I created a 
>>> regular expression that helps me with that.
>>> 
>>> The log table has cells with values like:
>>> 
>>> a = "Name.MEMBER_TYPE: NMA -> STU ; CATEGORY:  -> 1 ; CITY:
>>> MISSISSAUGA -> Mississauga ; ZIP: L5N1H9 -> L5N 1H9 ; COUNTRY: CAN -> 
>>> ; MEMBER_STATUS:  -> N"
>>> 
>>> or
>>> b = "Name.MEMBER_TYPE: STU -> REG ; CATEGORY: 1 ->"
>>> 
>>> so I needed to extract the values that a STU member type is changing 
>>> from and to: NMA, STU in the 1st case or STU, REG in the 
>>> 2nd case.
>>> 
>>> I came up with this expression which worked in both cases:
>>> 
>>> strapply(strapply(a, "(\\w+ -> STU|STU -> \\w+)", c, backref = -1,
>>> perl = TRUE), "(\\w+) -> (\\w+)", c, backref = -2, perl = TRUE)
>>> 
>>> 
>>> 
>>> But I had a 3rd case when the source member type was blank:
>>> 
>>> c = "Name.MEMBER_TYPE: -> STU"
>>> 
>>> and in that case it returned an error:
>>> 
>>> strapply(strapply(c, "(\\w+ -> STU|STU -> \\w+)", c, backref = -1,
>>> perl = TRUE), "(\\w+) -> (\\w+)", c, backref = -2, perl = TRUE)
>>> 
>>> Error: is.character(x) is not TRUE
>>> 
>>> 
>>> 
>>> I found that the error is because this returns NULL:
>>> 
>>> strapply(c, "(\\w+ -> STU|STU -> \\w+)", c, backref = -1,
>>> perl = TRUE)
>>> 
>>> 
>>> 
>>> 
>>> 
>>> So I tried to modify the regular expression to match any word or 
>>> blank
>>> space:
>>> 
>>> strapply(c, "((?:\\w+|\\s) -> STU|STU -> (?:\\w+|\\s))", c, backref = 
>>> -1, perl = TRUE)
>>> 
>>> 
>>> 
>>> but this returned the whole value of "c":
>>> 
>>> "Name.MEMBER_TYPE:  -> STU"
>>> 
>>> and I only needed "  -> STU" as it shows on the website regexr.com
>>> 
>>> 
>>> 
>>> Is the result wrong on the regexr.com website, or does strapply return 
>>> the wrong result?
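
One way to separate the two possibilities is to run the same pattern through base R with perl = TRUE, leaving gsubfn out of it. The sketch below does that for both spellings of the input that appear in this thread (one space and two spaces after the colon; the difference may just be mail formatting). It also shows a side effect of naming a variable `c`: passed as an argument, `c` then refers to the string and not to the base function. Whether that accounts for the strapply result is an untested guess, since how gsubfn handles a character string in that position was not checked here. The names `s2`, `s1`, `pat` and `is_fun` are illustrative only.

```r
pat <- "((?:\\w+|\\s) -> STU|STU -> (?:\\w+|\\s))"

# Two spaces after the colon: \\s takes the first space and the literal
# " -> STU" takes the rest.
s2 <- "Name.MEMBER_TYPE:  -> STU"
m2 <- regmatches(s2, regexpr(pat, s2, perl = TRUE))

# One space after the colon: nothing is left for the literal leading space,
# so there is no match at all.
s1 <- "Name.MEMBER_TYPE: -> STU"
m1 <- regmatches(s1, regexpr(pat, s1, perl = TRUE))

# A variable called c hides the function c when it is passed as an argument.
is_fun <- function(FUN) is.function(FUN)
c <- s1
shadowed <- is_fun(c)   # the argument is the string, not base::c
rm(c)
restored <- is_fun(c)   # with the variable gone, c is the function again
```

Here m2 is "  -> STU", matching what the website reports, m1 is empty, shadowed is FALSE and restored is TRUE. So the pattern itself behaves the same in R as on the website for the two-space input.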
>>> 
>>> 
>>> 
>>> Thanks,
>>> 
>>> Steven
>>> 
>>> 
>>>          [[alternative HTML version deleted]]
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
> 

David Winsemius
Alameda, CA, USA


