[R] element wise pattern recognition and string substitution
Bert Gunter
bgunter.4567 at gmail.com
Wed Sep 7 05:56:30 CEST 2016
Jeff:
Not sure what you meant by this:
"There is no other reason to put parentheses in the pattern... they
are not grouping symbols."
... but in fact, from ?regexp
"Repetition takes precedence over concatenation, which in turn takes
precedence over alternation. A whole subexpression may be enclosed in
parentheses to override these precedence rules. "
So parentheses *are* in fact "grouping symbols."
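
A minimal sketch of that precedence rule, using a made-up vector x (not from the thread): without parentheses the anchors bind to the individual alternatives, while with parentheses they apply to the whole alternation.

```r
# Hypothetical test vector, chosen only to illustrate precedence.
x <- c("ab", "abd", "acd", "cd")

# Alternation has the lowest precedence, so this reads as "^ab" OR "cd$".
loose <- grepl("^ab|cd$", x)

# The parentheses group the alternation, so both anchors apply to it:
# the whole string must be exactly "ab" or exactly "cd".
grouped <- grepl("^(ab|cd)$", x)

loose    # TRUE TRUE TRUE TRUE
grouped  # TRUE FALSE FALSE TRUE
```

The same parentheses also create a numbered capture group, so the two roles (grouping and capturing) come together in the default regex syntax.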
On Tue, Sep 6, 2016 at 5:18 PM, Jeff Newmiller wrote:
> I am not near my computer today, but each parenthesis gets its own result number, so you should put the parenthesis around the whole pattern of alternatives instead of having many parentheses.
>
> I recommend thinking in terms of what common information you expect to find in these various strings, and place your parentheses to capture that information. There is no other reason to put parentheses in the pattern... they are not grouping symbols.
On September 6, 2016 5:01:04 PM PDT, Bert Gunter wrote:
>>Jun:
>>
>>1. Tell us your desired result from your test vector and maybe someone
>>will help.
>>
>>2. As we played this game once already (you couldn't do it; I showed
>>you how), this seems to be a function of your limitations with regular
>>expressions. I'm probably not much better, but in any case, I don't
>>intend to be your consultant. See if you can find someone locally to
>>help you if you do not receive a satisfactory reply from the list.
>>There are many people here who are pretty good at this sort of thing,
>>but I don't know if they'll reply. Regex's are certainly complex. PERL
>>people tend to be pretty good at them, I believe. There are numerous
>>web sites and books on them if you need to acquire expertise for your
>>work.
On Tue, Sep 6, 2016 at 3:59 PM, Jun Shen wrote:
>>> Hi Bert,
>>>
>>> I still couldn't get the multiple patterns to work. Here is an
>>> example. I built the pattern as follows:
>>>
>>> final.pattern <-
>>> "(240\\.m\\.g)\\.(>50-70\\.kg)\\.(.*)|(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)|(240\\.m\\.g)\\.(>70-90\\.kg)\\.(.*)|(3\\.mg\\.kg)\\.(>70-90\\.kg)\\.(.*)|(240\\.m\\.g)\\.(>90-110\\.kg)\\.(.*)|(3\\.mg\\.kg)\\.(>90-110\\.kg)\\.(.*)|(240\\.m\\.g)\\.(50\\.kg\\.or\\.less)\\.(.*)|(3\\.mg\\.kg)\\.(50\\.kg\\.or\\.less)\\.(.*)|(240\\.m\\.g)\\.(>110\\.kg)\\.(.*)|(3\\.mg\\.kg)\\.(>110\\.kg)\\.(.*)"
>>> test.string <- c('240.m.g.>110.kg.geo.mean', '3.mg.kg.>110.kg.P05',
>>> '240.m.g.>50-70.kg.geo.mean')
>>>
>>> sub(final.pattern, '\\1', test.string)
>>> sub(final.pattern, '\\2', test.string)
>>> sub(final.pattern, '\\3', test.string)
>>> Only the third string is parsed correctly; it matches the first
>>> alternative. The remaining alternatives do not seem to be used.
>>>
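
A sketch of the fix suggested earlier in the thread (one parenthesis around each set of alternatives rather than three parentheses per alternative). In final.pattern the ten alternatives carry thirty groups in total, so '\\1' to '\\3' refer only to the first alternative and come back empty when any other alternative matches. The names dose, weight and grouped.pattern are hypothetical; the strings are the ones from the message above.

```r
# Alternatives for each field, taken from final.pattern (names are made up).
dose   <- c("240\\.m\\.g", "3\\.mg\\.kg")
weight <- c(">50-70\\.kg", ">70-90\\.kg", ">90-110\\.kg",
            "50\\.kg\\.or\\.less", ">110\\.kg")

# One capture group per field: group 1 = dose, group 2 = weight, group 3 = rest.
grouped.pattern <- paste0("^(", paste(dose, collapse = "|"), ")\\.",
                          "(", paste(weight, collapse = "|"), ")\\.",
                          "(.*)$")

test.string <- c('240.m.g.>110.kg.geo.mean', '3.mg.kg.>110.kg.P05',
                 '240.m.g.>50-70.kg.geo.mean')

field1 <- sub(grouped.pattern, '\\1', test.string)
field2 <- sub(grouped.pattern, '\\2', test.string)
field3 <- sub(grouped.pattern, '\\3', test.string)

field1  # "240.m.g"  "3.mg.kg" "240.m.g"
field2  # ">110.kg"   ">110.kg" ">50-70.kg"
field3  # "geo.mean" "P05"     "geo.mean"
```

Because every string now uses the same three group numbers, the replacement no longer depends on which alternative happened to match.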
On Mon, Sep 5, 2016 at 10:21 PM, Bert Gunter wrote:
>>>>
>>>> Just noticed: My clumsy do.call() line in my previously posted code
>>>> below should be replaced with:
>>>> pat <- paste(pat,collapse = "|")
>>>>
>>>> > pat <- c(pat1,pat2)
>>>> > paste(pat,collapse="|")
>>>> [1] "a+\\.*a+|b+\\.*b+"
>>>>
On Mon, Sep 5, 2016 at 12:11 PM, Bert Gunter wrote:
>>>> > Jun:
>>>> >
>>>> > You need to provide a clear specification via regular expressions of
>>>> > the patterns you wish to match -- at least for me to decipher it.
>>>> > Others may be smarter than I, though...
>>>> >
>>>> > Jeff: Thanks. I have now convinced myself that it can be done (a
>>>> > "proof" of sorts): If pat1, pat2, ..., patm are m different patterns
>>>> > (in a vector of patterns) to be matched in a vector of n strings,
>>>> > where only one of the patterns will match in any string, then use
>>>> > paste() (probably via do.call()) or otherwise to paste them together
>>>> > separated by "|" to form the concatenated pattern, pat. Then
>>>> >
>>>> > sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)
>>>> >
>>>> > should extract the matching pattern in each (perhaps with a little
>>>> > fiddling due to precedence rules); e.g.
>>>> >
>>>> >> z <-c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
>>>> >
>>>> >> pat1 <- "a+\\.*a+"
>>>> >> pat2 <-"b+\\.*b+"
>>>> >> pat <- c(pat1,pat2)
>>>> >
>>>> >> pat <- do.call(paste,c(as.list(pat), sep="|"))
>>>> >> pat
>>>> > [1] "a+\\.*a+|b+\\.*b+"
>>>> >
>>>> >> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
>>>> > [1] "a.a" "bb" "b.bbb"
>>>> >
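
As an alternative sketch (not what was posted above), regexpr() with regmatches() extracts the first match of the combined pattern directly, so no "^[^b]*" prefix has to be tailored to the data. It reuses z and the pasted pattern from the example and assumes every string contains a match, because regmatches() silently drops elements without one.

```r
z <- c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")

# Combine the individual patterns with "|", as in the message above.
pat <- paste(c("a+\\.*a+", "b+\\.*b+"), collapse = "|")

# regexpr() locates the first match in each string;
# regmatches() returns the matched substrings.
# Assumption: each element of z matches, otherwise the result is shorter than z.
extracted <- regmatches(z, regexpr(pat, z))

extracted  # "a.a" "bb" "b.bbb"
```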
On Mon, Sep 5, 2016 at 9:56 AM, Jun Shen wrote:
>>>> >> Thanks for the reply, Bert.
>>>> >>
>>>> >> Your solution solves the example. I actually have a more general
>>>> >> situation where I have this dot-concatenated string built from
>>>> >> multiple variables. The problem is that those variables may have
>>>> >> values containing dots, and the number of dots is not consistent
>>>> >> across the values of a variable. So I am thinking of defining a
>>>> >> vector of patterns for the vector of strings, and hopefully finding
>>>> >> a way to use one pattern from the pattern vector for each value of
>>>> >> the string vector. The only way I can think of is a "for" loop,
>>>> >> which can be slow. Also, this happens inside a function I am
>>>> >> writing. I just wonder if there is a more efficient way. Thanks a
>>>> >> lot.
On Mon, Sep 5, 2016 at 1:41 AM, Bert Gunter wrote:
>>>> >>>
>>>> >>> Well, he did provide an example, and...
>>>> >>>
>>>> >>>
>>>> >>> > z <- c('TX.WT.CUT.mean','mg.tx.cv')
>>>> >>>
>>>> >>> > sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
>>>> >>> [1] "WT.CUT" "tx"
>>>> >>>
>>>> >>>
>>>> >>> ## seems to do what was requested.
>>>> >>>
>>>> >>> Jeff would have to amplify on his initial statement, however: do you
>>>> >>> mean that separate patterns can always be combined via "|"? Or
>>>> >>> something deeper?
>>>> >>>
>>>> >>>
On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>>>> >>> > Your opening assertion is false.
>>>> >>> >
>>>> >>> > Provide a reproducible example and someone will demonstrate.
On September 4, 2016 9:06:59 PM PDT, Jun Shen <jun.shen.ut at gmail.com> wrote:
>>>> >>> >>Dear list,
>>>> >>> >>
>>>> >>> >>I have a vector of strings that cannot be described by one pattern.
>>>> >>> >>So let's say I construct a vector of patterns of the same length as
>>>> >>> >>the vector of strings. Can I do element-wise pattern recognition
>>>> >>> >>and string substitution?
>>>> >>> >>
>>>> >>> >>For example,
>>>> >>> >>
>>>> >>> >>pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
>>>> >>> >>pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
>>>> >>> >>
>>>> >>> >>patterns <- c(pattern1,pattern2)
>>>> >>> >>strings <- c('TX.WT.CUT.mean','mg.tx.cv')
>>>> >>> >>
>>>> >>> >>Say I want to extract "WT.CUT" from the first string and "tx" from
>>>> >>> >>the second string. If I do
>>>> >>> >>
>>>> >>> >>sub(patterns, '\\2', strings), only the first pattern will be used.
>>>> >>> >>
>>>> >>> >>Looping over the patterns doesn't work the way I want. I would
>>>> >>> >>appreciate any comments.
>>>> >>> >>Thanks.
>>>> >>> >>
>>>> >>> >>Jun
>>>> >>> >>
>>>> >>> >>
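
For the literal element-wise question, one possible sketch is mapply(), which calls sub() once per (pattern, string) pair. The variable result is a made-up name; the patterns and strings are the ones from the original post. This is not necessarily faster than an explicit loop, only more compact.

```r
pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"

patterns <- c(pattern1, pattern2)
strings  <- c('TX.WT.CUT.mean', 'mg.tx.cv')

# Pair the i-th pattern with the i-th string and keep the second group.
result <- mapply(function(p, s) sub(p, '\\2', s),
                 patterns, strings, USE.NAMES = FALSE)

result  # "WT.CUT" "tx"
```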
