[R] element wise pattern recognition and string substitution

Jun Shen jun.shen.ut at gmail.com
Sat Sep 10 06:06:01 CEST 2016


Hi Jeff,

I tried several methods and found your approach to be the most
efficient. I have now resolved the string-parsing problem, so let me
report back to the group.

The following example shows what I was trying to achieve.

melt.results holds the strings to be parsed, and testdata is a snippet of
the data from which the unique values of each variable are derived.
replace.metaChar is a function I defined to escape regex metacharacters.
Thanks to everyone for the help; I would appreciate any comments.

Jun
################################################################
melt.results <- structure(list(param = c("Cmin1", "Cminss", "Cmaxss",
"Cmin1",
"Cminss", "Cmin1", "Cminss", "Cmaxss", "Cmin1", "Cminss"), variable =
structure(c(1L,
5L, 9L, 14L, 18L, 21L, 25L, 29L, 34L, 38L), .Label =
c("240.mg.>110.kg.geo.mean",

"240.mg.>110.kg.cv", "240.mg.>110.kg.P05", "240.mg.>110.kg.P95",
"3.mg.kg.>110.kg.geo.mean", "3.mg.kg.>110.kg.cv", "3.mg.kg.>110.kg.P05",
"3.mg.kg.>110.kg.P95", "240.mg.>50-70.kg.geo.mean", "240.mg.>50-70.kg.cv",
"240.mg.>50-70.kg.P05", "240.mg.>50-70.kg.P95", "3.mg.kg.>50-70.kg.geo.mean",

"3.mg.kg.>50-70.kg.cv", "3.mg.kg.>50-70.kg.P05", "3.mg.kg.>50-70.kg.P95",
"240.mg.50.kg.or.less.geo.mean", "240.mg.50.kg.or.less.cv",
"240.mg.50.kg.or.less.P05",
"240.mg.50.kg.or.less.P95", "3.mg.kg.50.kg.or.less.geo.mean",
"3.mg.kg.50.kg.or.less.cv", "3.mg.kg.50.kg.or.less.P05",
"3.mg.kg.50.kg.or.less.P95",
"240.mg.>70-90.kg.geo.mean", "240.mg.>70-90.kg.cv", "240.mg.>70-90.kg.P05",
"240.mg.>70-90.kg.P95", "3.mg.kg.>70-90.kg.geo.mean", "3.mg.kg.>70-90.kg.cv",

"3.mg.kg.>70-90.kg.P05", "3.mg.kg.>70-90.kg.P95", "240.mg.>90-110.kg.geo.mean",

"240.mg.>90-110.kg.cv", "240.mg.>90-110.kg.P05", "240.mg.>90-110.kg.P95",
"3.mg.kg.>90-110.kg.geo.mean", "3.mg.kg.>90-110.kg.cv",
"3.mg.kg.>90-110.kg.P05",

"3.mg.kg.>90-110.kg.P95"), class = "factor"), value = c(97L,
144L, 76L, 137L, 18L, 104L, 92L, 87L, 111L, 41L)), .Names = c("param",
"variable", "value"), row.names = c(1L, 14L, 27L, 40L, 53L, 61L,
74L, 87L, 100L, 113L), class = "data.frame")

testdata <- structure(list(TX = c("240.mg", "3.mg.kg", "240.mg", "3.mg.kg",
"240.mg", "3.mg.kg", "240.mg", "3.mg.kg", "240.mg", "3.mg.kg"
), WTCUT = c(">50-70.kg", ">50-70.kg", ">70-90.kg", ">70-90.kg",
">90-110.kg", ">90-110.kg", "50.kg.or.less", "50.kg.or.less",
">110.kg", ">110.kg")), .Names = c("TX", "WTCUT"), row.names = c(1L,
2L, 7L, 8L, 19L, 20L, 21L, 22L, 129L, 130L), class = "data.frame")

# Escape regex metacharacters so that each value is matched literally.
replace.metaChar <- function(string) {
  metaChar <- c("\\$", "\\*", "\\+", "\\.", "\\?", "\\[", "\\]", "\\^",
                "\\{", "\\}", "\\|", "\\(", "\\)", "\\\\")
  metaReplace <- paste('\\', metaChar, sep='')
  for (r in seq_along(metaChar))
    string <- gsub(metaChar[r], metaReplace[r], string)
  return(string)
}

sort.var <- c('TX','WTCUT')

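# lapply keeps one list element per variable even if all variables have
# the same number of unique values (sapply would simplify to a matrix).
one.pattern <- paste('\\b', paste(sapply(lapply(sort.var,
function(x) replace.metaChar(unique(testdata[[x]]))), function(y)
paste('(', paste(y, collapse='|'), ')', sep='')), collapse='\\.'), '\\.(.*)',
sep='')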

n.sort.var <- length(sort.var)
one.replacement <- paste('\\', seq(n.sort.var+1), collapse='\t', sep='')
one.results <- strsplit(sub(one.pattern, one.replacement,
melt.results$variable), split='\t')

melt.results[c(sort.var,'STATS')] <- as.data.frame(do.call(rbind,
one.results))
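
A compact variant of the same idea, offered as a sketch rather than a
definitive implementation: the helper below escapes all metacharacters in a
single gsub() call, and lapply() keeps one capture group per variable. The
names escape_regex, lookup, groups and parts are illustrative and not part of
the thread; lookup is a small stand-in for testdata, and perl = TRUE is
assumed so that the backslash escapes inside the character class follow PCRE
rules.

```r
# Hypothetical helper (not from the thread): prefix every regex
# metacharacter with a backslash in one pass.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Small stand-in for testdata; the values are copied from the post.
lookup <- list(
  TX    = c("240.mg", "3.mg.kg"),
  WTCUT = c(">50-70.kg", ">110.kg", "50.kg.or.less")
)

# One capture group per variable: "(value1|value2|...)".
groups <- vapply(
  lapply(lookup, function(v) escape_regex(unique(v))),
  function(alts) paste0("(", paste(alts, collapse = "|"), ")"),
  character(1)
)

# Groups are joined by a literal dot; the last group captures the statistic.
pattern <- paste0("^", paste(groups, collapse = "\\."), "\\.(.*)$")

x <- c("240.mg.>110.kg.geo.mean", "3.mg.kg.50.kg.or.less.P05")

# Rewrite each string as tab-separated fields, then split on the tabs.
parts <- do.call(rbind, strsplit(sub(pattern, "\\1\t\\2\t\\3", x), "\t"))
colnames(parts) <- c(names(lookup), "STATS")
print(parts)
```

Anchoring with ^ and $ is a design choice here: a string that does not match
is returned unchanged by sub() instead of being partially rewritten, which
makes unmatched inputs easier to spot.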

On Wed, Sep 7, 2016 at 3:04 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:

> Here are some suggestions:
>
> test.string <- c( '240.m.g.>110.kg.geo.mean'
>                 , '3.mg.kg.>110.kg.P05'
>                 , '240.m.g.>50-70.kg.geo.mean'
>                 )
> # based on your literal idea
> suggested.pattern1 <-
>   "(240\\.m\\.g|3\\.mg\\.kg)\\.(>50-70\\.kg|>70-90\\.kg|>90-110\\.kg|50\\.kg\\.or\\.less|>110\\.kg)\\.(.*)"
>
> resultL <- strsplit( sub( suggested.pattern1
>                         , "\\1\t\\2\t\\3"
>                         , test.string )
>                    , split = "\t"
>                    )
>
> # equivalent based on apparent repetitive patterns in your sample data
> suggested.pattern2 <- "(.*?m\\.g|kg)\\.(.*?kg|.*?less)\\.(.*)"
>
> resultL2 <- strsplit( sub( suggested.pattern2
>                          , "\\1\t\\2\t\\3"
>                          , test.string
>                          )
>                     , split = "\t"
>                     )
>
> # put results into an organized table
> DF <- setNames( data.frame( do.call( rbind, resultL ) )
>               , c( "First", "Second", "Third" )
>               )
>
> By the way... please aim to make your examples reproducible. It would have
> been easy for you to define the necessary variables in example form
> rather than sending a non-reproducible example.
>
>
> On Tue, 6 Sep 2016, Jun Shen wrote:
>
>> Hi Jeff,
>>
>> Thanks for the reply. I tried your suggestion and it doesn't seem to work.
>> However, a simple pattern such as the following works as expected:
>>
>> sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\1', "3.mg.kg.>50-70.kg.P05")
>> [1] "3.mg.kg"
>>
>> sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\2', "3.mg.kg.>50-70.kg.P05")
>> [1] ">50-70.kg"
>>
>> sub("(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)", '\\3', "3.mg.kg.>50-70.kg.P05")
>> [1] "P05"
>>
>> My problem is that the pattern has to be constructed dynamically from the
>> input data of the function I am writing. It's actually not too difficult
>> to assemble the final.pattern with some code like the following:
>>
>> sort.var <- c('TX','WTCUT')
>> combn.sort.var <- do.call(expand.grid, lapply(sort.var,
>> function(x)paste('(',gsub('\\.','\\\\.',unlist(unique(all.exposure[x]))),
>> ')',
>> sep='')))
>> all.patterns <- do.call(paste, c(combn.sort.var, '(.*)', sep='\\.'))
>> final.pattern <- paste0(all.patterns, collapse='|')
>>
>> You cannot run the code directly since the data object "all.exposure" is
>> not provided here.
>>
>> Jun
>>
>>
>>
>> On Tue, Sep 6, 2016 at 8:18 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>>       I am not near my computer today, but each parenthesis gets its own
>>       result number, so you should put the parentheses around the whole
>>       pattern of alternatives instead of having many parentheses.
>>
>>       I recommend thinking in terms of what common information you expect
>>       to find in these various strings, and place your parentheses to
>>       capture that information. There is no other reason to put
>>       parentheses in the pattern... they are not grouping symbols.
>>       --
>>       Sent from my phone. Please excuse my brevity.
>>
>>       On September 6, 2016 5:01:04 PM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>       >Jun:
>>       >
>>       >1. Tell us your desired result from your test vector and maybe
>>       >someone will help.
>>       >
>>       >2. As we played this game once already (you couldn't do it; I
>>       >showed you how), this seems to be a function of your limitations
>>       >with regular expressions. I'm probably not much better, but in any
>>       >case, I don't intend to be your consultant. See if you can find
>>       >someone locally to help you if you do not receive a satisfactory
>>       >reply from the list. There are many people here who are pretty good
>>       >at this sort of thing, but I don't know if they'll reply. Regexes
>>       >are certainly complex. Perl people tend to be pretty good at them,
>>       >I believe. There are numerous web sites and books on them if you
>>       >need to acquire expertise for your work.
>>       >
>>       >Cheers,
>>       >Bert
>>       >Bert Gunter
>>       >
>>       >"The trouble with having an open mind is that people keep coming
>>       >along and sticking things into it."
>>       >-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>       >
>>       >
>>       >On Tue, Sep 6, 2016 at 3:59 PM, Jun Shen <jun.shen.ut at gmail.com> wrote:
>>       >> Hi Bert,
>>       >>
>>       >> I still couldn't make the multiple patterns work. Here is an
>>       >> example. I build the pattern as follows:
>>       >>
>>       >> final.pattern <- paste0(
>>       >>   "(240\\.m\\.g)\\.(>50-70\\.kg)\\.(.*)|",
>>       >>   "(3\\.mg\\.kg)\\.(>50-70\\.kg)\\.(.*)|",
>>       >>   "(240\\.m\\.g)\\.(>70-90\\.kg)\\.(.*)|",
>>       >>   "(3\\.mg\\.kg)\\.(>70-90\\.kg)\\.(.*)|",
>>       >>   "(240\\.m\\.g)\\.(>90-110\\.kg)\\.(.*)|",
>>       >>   "(3\\.mg\\.kg)\\.(>90-110\\.kg)\\.(.*)|",
>>       >>   "(240\\.m\\.g)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
>>       >>   "(3\\.mg\\.kg)\\.(50\\.kg\\.or\\.less)\\.(.*)|",
>>       >>   "(240\\.m\\.g)\\.(>110\\.kg)\\.(.*)|",
>>       >>   "(3\\.mg\\.kg)\\.(>110\\.kg)\\.(.*)")
>>       >>
>>       >> test.string <- c('240.m.g.>110.kg.geo.mean', '3.mg.kg.>110.kg.P05',
>>       >> '240.m.g.>50-70.kg.geo.mean')
>>       >>
>>       >> sub(final.pattern, '\\1', test.string)
>>       >> sub(final.pattern, '\\2', test.string)
>>       >> sub(final.pattern, '\\3', test.string)
>>       >>
>>       >> Only the third string is parsed correctly; it matches the first
>>       >> pattern. It seems the rest of the patterns are not used.
>>       >>
>>       >> Jun
>>       >>
>>       >>
>>       >> On Mon, Sep 5, 2016 at 10:21 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>       >>>
>>       >>> Just noticed: My clumsy do.call() line in my previously posted
>>       >>> code below should be replaced with:
>>       >>> pat <- paste(pat,collapse = "|")
>>       >>>
>>       >>>
>>       >>> > pat <- c(pat1,pat2)
>>       >>> > paste(pat,collapse="|")
>>       >>> [1] "a+\\.*a+|b+\\.*b+"
>>       >>>
>>       >>> ************ replace this **************************
>>       >>> > pat <- do.call(paste,c(as.list(pat), sep="|"))
>>       >>> ********************************************
>>       >>> > sub(paste0("^[^b]*(",pat,").*$"),"\\1",z)
>>       >>> [1] "a.a"   "bb"    "b.bbb"
>>       >>>
>>       >>>
>>       >>> -- Bert
>>       >>> Bert Gunter
>>       >>>
>>       >>>
>>       >>>
>>       >>> On Mon, Sep 5, 2016 at 12:11 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>       >>> > Jun:
>>       >>> >
>>       >>> > You need to provide a clear specification via regular
>>       >>> > expressions of the patterns you wish to match -- at least for me
>>       >>> > to decipher it. Others may be smarter than I, though...
>>       >>> >
>>       >>> > Jeff: Thanks. I have now convinced myself that it can be done (a
>>       >>> > "proof" of sorts): If pat1, pat2, ..., patm are m different
>>       >>> > patterns (in a vector of patterns) to be matched in a vector of n
>>       >>> > strings, where only one of the patterns will match in any string,
>>       >>> > then use paste() (probably via do.call()) or otherwise to paste
>>       >>> > them together separated by "|" to form the concatenated pattern,
>>       >>> > pat. Then
>>       >>> >
>>       >>> > sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)
>>       >>> >
>>       >>> > should extract the matching pattern in each (perhaps with a little
>>       >>> > fiddling due to precedence rules); e.g.
>>       >>> >
>>       >>> >> z <-c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
>>       >>> >
>>       >>> >> pat1 <- "a+\\.*a+"
>>       >>> >> pat2 <-"b+\\.*b+"
>>       >>> >> pat <- c(pat1,pat2)
>>       >>> >
>>       >>> >> pat <- do.call(paste,c(as.list(pat), sep="|"))
>>       >>> >> pat
>>       >>> > [1] "a+\\.*a+|b+\\.*b+"
>>       >>> >
>>       >>> >> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
>>       >>> > [1] "a.a"   "bb"    "b.bbb"
>>       >>> >
>>       >>> > Cheers,
>>       >>> > Bert
>>       >>> >
>>       >>> >
>>       >>> > Bert Gunter
>>       >>> >
>>       >>> >
>>       >>> >
>>       >>> > On Mon, Sep 5, 2016 at 9:56 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
>>       >>> >> Thanks for the reply, Bert.
>>       >>> >>
>>       >>> >> Your solution solves the example. I actually have a more general
>>       >>> >> situation, where the dot-concatenated string is built from
>>       >>> >> multiple variables. The problem is that the values of those
>>       >>> >> variables may themselves contain dots, and the number of dots is
>>       >>> >> not consistent across the values of a variable. So I am thinking
>>       >>> >> of defining a vector of patterns, one per string, and hopefully
>>       >>> >> finding a way to apply each pattern to the corresponding element
>>       >>> >> of the string vector. The only way I can think of is a "for"
>>       >>> >> loop, which can be slow. Also, this happens inside a function I
>>       >>> >> am writing. I just wonder if there is a more efficient way.
>>       >>> >> Thanks a lot.
>>       >>> >>
>>       >>> >> Jun
>>       >>> >>
>>       >>> >> On Mon, Sep 5, 2016 at 1:41 AM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>       >>> >>>
>>       >>> >>> Well, he did provide an example, and...
>>       >>> >>>
>>       >>> >>>
>>       >>> >>> > z <- c('TX.WT.CUT.mean','mg.tx.cv')
>>       >>> >>>
>>       >>> >>> > sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
>>       >>> >>> [1] "WT.CUT" "tx"
>>       >>> >>>
>>       >>> >>>
>>       >>> >>> ## seems to do what was requested.
>>       >>> >>>
>>       >>> >>> Jeff would have to amplify on his initial statement, however: do
>>       >>> >>> you mean that separate patterns can always be combined via "|"?
>>       >>> >>> Or something deeper?
>>       >>> >>>
>>       >>> >>> Cheers,
>>       >>> >>> Bert
>>       >>> >>> Bert Gunter
>>       >>> >>>
>>       >>> >>>
>>       >>> >>>
>>       >>> >>> On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>>       >>> >>> > Your opening assertion is false.
>>       >>> >>> >
>>       >>> >>> > Provide a reproducible example and someone will
>> demonstrate.
>>       >>> >>> > --
>>       >>> >>> > Sent from my phone. Please excuse my brevity.
>>       >>> >>> >
>>       >>> >>> > On September 4, 2016 9:06:59 PM PDT, Jun Shen <jun.shen.ut at gmail.com> wrote:
>>       >>> >>> >>Dear list,
>>       >>> >>> >>
>>       >>> >>> >>I have a vector of strings that cannot be described by one
>>       >>> >>> >>pattern. So let's say I construct a vector of patterns of the
>>       >>> >>> >>same length as the vector of strings. Can I do element-wise
>>       >>> >>> >>pattern recognition and string substitution?
>>       >>> >>> >>
>>       >>> >>> >>For example,
>>       >>> >>> >>
>>       >>> >>> >>pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
>>       >>> >>> >>pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
>>       >>> >>> >>
>>       >>> >>> >>patterns <- c(pattern1,pattern2)
>>       >>> >>> >>strings <- c('TX.WT.CUT.mean','mg.tx.cv')
>>       >>> >>> >>
>>       >>> >>> >>Say I want to extract "WT.CUT" from the first string and "tx"
>>       >>> >>> >>from the second string. If I do
>>       >>> >>> >>
>>       >>> >>> >>sub(patterns, '\\2', strings), only the first pattern will be
>>       >>> >>> >>used.
>>       >>> >>> >>
>>       >>> >>> >>Looping over the patterns doesn't work the way I want. I would
>>       >>> >>> >>appreciate any comments.
>>       >>> >>> >>Thanks.
>>       >>> >>> >>
>>       >>> >>> >>Jun
>>       >>> >>> >>
>>       >>> >>> >>
>>       >>> >>> >>______________________________________________
>>       >>> >>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and
>> more,
>>       >see
>>       >>> >>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>>       >>> >>> >>PLEASE do read the posting guide
>>       >>> >>> >>http://www.R-project.org/posting-guide.html
>>       >>> >>> >>and provide commented, minimal, self-contained,
>> reproducible
>>       >code.
>>       >>> >>> >
>>       >>> >>
>>       >>> >>
>>       >>
>>       >>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
