[R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

Sun Jul 23 06:05:26 CEST 2006

The following requires more than just a single gsub but it does solve
the problem.  Modify to suit.

The first gsub places <...> around the first occurrence of each
duplicated suffix.  We use a (?=...) zero-width lookahead so that
the text between the two occurrences is checked but not consumed,
which circumvents the nesting (overlapping match) problem.
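
To see what the lookahead buys us, here is a minimal base R sketch that
compares a consuming pattern with the lookahead one on the same sentence.
The variable names consuming and lookahead are mine, chosen only for
illustration.

```r
text <- "And this is the second sentence"

# Consuming version: everything between the two "nd"s becomes part of the
# match, so the "is" pair nested inside it can no longer be found.
consuming <- gsub("(\\w+)(\\b.+)\\1\\b", "<\\1>\\2\\1", text, perl = TRUE)

# Lookahead version: only the suffix itself is consumed; the rest is merely
# checked, so matching resumes right after each marked suffix.
lookahead <- gsub("(\\w+)(?=\\b.+\\1\\b)", "<\\1>", text, perl = TRUE)

print(consuming)
print(lookahead)
```

The consuming pattern marks only "nd", while the lookahead pattern also
reaches the suffixes that sit between the two "nd"s.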

Then we use strapply from the gsubfn package to extract the
suffixes so marked and paste them together into an alternation.
That alternation is passed to a second gsub, which locates each
suffix at the end of a word in the original string and appends
an <r> to it.  Uncomment the commented pat if you only want to
match suffixes of 2+ characters.

library(gsubfn)
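# Step 1: place <...> around the first occurrence of each repeated suffix
text <- "And this is the second sentence"
pat <- "(\\w+)(?=\\b.+\\1\\b)"
# pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"  # suffixes of 2+ characters only
out <- gsub(pat, "<\\1>", text, perl = TRUE)

# Step 2: extract the marked suffixes; backref = 1 passes the match (x)
# and the one captured group (y) to the function explicitly
suff <- strapply(out, "<([^>]+)>", function(x, y) y, backref = 1)[[1]]
# Step 3: append <r> to every word-final occurrence in the original text
gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)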
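
If you would rather not depend on gsubfn, the same steps can be sketched
in base R.  This assumes a version of R that provides regmatches (it did
not exist in 2.3.1), and mark_rhymes and min_chars are hypothetical names
of mine, not part of any package.  Treat it as a sketch, not a tested
replacement for the code above.

```r
# Hypothetical helper: mark_rhymes() is an illustrative name only.
mark_rhymes <- function(text, min_chars = 1) {
  # Suffix of at least min_chars word characters that recurs later
  # at the end of another word (checked with a zero-width lookahead)
  pat <- sprintf("(\\w{%d,})(?=\\b.+\\1\\b)", min_chars)
  suff <- unique(regmatches(text, gregexpr(pat, text, perl = TRUE))[[1]])
  if (length(suff) == 0) return(text)  # no repeated suffix, leave text alone
  # Build (s1|s2|...)\b and append <r> to every word-final occurrence
  alt <- paste(suff, collapse = "|")
  gsub(paste("(", alt, ")\\b", sep = ""), "\\1<r>", text, perl = TRUE)
}

print(mark_rhymes("And this is the second sentence"))
print(mark_rhymes("And this is the second sentence", min_chars = 2))
```

Setting min_chars = 2 plays the same role as uncommenting the second pat
above: single-character suffixes such as the final "e" are then ignored.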

On 7/22/06, Stefan Th. Gries <stgries_lists at arcor.de> wrote:
> Dear all
>
> I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine and I have two related regular expression problems.
>
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status
> major          2
> minor          3.1
> year           2006
> month          06
> day            01
> svn rev        38247
> language       R
> version.string Version 2.3.1 (2006-06-01)
>
>
> I would like to find cases of words in elements of character vectors that end in the same character sequences; if I find such cases, I want to add <r> to both potentially rhyming sequences. An example:
>
> INPUT: This is my dog.
> DESIRED OUTPUT: This<r> is<r> my dog.
>
> I found a solution for cases where the potentially rhyming words are adjacent:
>
> text<-"This is my dog."
> gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> However, with another text vector, I came across two problems I cannot seem to solve and for which I would love to get some input.
>
> (i) While I know what to do for non-adjacent words in general
>
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog", perl=TRUE) # I know this is not proper English ;-)
>
> this runs into problems with overlapping matches:
>
> text<-"And this is the second sentence"
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> [1] "And<r> this is the second<r> sentence"
>
> It finds the "nd" match, but since the "is" match is within the two "nd"'s, it doesn't get it. Any ideas on how to get all pairwise matches?
>
> (ii) How would one tell R to match only when there are 2+ characters matching? If the above expression is applied to another character string
>
> text<-"this is an example sentence."
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> it also matches the "e"'s at the end of example and sentence. It's not possible to get rid of that by specifying a range such as {2,}
>
> text<-"this is an example sentence."
> gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> because, as I understand it, this requires the 2+ cases of \\w to be identical characters:
>
> text<-"doo yoo see mee?"
> gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> Again, any ideas?
>
> I'd really appreciate any snippets of code, pointers, etc.
> Thanks so much,
> STG
> --
> Stefan Th. Gries
> -----------------------------------------------
> University of California, Santa Barbara
> http://www.linguistics.ucsb.edu/faculty/stgries