Gabor Grothendieck ggrothendieck at gmail.com
Sun Jul 23 06:05:26 CEST 2006

The following requires more than just a single gsub but it does solve
the problem.  Modify to suit.

The first gsub places <...> around the first occurrence of each
repeated suffix.  The (?=...) zero-width lookahead checks that the
same suffix ends a later word without consuming any text, which
circumvents the nesting problem.
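
To see what the lookahead buys us, here is a small sketch in base R
(the names first_hits and marked are only for illustration).  Because
the lookahead consumes nothing, the "is" pair sitting inside the "nd"
pair is still found.

```r
text <- "And this is the second sentence"
pat <- "(\\w+)(?=\\b.+\\1\\b)"

# list the pieces the pattern matches, without replacing anything
first_hits <- regmatches(text, gregexpr(pat, text, perl = TRUE))[[1]]
print(first_hits)   # "nd" "is" "e"

# the same pattern used to mark those first occurrences
marked <- gsub(pat, "<\\1>", text, perl = TRUE)
print(marked)       # "A<nd> th<is> is th<e> second sentence"
```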

Then we use strapply from the gsubfn package to extract
the suffixes so marked and paste them together into an
alternation, which is passed to a second gsub that locates
them in the original string and appends <r> to each.
Uncomment the commented pat if you only want to match
suffixes of 2 or more characters.
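
If gsubfn is not at hand, the extraction step can probably be
approximated in base R.  The sketch below is a stand-in for strapply,
not the method used in this post; it starts from the marked string
that the first gsub should produce for the sample sentence and shows
the alternation that gets handed to the second gsub.

```r
# marked string as produced by the first step for the sample sentence
out <- "A<nd> th<is> is th<e> second sentence"

# pull out whatever sits between < and > (base R substitute for strapply)
suff <- regmatches(out, gregexpr("(?<=<)[^>]+(?=>)", out, perl = TRUE))[[1]]
print(suff)        # "nd" "is" "e"

# build the alternation used by the second gsub
alt <- paste("(", paste(suff, collapse = "|"), ")\\b", sep = "")
cat(alt, "\n")     # (nd|is|e)\b
```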

library(gsubfn)  # provides strapply

# places <...> around first occurrences of repeated suffixes
text <- "And this is the second sentence"
pat <- "(\\w+)(?=\\b.+\\1\\b)"
# pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
out <- gsub(pat, "<\\1>", text, perl = TRUE)

# extract the marked suffixes (y is the text between < and >),
# then append <r> wherever one of them ends a word in the original
suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)
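
For repeated use, the two steps could be wrapped up roughly as
follows.  This is only a sketch in base R: mark_rhymes and min_len
are hypothetical names, the suffixes are taken straight from the
lookahead matches rather than via strapply, and a guard is added for
strings in which nothing repeats (an empty alternation would
otherwise match at every word boundary).

```r
# mark_rhymes() and min_len are illustrative names, not from gsubfn or base R
mark_rhymes <- function(text, min_len = 1) {
  # suffix of at least min_len word characters that also ends a later word
  pat <- sprintf("(\\w{%d,})(?=\\b.+\\1\\b)", min_len)
  one <- function(s) {
    suff <- unique(regmatches(s, gregexpr(pat, s, perl = TRUE))[[1]])
    if (length(suff) == 0) return(s)  # nothing repeats: leave s unchanged
    alt <- paste("(", paste(suff, collapse = "|"), ")\\b", sep = "")
    gsub(alt, "\\1<r>", s, perl = TRUE)
  }
  vapply(text, one, character(1), USE.NAMES = FALSE)
}

print(mark_rhymes("And this is the second sentence"))
# "And<r> this<r> is<r> the<r> second<r> sentence<r>"
print(mark_rhymes("this is an example sentence.", min_len = 2))
# "this<r> is<r> an example sentence."
```

Note that every word ending in a repeated suffix gets tagged, so with
min_len = 1 a single shared letter is enough.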

On 7/22/06, Stefan Th. Gries <stgries_lists at arcor.de> wrote:
> Dear all
> I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine and I have two related regular expression problems.
> I would like to find cases of words in elements of character vectors that end in the same character sequences; if I find such cases, I want to add <r> to both potentially rhyming sequences. An example:
> INPUT: This is my dog.
> DESIRED OUTPUT: This<r> is<r> my dog.
> I found a solution for cases where the potentially rhyming words are adjacent:
> text<-"This is my dog."
> gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> However, with another text vector, I came across two problems I cannot seem to solve and for which I would love to get some input.
> (i) While I know what to do for non-adjacent words in general
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog", perl=TRUE) # I know this is not proper English ;-)
> this runs into problems with overlapping matches:
> text<-"And this is the second sentence"
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> [1] "And<r> this is the second<r> sentence"
> It finds the "nd" match, but since the "is" match is within the two "nd"'s, it doesn't get it. Any ideas on how to get all pairwise matches?
> (ii) How would one tell R to match only when there are 2+ characters matching? If the above expression is applied to another character string
> text<-"this is an example sentence."
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> it also matches the "e"'s at the end of example and sentence. It's not possible to get rid of that by specifying a range such as {2,}
> text<-"this is an example sentence."
> gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> because, as I understand it, this requires the 2+ cases of \\w to be identical characters:
> text<-"doo yoo see mee?"
> gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> Again, any ideas?
> I'd really appreciate any snippets of codes, pointers, etc.
> Thanks so much,
