[R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

Greg Snow Greg.Snow at intermountainmail.org
Tue Jul 25 18:56:37 CEST 2006


Using regular expression matching for this case may be overkill (the RE
engine will do a lot of backtracking over a lot of non-matches).  Here is
an alternative that splits the text into a vector of words, extracts the
last 2 letters of each word (if the last 3 letters of two words match,
then their last 2 must match too, so we only need to compare the last 2),
checks all pairwise comparisons for matches, and then pastes everything
back together with the matches marked:

text<-"And this is a second rand  sentence"

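# split the text into words on single spaces
tmp1 <- strsplit(text, ' ')[[1]]
# take the last 2 characters of each word (fewer if the word is shorter)
tmp2 <- nchar(tmp1)
tmp3 <- substr(tmp1,tmp2-1,tmp2)

# enumerate every pair of word positions (row > column),
# then flag the pairs whose endings are identical
tmp4 <- which(lower.tri(diag(length(tmp3))), arr.ind=TRUE)
tmp5 <- tmp3[ tmp4[,1] ] == tmp3[ tmp4[,2] ]

# build one tag string per word; each matching pair gets its own number
tmp6 <- rep('', length(tmp1))
count <- 1
for( i in which(tmp5) ){
	tmp6[ tmp4[i,1] ] <- paste(tmp6[ tmp4[i,1] ], '<r',count,'>', sep='')
	tmp6[ tmp4[i,2] ] <- paste(tmp6[ tmp4[i,2] ], '<r',count,'>', sep='')
	count <- count + 1
}

# attach the tags to the words and rebuild the sentence
out.text <- paste( tmp1,tmp6, sep='',collapse=' ')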
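
The same idea can be wrapped in a function.  What follows is only a
rough sketch, not the code above: the name mark_rhymes and its argument n
are made up for illustration, it uses a plain <r> tag (as in the original
question) instead of numbered tags, and it assumes that words are
separated by single spaces, that trailing punctuation should be ignored,
that case should not matter, and that words shorter than n letters should
never count as a match.

```r
# Hypothetical helper (not from the original post): tag every word whose
# last n letters also end some other word in the same sentence.
mark_rhymes <- function(text, n = 2) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  # drop trailing non-letters (e.g. "dog." -> "dog") before comparing
  core <- sub("[^[:alpha:]]+$", "", words)
  len <- nchar(core)
  # words shorter than n letters get NA so that they can never match
  ending <- ifelse(len >= n, tolower(substr(core, len - n + 1, len)), NA)

  # TRUE where two different word positions share the same ending
  same <- outer(ending, ending, "==")
  same[is.na(same)] <- FALSE
  diag(same) <- FALSE
  hit <- rowSums(same) > 0

  # put the tag right after the letters, before any trailing punctuation
  trailing <- substr(words, len + 1, nchar(words))
  paste(ifelse(hit, paste0(core, "<r>", trailing), words), collapse = " ")
}

mark_rhymes("This is my dog.")
# [1] "This<r> is<r> my dog."
mark_rhymes("And this is the second sentence")
# [1] "And<r> this<r> is<r> the second<r> sentence"
mark_rhymes("this is an example sentence.")
# [1] "this<r> is<r> an example sentence."
```

Comparing with outer() looks at every pair of words, so it scales with
the square of the number of words; that should be fine for single
sentences but may be slow for long texts.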


If you are doing a lot of text processing like this, I would suggest
doing it in Perl rather than R.  S Poetry by Dr. Burns has a function to
take a vector of character strings in R and run a Perl script on it and
return the results.

Hope this helps,




-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
 

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Stefan Th. Gries
Sent: Saturday, July 22, 2006 7:49 PM
To: r-help at stat.math.ethz.ch
Subject: [R] RfW 2.3.1: regular expressions to detect pairs of identical
word-final character sequences

Dear all

I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine
and I have two related regular expression problems.

platform       i386-pc-mingw32           
arch           i386                      
os             mingw32                   
system         i386, mingw32             
status                                   
major          2                         
minor          3.1                       
year           2006                      
month          06                        
day            01                        
svn rev        38247                     
language       R                         
version.string Version 2.3.1 (2006-06-01)


I would like to find cases of words in elements of character vectors
that end in the same character sequences; if I find such cases, I want
to add <r> to both potentially rhyming sequences. An example:

INPUT: This is my dog.
DESIRED OUTPUT: This<r> is<r> my dog.

I found a solution for cases where the potentially rhyming words are
adjacent:

text<-"This is my dog."
gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

However, with another text vector, I came across two problems I cannot
seem to solve and for which I would love to get some input.

(i) While I know what to do for non-adjacent words in general

gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3",
"This not is my dog", perl=TRUE) # I know this is not proper English ;-)

this runs into problems with overlapping matches:

text<-"And this is the second sentence"
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
[1] "And<r> this is the second<r> sentence"

It finds the "nd" match, but since the "is" match is within the two
"nd"'s, it doesn't get it. Any ideas on how to get all pairwise matches?

(ii) How would one tell R to match only when there are 2+ characters
matching? If the above expression is applied to another character string

text<-"this is an example sentence."
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

it also matches the "e"'s at the end of example and sentence. It's not
possible to get rid of that by specifying a range such as {2,}

text<-"this is an example sentence."
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text,
perl=TRUE)

because, as I understand it, this requires the 2+ cases of \\w to be
identical characters:

text<-"doo yoo see mee?"
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text,
perl=TRUE)

Again, any ideas?

I'd really appreciate any snippets of code, pointers, etc.
Thanks so much,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries



