[R] tm package: handling contractions

Fri Jan 27 15:50:51 CET 2012

I tried making a wordcloud of Obama's State of the Union address, using 
the tm package to process the text:

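library(tm)         # Corpus, tm_map, TermDocumentMatrix
library(wordcloud)  # wordcloud

# Read the speech as a vector of whitespace-separated tokens
sotu <- scan(file="c:/R/data/sotu2012.txt", what="character")
sotu <- tolower(sotu)
corp <- Corpus(VectorSource(paste(sotu, collapse=" ")))

# Clean up the text
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, function(x) removeWords(x, stopwords()))

# Word frequencies, sorted in decreasing order
tdm <- TermDocumentMatrix(corp)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)

wordcloud(d$word, d$freq)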

I ended up with a large number of contractions that were split at the 
"’" character, e.g., "don’t" --> "don'".
The words containing that character are:

 > sotu[grep("’", sotu)]
[1] "qaeda’s" "taliban’s" "america’s" "they’re" "don’t"
[6] "we’re" "aren’t" "we’ve" "patton’s" "what’s"
[11] "let’s" "weren’t," "couldn’t" "people’s" "didn’t"
[16] "we’ve" "we’ve" "we’ve" "i’m" "that’s"
[21] "world’s" "what’s" "can’t" "that’s" "it’s"
[26] "lock’s" "let’s" "you’re" "shouldn’t" "you’re"
[31] "you’re" "it’s" "i’ll" "we’re" "don’t"
[36] "we’ve" "it’s" "it’s" "it’s" "they’re"
...
[201] "didn’t" "bush’s" "didn’t" "can’t" "there’s"
[206] "i’m" "other’s" "we’re"
 >

NB: What appears as the ' character above is actually the character hex 92, 
not hex 27, on my Windows system.
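
To make the distinction concrete, here is a minimal sketch in base R. The
vector `words` is made up for illustration, and the sketch assumes the hex 92
byte is read as Windows-1252, where it maps to U+2019 (the right single
quotation mark). Replacing it with the ASCII apostrophe (hex 27) before the
tm steps is only one possible pre-processing step, not a tm transformation.

```r
# Hypothetical sample: "\u2019" is the curly apostrophe (hex 92 in
# Windows-1252); "'" is the plain ASCII apostrophe (hex 27).
words <- c("don\u2019t", "we\u2019re", "can't")

# Flag the elements that contain the curly apostrophe
has_curly <- grepl("\u2019", words, fixed = TRUE)

# Replace the curly apostrophe with the ASCII one
normalized <- gsub("\u2019", "'", words, fixed = TRUE)

# Inspect the code points of the first word to see which character is present
utf8ToInt(words[1])
```

Whether this matches the bytes that scan() actually returns depends on the
encoding of the file and of the R session, so it may need adjusting.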

This should be a common problem in text processing, but I don't see a 
transformation in the tm package that handles this nicely. Is there 
something I've missed?

-Michael

-- 
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    Web:   http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA