[R] tm package: handling contractions
Michael Friendly
friendly at yorku.ca
Fri Jan 27 15:50:51 CET 2012
I tried making a word cloud of Obama's State of the Union address, using
the tm package to process the text:
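library(tm)         # Corpus, tm_map, TermDocumentMatrix, stopwords
library(wordcloud)  # wordcloud
# read the speech as a vector of whitespace-separated tokens
sotu <- scan(file = "c:/R/data/sotu2012.txt", what = "character")
sotu <- tolower(sotu)
# build a one-document corpus and clean it
corp <- Corpus(VectorSource(paste(sotu, collapse = " ")))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, function(x) removeWords(x, stopwords()))
# term frequencies, sorted from most to least frequent
tdm <- TermDocumentMatrix(corp)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq)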
I ended up with a large number of contractions that were split at the
"’" character, e.g., "don’t" --> "don'". The affected words in the input are:
> sotu[grep("’", sotu)]
[1] "qaeda’s" "taliban’s" "america’s" "they’re" "don’t"
[6] "we’re" "aren’t" "we’ve" "patton’s" "what’s"
[11] "let’s" "weren’t," "couldn’t" "people’s" "didn’t"
[16] "we’ve" "we’ve" "we’ve" "i’m" "that’s"
[21] "world’s" "what’s" "can’t" "that’s" "it’s"
[26] "lock’s" "let’s" "you’re" "shouldn’t" "you’re"
[31] "you’re" "it’s" "i’ll" "we’re" "don’t"
[36] "we’ve" "it’s" "it’s" "it’s" "they’re"
...
[201] "didn’t" "bush’s" "didn’t" "can’t" "there’s"
[206] "i’m" "other’s" "we’re"
>
NB: What appears as the ' character above is actually the character hex 92
(the right single quotation mark in Windows-1252), not hex 27 (the ASCII
apostrophe), on my Windows system.
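To make the hex 92 vs. hex 27 distinction concrete, here is a minimal base-R
sketch. It assumes the text really is Windows-1252 encoded, and
normalize_apostrophes() is a hypothetical helper name, not a function from tm.

```r
# Hypothetical helper (not part of tm): replace the typographic apostrophe
# U+2019 with the ASCII apostrophe (hex 27).
# 'from' is the assumed source encoding; leave it NULL if x is already UTF-8.
normalize_apostrophes <- function(x, from = NULL) {
  if (!is.null(from)) {
    # In Windows-1252, byte 0x92 maps to U+2019 once converted to UTF-8
    x <- iconv(x, from = from, to = "UTF-8")
  }
  gsub("\u2019", "'", x, fixed = TRUE)
}

# "don’t" built from raw bytes, as it would sit in a Windows-1252 file
word_cp1252 <- rawToChar(as.raw(c(0x64, 0x6f, 0x6e, 0x92, 0x74)))
charToRaw(word_cp1252)                        # 64 6f 6e 92 74

fixed <- normalize_apostrophes(word_cp1252, from = "CP1252")
charToRaw(fixed)                              # 64 6f 6e 27 74

# Input that is already UTF-8 needs no conversion step
fixed_utf8 <- normalize_apostrophes(c("we\u2019re", "it's"))
```

Something like sotu <- normalize_apostrophes(sotu, from = "CP1252") could be
run before building the corpus, if the file is in fact Windows-1252. Note that
removePunctuation would then still strip the ASCII apostrophe, so "don't"
would become "dont" rather than being split.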
This should be a common problem in text processing, but I don't see a
transformation in the tm package that handles this nicely. Is there
something I've missed?
-Michael
--
Michael Friendly Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street Web: http://www.datavis.ca
Toronto, ONT M3J 1P3 CANADA