[R] tm package: handling contractions
Milan Bouchet-Valat
nalimilan at club.fr
Fri Jan 27 19:07:05 CET 2012
On Friday 27 January 2012 at 09:50 -0500, Michael Friendly wrote:
> I tried making a wordcloud of Obama's State of the Union address using
> the tm package to process the text
>
> sotu <- scan(file="c:/R/data/sotu2012.txt", what="character")
> sotu <- tolower(sotu)
> corp <-Corpus(VectorSource(paste(sotu, collapse=" ")))
>
> corp <- tm_map(corp, removePunctuation)
> corp <- tm_map(corp, stemDocument)
> corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
> tdm <- TermDocumentMatrix(corp)
> m <- as.matrix(tdm)
> v <- sort(rowSums(m),decreasing=TRUE)
> d <- data.frame(word = names(v),freq=v)
>
> wordcloud(d$word,d$freq)
>
> I ended up with a large number of contractions that were split at the
> "’" character, e.g., "don’t" --> "don'"
> e.g.,
>
> > sotu[grep("’", sotu)]
> [1] "qaeda’s" "taliban’s" "america’s" "they’re" "don’t"
> [6] "we’re" "aren’t" "we’ve" "patton’s" "what’s"
> [11] "let’s" "weren’t," "couldn’t" "people’s" "didn’t"
> [16] "we’ve" "we’ve" "we’ve" "i’m" "that’s"
> [21] "world’s" "what’s" "can’t" "that’s" "it’s"
> [26] "lock’s" "let’s" "you’re" "shouldn’t" "you’re"
> [31] "you’re" "it’s" "i’ll" "we’re" "don’t"
> [36] "we’ve" "it’s" "it’s" "it’s" "they’re"
> ...
> [201] "didn’t" "bush’s" "didn’t" "can’t" "there’s"
> [206] "i’m" "other’s" "we’re"
> >
>
> NB: What appears as the ' character above is actually the character hex 92,
> not hex 27 on my Windows system.
>
> This should be a common problem in text processing, but I don't see a
> transformation in the tm package that
> handles this nicely. Is there something I've missed?
What result would you expect? As I see it, ideally, removePunctuation()
would remove these apostrophes. Looks like it doesn't; the code is:
    removePunctuation <- function(x) UseMethod("removePunctuation", x)
    removePunctuation.PlainTextDocument <- function(x)
        gsub("[[:punct:]]+", " ", x)
And ?regexp says:
‘[:punct:]’ Punctuation characters:
‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { |
} ~’.
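Since the list quoted above contains only the ASCII apostrophe, one workaround is to map the typographic apostrophe (U+2019) to the ASCII one before stripping punctuation. Here is a minimal sketch in base R. The name strip_punct and its replacement argument are mine, not part of tm, and the sketch assumes the text has been read so that R sees the character as U+2019:

```r
# Hypothetical helper, not part of tm: treat the typographic apostrophe
# (U+2019) like the ASCII one, then strip punctuation.
strip_punct <- function(x, replacement = "") {
  # Map U+2019 to the ASCII apostrophe so that [[:punct:]] covers it.
  x <- gsub("\u2019", "'", x, fixed = TRUE)
  # Replace each run of punctuation with `replacement`.
  gsub("[[:punct:]]+", replacement, x)
}

strip_punct(c("don\u2019t", "we\u2019re", "america\u2019s"))
```

With the default replacement the two halves of each word are glued together; passing replacement = " " splits them instead.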
Maybe the ’ apostrophe (U+2019, RIGHT SINGLE QUOTATION MARK) should be
added to the list? (FWIW, it's the character Unicode recommends for the
apostrophe.)
I discussed a related issue about apostrophes with Ingo Feinerer and
Kurt Hornik: in French, we'd need apostrophes (of any type, ' or ’) to
mark a separation between words, instead of concatenating the two parts
surrounding it. The conclusion was that a language-specific processor
was required (languages with non-Latin alphabets have many more diacritic
characters we don't even know about).
In English, I suspect it might be interesting to detect forms like "'re"
or "n't" and replace them with their full equivalents, i.e. "are" and
"not"; OTOH, genitive forms would probably better be removed (at least
by default). In the short term, Tyler's solution will work, but beware
that "we're" will become "were" if you remove punctuation ;-). An
alternative is to replace apostrophes with spaces so that suffixes are
considered as separate words (that's what I do in French ATM).
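To make the two options above concrete, here is a rough sketch in base R. The function names expand_contractions and split_on_apostrophe are hypothetical (nothing like them ships with tm), the list of suffixes is far from complete, and the input is assumed to be lower case already, as in the original code:

```r
# Hypothetical helpers, not part of tm. Input is assumed to be lower case.
expand_contractions <- function(x) {
  # Map the typographic apostrophe (U+2019) to the ASCII one first.
  x <- gsub("\u2019", "'", x, fixed = TRUE)
  # Irregular forms must be handled before the generic "n't" rule.
  x <- gsub("\\bcan't\\b", "cannot", x, perl = TRUE)
  x <- gsub("\\bwon't\\b", "will not", x, perl = TRUE)
  x <- gsub("n't\\b", " not", x, perl = TRUE)
  x <- gsub("'re\\b", " are", x, perl = TRUE)
  x <- gsub("'ve\\b", " have", x, perl = TRUE)
  x <- gsub("'ll\\b", " will", x, perl = TRUE)
  x <- gsub("'m\\b", " am", x, perl = TRUE)
  # Drops the genitive, but also turns "it's" into "it".
  gsub("'s\\b", "", x, perl = TRUE)
}

# Alternative: turn every apostrophe into a space so that the suffix
# becomes a separate token.
split_on_apostrophe <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)
  gsub("'", " ", x, fixed = TRUE)
}

expand_contractions(c("don\u2019t", "we\u2019re", "america\u2019s"))
split_on_apostrophe("l\u2019homme")
```

Either function would have to run before removePunctuation(). The original code passes a plain function to tm_map(); depending on the tm version, it may need wrapping in content_transformer().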
Hope this helps