[R] tm package: handling contractions

Fri Jan 27 19:07:05 CET 2012

On Friday 27 January 2012 at 09:50 -0500, Michael Friendly wrote:
> I tried making a wordcloud of Obama's State of the Union address using 
> the tm package to process the text
> 
> library(tm)         # Corpus, tm_map, TermDocumentMatrix, ...
> library(wordcloud)  # wordcloud()
> sotu <- scan(file="c:/R/data/sotu2012.txt", what="character")
> sotu <- tolower(sotu)
> corp <- Corpus(VectorSource(paste(sotu, collapse=" ")))
> 
> corp <- tm_map(corp, removePunctuation)
> corp <- tm_map(corp, stemDocument)
> corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
> tdm <- TermDocumentMatrix(corp)
> m <- as.matrix(tdm)
> v <- sort(rowSums(m),decreasing=TRUE)
> d <- data.frame(word = names(v),freq=v)
> 
> wordcloud(d$word,d$freq)
> 
> I ended up with a large number of contractions that were split at the 
> "’" character, e.g., "don’t" --> "don'"
> e.g.,
> 
>  > sotu[grep("’", sotu)]
> [1] "qaeda’s" "taliban’s" "america’s" "they’re" "don’t"
> [6] "we’re" "aren’t" "we’ve" "patton’s" "what’s"
> [11] "let’s" "weren’t," "couldn’t" "people’s" "didn’t"
> [16] "we’ve" "we’ve" "we’ve" "i’m" "that’s"
> [21] "world’s" "what’s" "can’t" "that’s" "it’s"
> [26] "lock’s" "let’s" "you’re" "shouldn’t" "you’re"
> [31] "you’re" "it’s" "i’ll" "we’re" "don’t"
> [36] "we’ve" "it’s" "it’s" "it’s" "they’re"
> ...
> [201] "didn’t" "bush’s" "didn’t" "can’t" "there’s"
> [206] "i’m" "other’s" "we’re"
>  >
> 
> NB: What appears as the ' character above is actually the character
> hex 92, not hex 27, on my Windows system.
> 
> This should be a common problem in text processing, but I don't see a 
> transformation in the tm package that
> handles this nicely. Is there something I've missed?
What result would you expect? As I see it, ideally, removePunctuation()
would remove these apostrophes. Looks like it doesn't; the code is:

removePunctuation <- function(x) UseMethod("removePunctuation", x)
removePunctuation.PlainTextDocument <- function(x)
    gsub("[[:punct:]]+", " ", x)

And ?regexp says:
     ‘[:punct:]’ Punctuation characters:
          ‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { |
          } ~’.

Maybe the ’ apostrophe (U+2019, which is what byte 0x92 maps to in
Windows-1252) should be added to the list? (FWIW, it's the character
Unicode recommends for the apostrophe.)

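In the meantime, a minimal sketch of a workaround, assuming the text is
read in as UTF-8 (or otherwise ends up with U+2019 for the apostrophe):
map the typographic apostrophe to the ASCII one before removing
punctuation. normalize_apostrophes() is a hypothetical helper of mine,
not something tm provides, and I use an empty replacement in gsub() here
only to show the effect of plain removal.

```r
# Hypothetical helper (not part of tm): map the typographic apostrophe
# U+2019 to the ASCII apostrophe U+0027.
normalize_apostrophes <- function(x) gsub("\u2019", "'", x, fixed = TRUE)

words <- c("don\u2019t", "we\u2019re", "america\u2019s")
plain <- normalize_apostrophes(words)

# The ASCII apostrophe is listed in [:punct:], so it is removed here.
stripped <- gsub("[[:punct:]]+", "", plain)
print(stripped)
```

Whether [[:punct:]] matches ’ directly seems to depend on the platform
and locale, so normalizing first looks like the safer route.
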
I discussed a related issue about apostrophes with Ingo Feinerer and
Kurt Hornik: in French, we'd need apostrophes (of any type, ' or ’) to
mark a separation between words, instead of concatenating the two parts
surrounding them. The conclusion was that a language-specific processor
was required (languages with non-Latin alphabets have many more
diacritic characters we don't even know about).

In English, I suspect it might be interesting to detect forms like "'re"
or "n't" and replace them with their full equivalents, i.e. "are" and
"not"; OTOH, genitive forms would probably be better removed (at least
by default). In the short term, Tyler's solution will work, but beware
that "we're" will become "were" if you remove punctuation ;-). An
alternative is to replace apostrophes with spaces so that suffixes are
considered as separate words (that's what I do in French ATM).
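
To make the two options concrete, here is a rough sketch in base R. All
three function names are hypothetical (none of them exists in tm), and
the special cases for "can't" and "won't" are my own addition, there
only so that the generic "n't" rule does not produce "ca not" or
"wo not". The list of contractions is far from complete.

```r
# Hypothetical helpers, not part of tm.

# Map the typographic apostrophe (U+2019) to the ASCII one first.
to_ascii_apostrophe <- function(x) gsub("\u2019", "'", x, fixed = TRUE)

# Option 1: expand a few contractions and drop the genitive 's.
expand_contractions <- function(x) {
  x <- to_ascii_apostrophe(x)
  x <- gsub("\\bcan't\\b", "can not", x, perl = TRUE)   # irregular form
  x <- gsub("\\bwon't\\b", "will not", x, perl = TRUE)  # irregular form
  x <- gsub("n't\\b", " not", x, perl = TRUE)
  x <- gsub("'re\\b", " are", x, perl = TRUE)
  x <- gsub("'s\\b", "", x, perl = TRUE)  # also turns "it's" into "it"
  x
}

# Option 2: split at apostrophes so suffixes become separate words.
split_at_apostrophes <- function(x) {
  gsub("'", " ", to_ascii_apostrophe(x), fixed = TRUE)
}

words <- c("we\u2019re", "don\u2019t", "can\u2019t", "patton\u2019s")
expanded <- expand_contractions(words)
split <- split_at_apostrophes(words)
print(expanded)
print(split)
```

Either function would have to run before removePunctuation(), e.g. via
tm_map(); with recent tm versions I believe a plain function like this
needs to be wrapped in content_transformer() first.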

Hope this helps