[R] Troubles with stemming (tm + Snowball packages) under MacOS
Julien Velcin
julien.velcin at univ-lyon2.fr
Fri Jan 13 15:49:22 CET 2012
Dear all,
I am having trouble using the stemming algorithm provided by the tm
(text mining) and Snowball packages.
Here is my config:
MacOS 10.5
R 2.12.0 / R 2.13.1 / R 2.14.1 (I have tried several versions)
I have installed all the required packages (tm, rJava, RWeka, Snowball)
and their dependencies. I have disabled AWT (as described in http://r.789695.n4.nabble.com/Problem-with-Snowball-amp-RWeka-td3402126.html)
with:
Sys.setenv(NOAWT=TRUE)
The command tm_map(reuters, stemDocument) gives the following errors:
- First time:
Error in .jnew(name) :
java.lang.InternalError: Can't start the AWT because Java was
started on the first thread. Make sure StartOnFirstThread is not
specified in your application's Info.plist or on the command line
Refreshing GOE props...
- Second time:
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
(etc.)
I have already searched the web for a solution, but I have found nothing
useful.
Here is the full source code (all the libraries are already loaded):
------
Sys.setenv(NOAWT=TRUE)
source <- ReutersSource("reuters-21578.xml", encoding="UTF-8")
reuters <- Corpus(source)
reuters <- tm_map(reuters, as.PlainTextDocument)
reuters <- tm_map(reuters, removePunctuation)
reuters <- tm_map(reuters, tolower)
reuters <- tm_map(reuters, removeWords, stopwords("english"))
reuters <- tm_map(reuters, removeNumbers)
reuters <- tm_map(reuters, stripWhitespace)
reuters <- tm_map(reuters, stemDocument)
------
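For reference, a minimal sketch of a sanity check for this setup, meant to be run in a fresh R session before any package is loaded. It rests on an assumption that is not confirmed anywhere in this message: that NOAWT only has an effect if it is set before the Java-backed packages start the JVM. The names pkgs and installed are purely illustrative.

```r
# Set NOAWT first, before any Java-backed package is loaded.
# Assumption: the variable is only read when the JVM starts.
Sys.setenv(NOAWT = TRUE)

# Sys.setenv() stores the value as a string; confirm it is visible.
stopifnot(identical(Sys.getenv("NOAWT"), "TRUE"))

# Packages named in this message. installed.packages() only reads the
# library index, so this check does not load anything or start Java.
pkgs <- c("tm", "rJava", "RWeka", "Snowball")
installed <- pkgs %in% rownames(installed.packages())
names(installed) <- pkgs
print(installed)

# Version details that are useful to attach to a report like this one.
print(R.version.string)
```

If the check passes, the libraries can be loaded afterwards and the code above rerun in the same session.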
Thank you for your help,
Julien