[R] Troubles with stemming (tm + Snowball packages) under MacOS

Julien Velcin julien.velcin at univ-lyon2.fr
Fri Jan 13 15:49:22 CET 2012


Dear all,

I am having trouble with the stemming algorithm provided by the tm
(text mining) and Snowball packages.
Here is my configuration:

MacOS 10.5
R 2.12.0 / R 2.13.1 / R 2.14.1 (I have tried several versions)

I have installed all the required packages (tm, rJava, RWeka, Snowball)
and their dependencies. I have disabled AWT, as described in
http://r.789695.n4.nabble.com/Problem-with-Snowball-amp-RWeka-td3402126.html,
with:

Sys.setenv(NOAWT=TRUE)

The command tm_map(reuters, stemDocument) gives the following errors:

- First time:
Error in .jnew(name) :
   java.lang.InternalError: Can't start the AWT because Java was  
started on the first thread.  Make sure StartOnFirstThread is not  
specified in your application's Info.plist or on the command line
Refreshing GOE props...

- Second time:
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
Stemmer 'porter' unknown!
Stemmer 'english' unknown!
(etc.)

I have already searched the Web for a solution, but I have found nothing
useful.

Here is the full source code (all the libraries are already loaded):
------
Sys.setenv(NOAWT=TRUE)
source <- ReutersSource("reuters-21578.xml", encoding="UTF-8")
reuters <- Corpus(source)
reuters <- tm_map(reuters, as.PlainTextDocument)
reuters <- tm_map(reuters, removePunctuation)
reuters <- tm_map(reuters, tolower)
reuters <- tm_map(reuters, removeWords, stopwords("english"))
reuters <- tm_map(reuters, removeNumbers)
reuters <- tm_map(reuters, stripWhitespace)
reuters <- tm_map(reuters, stemDocument)
------
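
A reduced test may help isolate the problem from tm. The sketch below is not verified on this setup. It assumes a fresh R session in which NOAWT is set before any Java-dependent package is loaded, and it assumes the Snowball package exports SnowballStemmer() and getStemLanguages(); the two sample words are only illustrative.

```r
# Run in a fresh R session, before rJava, RWeka or Snowball are loaded.
Sys.setenv(NOAWT = TRUE)

# Confirm the variable is visible in this session's environment.
noawt <- Sys.getenv("NOAWT")
print(noawt)

# Load Snowball only after the variable is set; skip quietly if it is
# not installed on this machine.
if (require("Snowball", character.only = TRUE, quietly = TRUE)) {
  # List the stemmers Snowball reports as available.
  print(getStemLanguages())
  # Stem two sample words directly, bypassing tm_map() and the corpus.
  print(SnowballStemmer(c("running", "houses"), language = "english"))
}
```

If the direct call fails in the same way, the problem would seem to lie in the Snowball/RWeka/Java layer rather than in the tm pipeline above.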

Thank you for your help,

Julien


