[R] Help with stemDocument

Triss.Ashton triss.ashton at unt.edu
Fri May 11 02:12:09 CEST 2012


Alekseiy, I tried your recommendation with several variations. It still does
not run.  I think the problem has to do with R 2.15 and the updated tm
package.  Everything runs under R 2.10 with the following code:

a <- Corpus(VectorSource(df$text)) # create corpus object
a <- tm_map(a, removePunctuation)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stripWhitespace)
a <- tm_map(a, stemDocument, language = "english")


Running the same code on R 2.15 results in:
1. removeWords working sometimes, and sometimes not.
2. stemDocument not working at all.

Both error out.  removeWords always stops reading the stopword list at the
same line number (I have added and removed words, with no difference).  The
errors and session info are below:

> a <- tm_map(a, removeWords, stopwords("english"))

Error in gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "",  : 
  invalid regular expression
'\b(a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|am|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|aren't|around|as|ask|asked|asking|asks|at|away|b|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|below|best|better|between|big|both|but|by|c|came|can|cannot|can't|case|cases|certain|certainly|clear|clearly|come|could|couldn't|d|did|didn't|differ|different|differently|do|does|doesn't|doing|done|don't|down|downed|downing|downs|during|e|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|f|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|g|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|h|had|hadn't|has|hasn't|have|haven't|having|he|he
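The error above shows the whole stopword list being pasted into a single
alternation pattern. The cause of the "invalid regular expression" failure is
not established in this thread, but one thing that could be tried is to remove
the words in small batches so that no single pattern gets long. The sketch
below uses base R only. The helper name remove_words_chunked, the chunk_size
argument, and the use of perl = TRUE with \Q...\E quoting are assumptions for
illustration and are not part of tm.

```r
# Hypothetical helper (not part of tm): remove whole words from a character
# vector in small batches, so each regular expression stays short.
remove_words_chunked <- function(x, words, chunk_size = 50L) {
  # Longest words first, so "can't" is tried before "can".
  words <- words[order(nchar(words), decreasing = TRUE)]
  # Split the word list into batches of at most chunk_size entries.
  batches <- split(words, ceiling(seq_along(words) / chunk_size))
  for (batch in batches) {
    # \Q...\E makes PCRE treat each word literally (assumes perl = TRUE).
    quoted <- paste0("\\Q", batch, "\\E")
    pattern <- sprintf("\\b(%s)\\b", paste(quoted, collapse = "|"))
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

# Example on a plain character vector; extra spaces are left behind, as with
# removeWords, so stripWhitespace (or similar) would still be needed.
remove_words_chunked("this is a test of the removal", c("is", "a", "of", "the"))
```

If this runs cleanly on a plain character vector, the same function could in
principle be passed to tm_map in place of removeWords, though that has not
been checked against tm 0.5-7.1.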


> a <- tm_map(a, stemDocument, language = "english") 
Error in .jnew(name) : java.lang.ClassNotFoundException
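
.jnew is an rJava function, so this error is raised while R is trying to
create a Java object. The thread does not say which class is missing. One
thing worth ruling out first is whether R can start a JVM at all and which
Java version it sees. The sketch below assumes rJava is installed; the helper
name java_version_seen_by_r is made up for illustration.

```r
# Diagnostic sketch: report the Java version that rJava can see, or NA if
# rJava is missing or the JVM cannot be started.
java_version_seen_by_r <- function() {
  if (!requireNamespace("rJava", quietly = TRUE)) {
    return(NA_character_)
  }
  tryCatch({
    rJava::.jinit()  # start (or attach to) the JVM
    rJava::.jcall("java/lang/System", "S", "getProperty", "java.version")
  }, error = function(e) NA_character_)
}

java_version_seen_by_r()
```

If this returns a version string, the JVM itself starts, and the next thing to
look at would be the class path reported by rJava::.jclassPath().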

SessionInfo:

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] topicmodels_0.1-5 slam_0.1-23       modeltools_0.2-19 lasso2_1.2-12    
 [5] pvclust_1.2-2     stringr_0.6       plyr_1.7.1        Snowball_0.0-8   
 [9] rJava_0.9-3       ggplot2_0.9.0     tm_0.5-7.1        twitteR_0.99.19  
[13] rjson_0.2.8       RCurl_1.91-1.1    bitops_1.0-4.1   

loaded via a namespace (and not attached):
 [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       MASS_7.3-17       
 [5] memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5
 [9] reshape2_1.2.1     RWeka_0.4-11       RWekajars_3.7.5-1  scales_0.2.0      
> 
Hi Triss, 

If you need to stem just one text in the Corpus, use
a[[n]] <- stemDocument(a[[n]])

Best,
-Alex
________________________________________
From: r-help-bounces@ [r-help-bounces@] on behalf of Triss.Ashton
[triss.ashton@]
Sent: 02 May 2012 21:09
To: r-help@
Subject: Re: [R] Help with stemDocument

I am also having a problem with stemDocument.  I can make it work by moving
the data into a Corpus:

>  a <- Corpus(VectorSource(df$text)) # create corpus object
>  a <- tm_map(a, stemDocument, language = "english")

but it is horribly slow.  I want to stem outside the Corpus object, like:

> df$text <- stemDocument(df$text, language = "english")

but it returns the original text.
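
One way to stem a plain character vector without going through a Corpus is to
split each string into words, stem the words, and paste them back together.
The sketch below is a workaround under stated assumptions, not how
stemDocument works internally: the helper name stem_text is made up, and the
default stemmer assumes the SnowballC package (which is not the Snowball
package loaded in this session) is installed.

```r
# Hypothetical helper: stem every whitespace-separated token of each string.
# `stemmer` is any function mapping a character vector of words to stems;
# the default assumes the SnowballC package is installed.
stem_text <- function(text,
                      stemmer = function(w) SnowballC::wordStem(w, language = "english")) {
  text <- as.character(text)  # df$text may be a factor
  vapply(text, function(doc) {
    tokens <- strsplit(doc, "[[:space:]]+")[[1]]
    tokens <- tokens[nzchar(tokens)]  # drop empty tokens from leading spaces
    paste(stemmer(tokens), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Intended use on the data frame from this thread:
# df$text <- stem_text(df$text)
```

Tokens are stemmed exactly as split, so a word with punctuation attached
(for example "prices,") is passed to the stemmer with the comma; removing
punctuation first avoids that.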

In fact, using the example in the tm package documentation does not work
either:

> data("crude")
> crude[[1]]
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
> stemDocument(crude[[1]], language = "english") # specify language
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
> stemDocument(crude[[1]]) # language not specified
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
>


--
View this message in context:
http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4604022.html
Sent from the R help mailing list archive at Nabble.com.



--
View this message in context: http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4625085.html


