[R] Help with stemDocument
Triss.Ashton
triss.ashton at unt.edu
Fri May 18 17:30:27 CEST 2012
Thanks Milan, it is running now. Part of the problem, as you
suggested, was the packages. Although I had just installed RWeka,
Snowball and the like, they were out of date, so updating them fixed
stemDocument. As for removeWords, it began working once I cut my data in
half. Apparently there are some memory management issues I have yet to
figure out. Thanks again for the help.
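
For anyone who hits the same removeWords error: the pattern in the error
message below is one long alternation built from the whole stopword list.
A possible workaround, sketched here in base R only, is to remove the words
in small batches so each regular expression stays short. The helper name
remove_words_chunked and the chunk_size argument are made up for this
example and are not part of tm; whether this avoids the error on a given
setup is an assumption, not something tested in this thread.

```r
# Hypothetical helper (not part of tm): remove words in small batches so
# that each regular expression passed to gsub() stays short.
remove_words_chunked <- function(x, words, chunk_size = 50L) {
  # Split the word list into groups of at most chunk_size words
  groups <- split(words, ceiling(seq_along(words) / chunk_size))
  for (g in groups) {
    # Same pattern shape as in the error message: \b(w1|w2|...)\b
    pattern <- sprintf("\\b(%s)\\b", paste(g, collapse = "|"))
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

txt <- c("the price of crude oil", "a barrel of oil")
remove_words_chunked(txt, c("the", "of", "a"), chunk_size = 2L)
```

The removed words leave extra spaces behind, so a whitespace clean-up step
is still needed afterwards.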
Triss
Milan Bouchet-Valat wrote
>
> On Thursday 10 May 2012 at 17:12 -0700, Triss.Ashton wrote:
>> Alekseiy, I tried your recommendation with several variations. It still
>> does not run. I think the problem has to do with R2.15 and the refreshed
>> TM package.
> It works here with R 2.15.0 and tm 0.5-7.2 (development version), all
> other relevant packages of the same version as you (but on Linux 64
> bits). So it might not be the problem.
>
> I'm using the docs example as a test:
> data("crude")
> crude[[1]]
> stemDocument(crude[[1]])
>
>> Everything runs under R2.10 with the following code:
>>
>> a <- Corpus(VectorSource(df$text)) # create corpus object
>> a <- tm_map(a, removePunctuation)
>> a <- tm_map(a, removeNumbers)
>> a <- tm_map(a, removeWords, stopwords("english"))
>> a <- tm_map(a, stripWhitespace)
>> a <- tm_map(a, stemDocument, language = "english")
> Let's focus on the example from the docs, since it's simple. Anyway, your
> example is not reproducible since you do not provide the original data.
>
>>
>> This same code run on R2.15 results in:
>> 1. removeWords working sometimes, and sometimes not.
>> 2. stemDocument absolutely not working.
>>
>> Both error out. removeWords always stops reading in the stopword list on
>> the same line number (I have added and subtracted words - no difference).
>> Session info is below:
>>
>> > a <- tm_map(a, removeWords, stopwords("english"))
>>
>> Error in gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", :
>> invalid regular expression
>> '\b(a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|am|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|aren't|around|as|ask|asked|asking|asks|at|away|b|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|below|best|better|between|big|both|but|by|c|came|can|cannot|can't|case|cases|certain|certainly|clear|clearly|come|could|couldn't|d|did|didn't|differ|different|differently|do|does|doesn't|doing|done|don't|down|downed|downing|downs|during|e|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|f|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|g|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|h|had|hadn't|has|hasn't|have|haven't|having|he|he
>>
>>
>> > a <- tm_map(a, stemDocument, language = "english")
>> Error in .jnew(name) : java.lang.ClassNotFoundException
> This error suggests you should reconfigure Java. Have you tried
> reinstalling rJava, Snowball, RWekajars and RWeka?
>
>> SessionInfo:
>>
>> > sessionInfo()
>> R version 2.15.0 (2012-03-30)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252
>> [2] LC_CTYPE=English_United States.1252
>> [3] LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats4 grid stats graphics grDevices utils datasets
>> [8] methods base
>>
>> other attached packages:
>> [1] topicmodels_0.1-5 slam_0.1-23 modeltools_0.2-19 lasso2_1.2-12
>> [5] pvclust_1.2-2 stringr_0.6 plyr_1.7.1 Snowball_0.0-8
>> [9] rJava_0.9-3 ggplot2_0.9.0 tm_0.5-7.1 twitteR_0.99.19
>> [13] rjson_0.2.8 RCurl_1.91-1.1 bitops_1.0-4.1
>>
>> loaded via a namespace (and not attached):
>> [1] colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2 MASS_7.3-17
>> [5] memoise_0.1 munsell_0.3 proto_0.3-9.2 RColorBrewer_1.0-5
>> [9] reshape2_1.2.1 RWeka_0.4-11 RWekajars_3.7.5-1 scales_0.2.0
>> >
>> Hi Triss,
>>
>> If you need to stem just one text in the Corpus use a[[n]]<-stemDocument
>>
>> Best,
>> -Alex
>> ________________________________________
>> From: r-help-bounces@ [r-help-bounces@] on behalf of Triss.Ashton
>> [triss.ashton@]
>> Sent: 02 May 2012 21:09
>> To: r-help@
>> Subject: Re: [R] Help with stemDocument
>>
>> I am having a problem with stemDocument also. I can make it work by
>> moving the data into a Corpus by using:
>>
>> > a <- Corpus(VectorSource(df$text)) # create corpus object
>> > a <- tm_map(a, stemDocument, language = "english")
>>
>> but it is horribly slow. I want to stem outside the Corpus object like:
>>
>> > df$text <- stemDocument(df$text, language = "english")
>>
>> but it returns the original text.
>>
>> In fact, using the example in the tm package documentation does not work
>> either:
>>
>> > data("crude")
>> > crude[[1]]
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>> The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>> "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>> Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>> Reuter
>> > stemDocument(crude[[1]], language = "english") # specify language
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>> The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>> "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>> Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>> Reuter
>> > stemDocument(crude[[1]]) # language not specified
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>> The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>> "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>> Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>> Reuter
>> >
>>
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4604022.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help@ mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
--
View this message in context: http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4630523.html
Sent from the R help mailing list archive at Nabble.com.