[R] Correct way to use a lemmatizer in package tm
Ashim Kapoor
@@h|mk@poor @end|ng |rom gm@||@com
Wed Aug 14 11:09:39 CEST 2019
Dear All,
I want to do lemmatization using the tm and textstem packages.
This is how I am currently doing it:
library("tm")
library("wordcloud")
library("RColorBrewer")
filePath <- "<path to any text file>"  # placeholder: replace with a real path
text <- readLines(filePath)
docs <- Corpus(VectorSource(text))
# Convert the text to lower case
docs <- tm_map(docs,content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text Lemmatization
library(textstem)
docs <- tm_map(docs, content_transformer(lemmatize_words))
My question: is the line above the correct way to do lemmatization? Can
someone please confirm?
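To make the question concrete, here is a minimal sketch comparing the two textstem functions that could be used in that line. It assumes, as the textstem documentation describes, that lemmatize_words() expects a vector of individual words while lemmatize_strings() expects a vector of strings and tokenizes them itself. The sample sentences and the variable names (doc_lines, as_words, as_strings) are made up for illustration only.

```r
library(textstem)

# Hypothetical sample: one element per line, as readLines() would return.
doc_lines <- c("the dogs were running", "she has bought two books")

# lemmatize_words() looks up each element as a single token, so a whole
# line is treated as one "word".
as_words <- lemmatize_words(doc_lines)

# lemmatize_strings() splits each element into tokens, lemmatizes the
# tokens, and pastes them back together.
as_strings <- lemmatize_strings(doc_lines)

print(as_words)
print(as_strings)
```

If the two printed results differ, it suggests that content_transformer(lemmatize_strings) may be the better fit here, since each document in the corpus holds a whole line rather than a single word.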
For completeness, here is the rest of the code:
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
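Since the file path above is only a placeholder, here is a small self-contained sketch that can be run without any input file, to check which terms end up in the term-document matrix. The two sentences are made up, VCorpus() is used in place of Corpus() purely as an assumption to keep the example simple, and lemmatize_strings() is swapped in for lemmatize_words() to see its effect. It is not meant as a confirmed answer.

```r
library(tm)
library(textstem)

# Hypothetical in-memory stand-in for readLines(filePath).
text <- c("The dogs were running.", "A dog is running again.")

docs <- VCorpus(VectorSource(text))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)

# Lemmatize each document's text as a whole string.
docs <- tm_map(docs, content_transformer(lemmatize_strings))

# Inspect which terms survive into the term-document matrix.
tdm <- TermDocumentMatrix(docs)
terms <- Terms(tdm)
print(terms)
```

Looking at the printed terms shows whether inflected forms such as "running" were reduced to their lemma before counting.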
Thank you,
Ashim