[R] Text Mining in R

Burhan ul haq ulhaqz at gmail.com
Tue May 17 18:44:57 CEST 2016


Hi,

Wishing you all well.

I am exploring text mining with R. Here is where I need help:

1. The starting point is a data frame

worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")
df1 <- data.frame(id=1:5, words=worder1)

here in dput format:

dput(df1)
structure(list(id = 1:5, words = structure(c(3L, 1L, 2L, 5L, 4L),
    .Label = c("are these the three samples?",
               "He speaks differently to you, aint it !",
               "I am, taking 2",
               "I saved 2500 this month.",
               "This is distilled - my dear, now give me $3"),
    class = "factor")), .Names = c("id", "words"),
    row.names = c(NA, -5L), class = "data.frame")


2. The corpus rituals ...

library(tm)

corp1 <- Corpus(VectorSource(df1$words))
inspect(corp1)
class(corp1)

corp1 <- tm_map(corp1, removeNumbers)
corp1 <- tm_map(corp1, removePunctuation)
corp1 <- tm_map(corp1, removeWords, stopwords("english"))
corp1 <- tm_map(corp1, stripWhitespace)
class(corp1)


3. Getting to the analysis

tdm1 <- TermDocumentMatrix(corp1)
inspect(tdm1[1:5,])
dtm1 <- DocumentTermMatrix(corp1)
inspect(dtm1[1:5,])

4. Now here is the problem

If I apply a transformation that is not listed in getTransformations(),
I can no longer convert the corpus to a TDM or DTM:

corp1 <- tm_map(corp1, tolower)
class(corp1)
tdm1.2 <- TermDocumentMatrix(corp1)
dtm1.2 <- DocumentTermMatrix(corp1)

The error returned is:

Error: inherits(doc, "TextDocument") is not TRUE

5. The explanations I found on the internet suggest either

a) corp1 <- tm_map(corp1, content_transformer(tolower))
which in my case returns the error:
Error in UseMethod("content", x) :
  no applicable method for 'content' applied to an object of class "character"

b) corpus_clean <- tm_map(corp1, PlainTextDocument)
which results in the loss of all the metadata
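
For reference, here is a minimal sketch of option (a) applied to a corpus rebuilt from the raw text. It assumes that the error reported under (a) arises because corp1 had already been passed through the plain tolower() call in step 4, which would leave character strings in place of text documents; that reading is a guess, not something confirmed here. The names corp2 and dtm2 are made up for this sketch, and VCorpus() is used explicitly so the result does not depend on what Corpus() returns in a given tm version.

```r
library(tm)

worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")

# Assumption: start again from the raw strings, so that no plain
# tolower() call has touched the corpus before content_transformer().
corp2 <- VCorpus(VectorSource(worder1))

# Wrap the base function so that tm_map() keeps each element a text document.
corp2 <- tm_map(corp2, content_transformer(tolower))
corp2 <- tm_map(corp2, removeNumbers)
corp2 <- tm_map(corp2, removePunctuation)
corp2 <- tm_map(corp2, removeWords, stopwords("english"))
corp2 <- tm_map(corp2, stripWhitespace)

# Every element should still inherit from "TextDocument" at this point.
class(corp2[[1]])

dtm2 <- DocumentTermMatrix(corp2)
inspect(dtm2)
```

Lowercasing first also lets removeWords() match capitalised stopwords such as "This", since stopwords("english") is all lower case.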

I would appreciate any help. Lastly, to keep the document ids in the R
corpus, should step 2 be changed to:
corp1 <- Corpus(DataframeSource(df1))

instead of:
corp1 <- Corpus(VectorSource(df1$words))
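
One possible way to carry the ids along is sketched below. It rests on an assumption about the installed tm version: in recent releases (0.7-2 and later, as far as I know) DataframeSource() expects the first two columns to be named doc_id and text, whereas older releases treated every row as one document. The names docs, corp3 and dtm3 are made up for this sketch.

```r
library(tm)

worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")
df1 <- data.frame(id = 1:5, words = worder1, stringsAsFactors = FALSE)

# Assumption: tm >= 0.7-2, where DataframeSource() requires the columns
# "doc_id" (unique character id) and "text" (document content).
docs <- data.frame(doc_id = as.character(df1$id),
                   text   = df1$words,
                   stringsAsFactors = FALSE)

corp3 <- VCorpus(DataframeSource(docs))
corp3 <- tm_map(corp3, content_transformer(tolower))

# The id is stored in each document's metadata ...
meta(corp3[[2]], "id")

# ... and is used to label the documents of the matrix.
dtm3 <- DocumentTermMatrix(corp3)
Docs(dtm3)
```

Because the ids in df1 are simply 1:5, they happen to coincide with the default numbering a VectorSource would produce; the difference only becomes visible with other id values.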

Thanks.


-----------------------------------------------------------------------------------------------------------------------------

Some of the references I explored:
http://stackoverflow.com/questions/25638503/tm-loses-the-metadata-when-applying-tm-map
http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument
http://stackoverflow.com/questions/24771165/r-project-no-applicable-method-for-meta-applied-to-an-object-of-class-charact
http://stackoverflow.com/questions/25551514/termdocumentmatrix-errors-in-r
http://stackoverflow.com/questions/20699111/tm-map-error-message-in-r
http://stackoverflow.com/questions/31996891/error-in-usemethodmeta-x-no-applicable-method-for-meta-applied-to-an-ob
http://stackoverflow.com/questions/11876740/r-stemming-a-string-document-corpus
