[R] Error in FUN(content(x), ...) : invalid input in 'utf8towcs' in Twitter Analysis using TM package
Shivi Bhatia
shivipmp82 at gmail.com
Tue Jan 24 18:40:41 CET 2017
Hi All,
I am working on a Twitter analysis using the tm package. Below is some of
my code:
1- Here I am creating a data frame of the data collected from Twitter:
chennai = as.data.frame(cbind(tweet = jallitext, date = jallidate,
                              lat = jallilat, lon = jallilon,
                              isretweet = isretweet, retweeted = retweeted,
                              retweetcount = retweetcount,
                              favorite = favoritesCount,
                              favorited = favorited))
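A side note on this step: cbind() on vectors of mixed types builds a character matrix, so every column of the resulting data frame ends up as text (or factors in older R versions). The sketch below uses made-up vectors, not the real tweet data, to show the difference from calling data.frame() directly.

```r
# Made-up stand-ins for the real vectors (jallitext, retweetcount, jallilat).
tweet <- c("first tweet", "second tweet")
retweetcount <- c(3L, 10L)
lat <- c(13.08, NA)

# cbind() coerces everything to character before the data frame is built.
via_cbind <- as.data.frame(cbind(tweet = tweet,
                                 retweetcount = retweetcount,
                                 lat = lat))

# data.frame() keeps each column's own type.
direct <- data.frame(tweet = tweet,
                     retweetcount = retweetcount,
                     lat = lat,
                     stringsAsFactors = FALSE)

str(via_cbind)
str(direct)
```

This does not affect the tm error itself, but it matters if the counts or coordinates are used numerically later.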
2- I then build the corpus:
corpus <- Corpus(VectorSource(chennai$tweet))
The output gives me:
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 6000
However, while converting the text to lower case using the tm package, I
get this error:
Error in FUN(content(x), ...) :
invalid input 'RT @Aariactor: Officially #jallikattu protest is over
yesterday we won í ½í²ªí ¼í¿» thx to government í ½í¹ í ¼í¿»' in
'utf8towcs'.
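The unreadable characters in the message look like badly encoded emoji, and tolower() fails on input that is not valid in the current encoding. One possible workaround, a minimal sketch and not tested on the real tweets, is to strip such bytes with base R's iconv() before any text processing. The helper name clean_tweets and the sample strings are assumptions for illustration only.

```r
# Hypothetical helper (not part of tm): drop every byte that cannot be
# converted from UTF-8 to ASCII, so tolower() receives clean input.
clean_tweets <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

# Made-up example: the first tweet carries the byte sequence of a badly
# encoded emoji, which is not valid UTF-8.
bad_bytes <- as.raw(c(0xed, 0xa0, 0xbd, 0xed, 0xb2, 0xaa))
tweets <- c(
  paste0("We WON ", rawToChar(bad_bytes), " THX"),
  "Plain ASCII Tweet"
)

cleaned <- clean_tweets(tweets)
lowered <- tolower(cleaned)
print(lowered)
```

Note that this also removes legitimate non-ASCII characters such as accented letters, so it only suits analyses where losing them is acceptable.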
After a lot of research, I am now using this code:
tryTolower = function(x)
{
  # create missing value
  # this is where the returned value will be
  y = NA
  # tryCatch error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y = tolower(x)
  return(y)
}
corpus<- sapply(corpus, function(x) tryTolower(x))
This converts the tweets to lower case, but when I create a document-term
matrix I get this error:
Jalli<- DocumentTermMatrix(corpus)
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of
class "character"
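The class "character" in the message suggests that sapply() has replaced the corpus with a plain character vector, which DocumentTermMatrix() has no method for. A sketch of one way to keep the corpus class, assuming the tm functions tm_map() and content_transformer() and using made-up tweets and VCorpus() in place of the real corpus:

```r
library(tm)

# Made-up tweets standing in for chennai$tweet.
tweets <- c("RT @user: Officially #jallikattu protest is OVER",
            "Thx to Government")

corpus <- VCorpus(VectorSource(tweets))

# tm_map() applies the function to each document and returns a corpus;
# content_transformer() wraps a plain function such as tolower so that
# it acts on the document content and the document structure is kept.
corpus <- tm_map(corpus, content_transformer(tolower))
print(class(corpus))

dtm <- DocumentTermMatrix(corpus)
print(dim(dtm))
```

A wrapper such as tryTolower could be passed to content_transformer() in place of tolower. Alternatively, the character vector returned by sapply() could be turned back into a corpus with VectorSource() before building the matrix.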
Could you please help me with this error? Thank you.
More information about the R-help
mailing list