[R] Error in FUN(content(x), ...) : invalid input in 'utf8towcs' in Twitter Analysis using TM package

Shivi Bhatia shivipmp82 at gmail.com
Tue Jan 24 18:40:41 CET 2017


Hi All,
I am working on a Twitter analysis using the tm package. Below is some of
my code:

1- Here I am creating a data frame from the data collected from Twitter:
chennai = as.data.frame(cbind(tweet = jallitext, date = jallidate,
                              lat = jallilat, lon = jallilon,
                              isretweet = isretweet, retweeted = retweeted,
                              retweetcount = retweetcount,
                              favorite = favoritesCount,
                              favorited = favorited))

2- Then I create the corpus:
corpus <- Corpus(VectorSource(chennai$tweet))
Printing it gives me:
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6000

However, while converting the text to lower case using the tm package, I get
this error:

Error in FUN(content(x), ...) :
  invalid input 'RT @Aariactor: Officially #jallikattu protest is over
yesterday we won í ½í²ªí ¼í¿» thx to government í ½í¹ í ¼í¿»' in
'utf8towcs'.
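
For reference, an error of this kind typically arises when the text contains bytes that are not valid in the current encoding (here, what appear to be mis-encoded emoji). A minimal sketch of one commonly suggested workaround is to strip such bytes with base R's iconv() before lowercasing. The sample strings below are hypothetical stand-ins for chennai$tweet, and converting to ASCII also drops every legitimate non-ASCII character, which may or may not be acceptable:

```r
# Hypothetical sample; the second string contains bytes that are not valid UTF-8.
tweets <- c("Officially #jallikattu protest is over",
            "we won \xed\xa0\xbd\xed\xb2\xaa thx")

# Convert to ASCII; sub = "" drops every byte that cannot be converted,
# so both invalid sequences and genuine non-ASCII characters are removed.
clean <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")

# tolower() no longer sees invalid multibyte input.
lowered <- tolower(clean)
print(lowered)
```

If the emoji need to be preserved rather than dropped, a different approach would be required; this sketch only shows how to get past the error.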

After a lot of research, I am now using this code:
tryTolower = function(x)
{
  # create missing value
  # this is where the returned value will be
  y = NA
  # tryCatch error
  try_error = tryCatch(tolower(x), error = function(e) e)
  # if not an error
  if (!inherits(try_error, "error"))
    y = tolower(x)
  return(y)
}
corpus<- sapply(corpus, function(x) tryTolower(x))
This converts the tweets to lower case, but when I create a document-term
matrix I get this error:

Jalli<- DocumentTermMatrix(corpus)
Error in UseMethod("TermDocumentMatrix", x) :
  no applicable method for 'TermDocumentMatrix' applied to an object of
class "character"
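
A likely cause of this second error is that sapply() returns a plain character vector, so the object is no longer a tm corpus when it reaches DocumentTermMatrix(). A minimal sketch of keeping the corpus class, assuming the tm package is installed and using hypothetical sample tweets in place of chennai$tweet, applies the transformation with tm_map() and content_transformer():

```r
library(tm)

# Hypothetical sample tweets standing in for chennai$tweet.
tweets <- c("RT Officially #jallikattu protest is over",
            "Yesterday we won")

corpus <- VCorpus(VectorSource(tweets))

# content_transformer() wraps a plain character function so that tm_map()
# returns text documents, and the result is still a corpus.
corpus <- tm_map(corpus, content_transformer(tolower))

# The corpus class is preserved, so this call can dispatch correctly.
dtm <- DocumentTermMatrix(corpus)
print(dim(dtm))
```

The tryTolower() function above could be passed to content_transformer() in the same way instead of tolower.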

I would appreciate any help with this error. Thank you.
