[R] Problems with tm package, removeWords and transformations

Renato Medei medei.ren at gmail.com
Tue Feb 10 15:52:58 CET 2015


Dear all,
I'm sorry, but like all newbies I have a lot of problems to solve.
I'm using R 3.1.2 under OS X 10.10.2.
I'm working with tm to analyze some tweets, and I received some strange
errors when I tried to remove stopwords (see Error 1 below), to transform
content (see Error 2 below) and to create a document-term matrix (see
Error 3 below).
Could anyone help me?

Error 1
> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace,
"(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
"Riviera", "riviera"))
> tweets_corpus <- tm_map(tweets_corpus, stopwords("italian"))
Warning message:
In mclapply(content(x), FUN, ...) :
  all scheduled cores encountered errors in user code
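
A likely cause of the warning in Error 1: tm_map() expects a function as its second argument, while stopwords("italian") returns a character vector of words. A minimal sketch of the usual pattern follows, in which removeWords is the transformation and the stopword list is its argument. The toy documents are illustrative stand-ins for the tweets, and the lowercasing step is added because removeWords matches case-sensitively.

```r
library(tm)

# Toy documents standing in for the tweets (illustrative data, not from the post)
docs <- c("Il mare di notte", "Una giornata al sole")
corpus <- VCorpus(VectorSource(docs))

# stopwords("italian") is a character vector, so it cannot be used as FUN
italian_stopwords <- stopwords("italian")
is_word_vector <- is.character(italian_stopwords)

# Lowercase first: the stopword list is lowercase and removeWords is case-sensitive
corpus <- tm_map(corpus, content_transformer(tolower))

# Pass removeWords as the function and the stopword list as its argument
corpus <- tm_map(corpus, removeWords, italian_stopwords)

first_doc <- as.character(corpus[[1]])
```

Whether this also resolves the warnings in the other two transcripts is not established here; they do not call stopwords() at all.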

Error 2
> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> tweets_corpus
<<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace,
"(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
"Riviera", "riviera"))
> tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
Warning message:
In mclapply(content(x), FUN, ...) :
  all scheduled cores encountered errors in user code
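
The warning above is raised by mclapply(), which reports only that the worker processes failed, not why. One way to surface the underlying error is to apply the same function to each text serially and record any error message. The sketch below uses base R only; the helper name find_failures and the toy function picky_lower are hypothetical and exist purely for illustration.

```r
# Hypothetical helper: apply `fun` to each text one at a time and
# collect the error message (if any) for each element
find_failures <- function(texts, fun) {
  msgs <- vapply(texts, function(txt) {
    tryCatch({
      fun(txt)
      NA_character_            # no error for this element
    }, error = function(e) conditionMessage(e))
  }, character(1), USE.NAMES = FALSE)
  failed <- which(!is.na(msgs))
  data.frame(index = failed, message = msgs[failed], stringsAsFactors = FALSE)
}

# Toy function that fails on one kind of input (illustration only)
picky_lower <- function(x) {
  if (grepl("bad", x)) stop("cannot process this text")
  tolower(x)
}

failures <- find_failures(c("Ciao", "a bad tweet", "Mare"), picky_lower)
```

With the data from the post, the equivalent call would be find_failures(tweets, tolower) on the character vector returned by sapply(), before the corpus is built. If specific tweets fail there, their text (for example unusual characters or encodings) is the place to look.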


Error 3

> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> tweets_corpus
<<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace,
"(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
"Riviera", "riviera"))
> dtm <- DocumentTermMatrix(tweets_corpus)
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  NAs introduced by coercion


Thank you for your help
