[R] Problems with tm package, removeWords and transformations
Amos B. Elberg
amos.elberg at gmail.com
Tue Feb 10 20:08:55 CET 2015
If you use tm to analyze tweets, you're going to run into a long stream of issues like the ones you found, most of which relate to text formatting. I worked through them over the past few months for a project. If you email me offline I'll try to help and share some example code.
> On Feb 10, 2015, at 9:52 AM, Renato Medei <medei.ren at gmail.com> wrote:
>
> Dear all,
> I'm sorry, but like all newbies I have a lot of problems to solve.
> I'm using R 3.1.2 under OS X 10.10.2.
> I'm working with tm to analyze some tweets and I received some strange
> errors when I tried to remove stopwords (see error 1 below), to transform
> content (see error 2 below) and to create a document-term matrix (see
> error 3 below).
> Could anyone help me?
>
> Error 1
>> tweets = searchTwitter("rimini", n=1000)
>> tweets = sapply(tweets, function(x) x$getText())
>> tweets_corpus = Corpus(VectorSource(tweets))
>> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
>> tweets_corpus <- tm_map(tweets_corpus, toSpace,
> "(f|ht)tp(s?)://(.*)[.][a-z]+")
>> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
>> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
>> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
>> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
>> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
> "Riviera", "riviera"))
>> tweets_corpus <- tm_map(tweets_corpus, stopwords("italian"))
> Warning message:
> In mclapply(content(x), FUN, ...) :
> all scheduled cores encountered errors in user code
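
One likely cause of this warning is the last call: stopwords("italian") returns a character vector of words, not a transformation function, so tm_map() has nothing it can apply to each document. A minimal sketch of the usual form, which passes that vector to removeWords, follows. The two sample tweets are made up for illustration, and the tm package is assumed to be installed.

```r
library(tm)  # assumption: the tm package is installed

# Hypothetical stand-in for the tweets downloaded with searchTwitter()
tweets <- c("il mare di Rimini e la spiaggia",
            "una sera a Rimini con gli amici")
tweets_corpus <- VCorpus(VectorSource(tweets))

# stopwords("italian") only supplies the word list;
# removeWords is the transformation that receives it as an argument
tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("italian"))

# Removing words leaves gaps, so collapse the leftover whitespace
tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)

# Pull the text back out of each document to inspect the result
out <- sapply(tweets_corpus, content)
print(out)
```

Note that removeWords is case-sensitive, so lower-casing the text first (if that step succeeds on your data) lets the lower-case stopword list match more of it.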
>
> Error 2
>> tweets = searchTwitter("rimini", n=1000)
>> tweets = sapply(tweets, function(x) x$getText())
>> tweets_corpus = Corpus(VectorSource(tweets))
>> tweets_corpus
> <<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
>> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
>> tweets_corpus <- tm_map(tweets_corpus, toSpace,
> "(f|ht)tp(s?)://(.*)[.][a-z]+")
>> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
>> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
>> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
>> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
>> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
> "Riviera", "riviera"))
>> tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
> Warning message:
> In mclapply(content(x), FUN, ...) :
> all scheduled cores encountered errors in user code
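
With tweet text, a frequent reason for tolower() to fail is a string that contains bytes that are not valid in the current encoding, in which case it stops with an "invalid multibyte string" error. The warning above does not confirm that this is what happened here, so treat the following as a sketch under that assumption. It cleans the plain character vector in base R before any corpus is built. The sample tweets and the helper name clean_tweets are hypothetical, and the patterns use POSIX classes such as [[:space:]] so they do not depend on how the regex engine treats a backslash inside brackets.

```r
# Hypothetical sample standing in for sapply(tweets, function(x) x$getText());
# "\xe8" is a stray Latin-1 byte that is not valid UTF-8
tweets <- c("RT @user: Rimini \xe8 bella http://t.co/abc123",
            "Sole a RIMINI via @altro 2015!")

# Hypothetical helper (not part of tm): sanitize a character vector
clean_tweets <- function(x) {
  # Drop bytes that cannot be read as UTF-8 (sub = "" deletes them)
  x <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
  # Replace URLs and @mentions with a space
  x <- gsub("(f|ht)tps?://[^[:space:]]+", " ", x)
  x <- gsub("@[^[:space:]]+", " ", x)
  # Collapse runs of whitespace and trim the ends
  x <- gsub("[[:space:]]+", " ", x)
  x <- trimws(x)
  # Lower-casing is safe once the invalid bytes are gone
  tolower(x)
}

cleaned <- clean_tweets(tweets)
print(cleaned)
```

The cleaned vector can then be passed to Corpus(VectorSource(cleaned)) in place of the raw tweets.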
>
>
> Error 3
>
>> tweets = searchTwitter("rimini", n=1000)
>> tweets = sapply(tweets, function(x) x$getText())
>> tweets_corpus = Corpus(VectorSource(tweets))
>> tweets_corpus
> <<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
>> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
>> tweets_corpus <- tm_map(tweets_corpus, toSpace,
> "(f|ht)tp(s?)://(.*)[.][a-z]+")
>> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
>> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
>> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
>> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
>> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini",
> "Riviera", "riviera"))
>> dtm <- DocumentTermMatrix(tweets_corpus)
> Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow =
> length(allTerms), :
> 'i, j, v' different lengths
> In addition: Warning messages:
> 1: In mclapply(unname(content(x)), termFreq, control) :
> all scheduled cores encountered errors in user code
> 2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow =
> length(allTerms), :
> NAs introduced by coercion
>
>
> Thank you for your help