[R] Find common two and three word phrases in a corpus

Abraham Mathew abmathewks at gmail.com
Tue Oct 7 16:51:22 CEST 2014


Say I have a corpus and want to find the two-word, three-word, etc. phrases
that occur most frequently in the data. I normally do this in the following
manner, but I am now getting an error message and am having some difficulty
diagnosing what is going wrong. Given the following data, I would expect a
count of 2 for the two-word phrase "that sucks", since it appears twice.

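library(tm)          # Corpus(), VectorSource(), findFreqTerms()
library(RTextTools)  # create_matrix(); assuming this is where it comes from

dat = c("love it", "who goes there", "what is wrong", "that sucks",
        "that sucks")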

(corpus <- Corpus(VectorSource(dat)))

matrix <- create_matrix(corpus, ngramLength=2)

bww_freq = findFreqTerms(matrix, lowfreq=5)

Here is the error message I get when I attempt to create the matrix:

> (corpus <- Corpus(VectorSource(dat)))
<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>
> matrix <- create_matrix(corpus, ngramLength=2)
Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x),  :
  dims [product 5] do not match the length of object [3]


Can anyone tell me what could be going wrong, or suggest a workaround or
another package that would give me the desired result more efficiently?
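
To be concrete about the output I am after, here is a rough base-R sketch
that counts n-word phrases within each document. The helper name
count_ngrams is made up for illustration; it is not part of tm or
RTextTools, and it only splits on whitespace, so it is probably too naive
for a real corpus.

```r
# Example data from above; "that sucks" appears twice.
dat <- c("love it", "who goes there", "what is wrong", "that sucks",
         "that sucks")

# Hypothetical helper (not from tm or RTextTools): count the n-word phrases
# in a character vector, never joining words across separate documents.
count_ngrams <- function(docs, n = 2) {
  tokens <- strsplit(tolower(docs), "[[:space:]]+")
  phrases <- unlist(lapply(tokens, function(words) {
    words <- words[nzchar(words)]   # drop empty tokens from leading spaces
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(phrases), decreasing = TRUE)
}

bigrams <- count_ngrams(dat, n = 2)
bigrams[bigrams >= 2]   # keep only phrases seen at least twice
```

With the data above, the only phrase that survives the filter should be
"that sucks" with a count of 2; calling count_ngrams(dat, n = 3) would do
the same for three-word phrases.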
