[R] Cluster analysis using term frequencies
Sun Shine
phaedrusv at gmail.com
Tue Mar 24 12:55:59 CET 2015
Hi list
I am using the 'tm' package to review meeting notes at a school to
identify terms frequently associated with 'learning', 'sports', and
'extra-mural' activities, and then to sort any terms according to these
three headers in a way that could be supported statistically (as opposed
to, say, my own bias, etc.).
To accomplish this, I have done the following:
(1) After the usual pre-processing of the text data, loading it as a
corpus and then converting it into a document term matrix (called
'allTerms'), I have identified the 20 most frequently occurring terms in
the meeting notes and extracted these into a named vector called
'freqTerms'. Many of the terms returned have nothing to do with any of
the three themes of 'learning', 'sports', or 'extra-mural'.
(2) Therefore, I have also manually generated a list of terms and
synonyms for 'learning', 'sports', and 'extra-mural' (e.g. 'football',
'soccer', 'drama', 'chess') and then tested for the occurrence of each
of these terms in the corpus, e.g.:
> allTerms['soccer']
and have come up with a list of some 30 terms together with their
frequencies. I manually sorted these according to three headers
'learning', 'sports', and 'extra-mural' and dropped these into a table
in a word processing document. Some of these terms are also in the
freqTerms vector.
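Steps (1) and (2) can be sketched roughly as follows. This is only a minimal illustration in base R: the small matrix 'm' is a made-up stand-in for what as.matrix(allTerms) would return (documents in rows, terms in columns), and 'termFreq', 'myTerms' and 'found' are hypothetical names, not objects from my actual session.

```r
# Hypothetical stand-in for as.matrix(allTerms): documents in rows,
# terms in columns. The counts are invented for illustration only.
m <- matrix(
  c(3, 0, 5,
    2, 1, 4,
    0, 4, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("soccer", "chess", "meeting"))
)

# Step (1): total frequency of every term, most frequent first.
termFreq  <- sort(colSums(m), decreasing = TRUE)
freqTerms <- head(termFreq, 20)   # at most the 20 most frequent terms

# Step (2): look up a hand-made list of terms in one go.
# 'rugby' is deliberately absent to show that missing terms are skipped.
myTerms <- c("soccer", "chess", "rugby")
found   <- termFreq[intersect(myTerms, names(termFreq))]
found
```

Looking up the whole list at once with intersect() avoids testing each term by hand and quietly drops terms that never occur in the notes.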
What I want to do now is to use cluster analysis (hclust, from the
'stats' package) to plot a dendrogram of the terms I have manually
checked and put into the table, in order to see how similar the terms
are and whether they cluster in the same way that I manually sorted
them under the table column headers of 'learning', 'sports', and
'extra-mural'.
To do this, I dropped these manually sorted terms into a data frame
together with the associated values (which I called 'tes.df') and then
tried plotting this as follows:
> dtes <- dist(tes.df, method = 'euclidean')
> dtesFreq <- hclust(dtes, method = 'ward.D')
> plot(dtesFreq, labels = names(tes.df))
However, I get an error message when trying to plot this: "Error in
graphics:::plotHclust(n1, merge, height, order(x$order), hang, :
invalid dendrogram input".
I'm clearly screwing something up, either in my source data.frame or in
my setting hclust up, but don't know which, nor how.
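One plausible cause, offered as a guess rather than a diagnosis: dist() measures distances between the rows of tes.df, so hclust() clusters rows, whereas names(tes.df) returns the column names. If the number of labels does not match the number of clustered rows, plot() may reject the input. The sketch below uses a made-up 'tes.df' with assumed columns 'term' and 'freq' (the real data frame may be laid out differently) and carries the terms as row names so that no 'labels' argument is needed.

```r
# Hypothetical stand-in for 'tes.df': one row per term. The column names
# 'term' and 'freq' and all values are assumptions for illustration.
tes.df <- data.frame(
  term = c("football", "soccer", "chess", "drama", "reading", "maths"),
  freq = c(12, 9, 4, 5, 20, 17),
  stringsAsFactors = FALSE
)

# Pass only numeric columns to dist() and keep the terms as row names.
freqs <- as.matrix(tes.df[, "freq", drop = FALSE])
rownames(freqs) <- tes.df$term

dtes     <- dist(freqs, method = "euclidean")
dtesFreq <- hclust(dtes, method = "ward.D")

# names(tes.df) has one entry per column, not one per clustered row.
length(names(tes.df))    # 2 (columns)
length(dtesFreq$order)   # 6 (rows that were clustered)

# Labels default to the row names carried through dist().
plot(dtesFreq)
```

If the terms live in a column rather than in the row names, passing that column (here tes.df$term) as 'labels' should work equally well, as long as its length equals the number of rows.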
More than just identifying the error however, I am interested in finding
a smart (efficient/ elegant) way of checking the occurrence and
frequency value of the terms that may be associated with 'sports',
'learning', and 'extra-mural' and extracting these into a matrix or data
frame so that I can analyse and plot their clustering to see whether
the way I associated these terms is actually supported statistically.
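To make the goal concrete, here is a rough sketch of the kind of workflow I have in mind, in base R only. Everything in it is assumed for illustration: 'm' stands in for as.matrix(allTerms), 'themes' is a hypothetical hand-made list of terms per header, and the counts are invented. It clusters each term on its counts across documents rather than on a single total, which is a different choice from what I tried above.

```r
# Hypothetical stand-in for as.matrix(allTerms): documents in rows,
# terms in columns. All counts are invented.
m <- matrix(
  c(3, 0, 1, 2, 0,
    2, 1, 0, 3, 0,
    0, 4, 2, 0, 1,
    0, 3, 3, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("football", "chess", "drama", "soccer", "the"))
)

# Hypothetical hand-made term lists, one per header.
themes <- list(
  sports        = c("football", "soccer", "rugby"),
  "extra-mural" = c("chess", "drama")
)

wanted  <- unlist(themes, use.names = FALSE)
present <- intersect(wanted, colnames(m))  # terms that occur in the notes
absent  <- setdiff(wanted, colnames(m))    # terms that never occur

# Total frequency of each listed term that occurs.
freq <- colSums(m[, present, drop = FALSE])

# One row per term, one column per document: the term's usage profile.
term_profiles <- t(m[, present, drop = FALSE])
hc     <- hclust(dist(term_profiles), method = "ward.D")
groups <- cutree(hc, k = length(themes))

# Cross-tabulate the manual headers against the clusters found.
manual <- rep(names(themes), lengths(themes))[match(present, wanted)]
table(manual, groups)
plot(hc)
```

Clustering on a single frequency value per term would only group terms that are similarly common. Using the per-document counts instead groups terms that tend to appear in the same meeting notes, which seems closer to asking whether the manual headers are supported by the data.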
I'm sure that there must be a way of doing this in R, but I'm obviously
not going about it correctly. Can anyone shine a light please?
Thanks for any help/ guidance.
Regards,
Sun