[R] How to reduce the sparseness in a TDM to make a cluster plot readable?
Andrew
ph@edru@v @end|ng |rom gm@||@com
Mon Sep 14 20:53:40 CEST 2020
Hello all
I am doing some text mining on a set of five plain text files and have
run into a snag when I run hclust: there are far too many leaves for
any label to be read, and the bottom of the plot is a solid black band.
The texts have been converted into a TDM with dimensions 5,292 x 5
(5,292 terms by 5 documents).
My code for removing sparsity is as follows:
> tdm2 <- removeSparseTerms(tdm, sparse=0.99999)
> inspect(tdm2)
<<TermDocumentMatrix (terms: 5292, documents: 5)>>
Non-/sparse entries: 10415/16045
Sparsity : 61%
Maximal term length: 22
Weighting : term frequency (tf)
The tf-idf weighted TDM, with the same sparse=0.99999 setting, returns this:
> inspect(tdm.tfidf)
<<TermDocumentMatrix (terms: 5292, documents: 5)>>
Non-/sparse entries: 7915/18545
Sparsity : 70%
Maximal term length: 22
Weighting : term frequency - inverse document frequency
(normalized) (tf-idf)
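
A possible reading of the jump from 61% to 70% sparsity, offered as a sketch rather than a confirmed diagnosis: if the idf is computed as log2(number of documents / document frequency), any term that occurs in all five documents gets an idf of 0, so its whole row becomes zero after weighting. The drop in non-sparse entries, 10415 - 7915 = 2500, would then correspond to 500 such terms (500 x 5). The toy matrix and its term names below are made up, and only base R is used so the arithmetic is easy to follow.

```r
# Toy term-document matrix: 4 terms (rows) x 5 documents (columns).
# The counts and term names are made up for illustration only.
tdm_toy <- matrix(
  c(2, 1, 3, 1, 4,   # "the"   occurs in all 5 documents
    1, 0, 0, 2, 0,   # "ship"  occurs in 2 documents
    0, 0, 5, 0, 0,   # "whale" occurs in 1 document
    1, 1, 0, 1, 1),  # "sea"   occurs in 4 documents
  nrow = 4, byrow = TRUE,
  dimnames = list(c("the", "ship", "whale", "sea"), paste0("doc", 1:5))
)

n_docs   <- ncol(tdm_toy)
doc_freq <- rowSums(tdm_toy > 0)     # number of documents containing each term
idf      <- log2(n_docs / doc_freq)  # 0 for a term present in every document

# Normalised term frequency: each column divided by its document length.
tf_norm <- sweep(tdm_toy, 2, colSums(tdm_toy), "/")
tfidf   <- tf_norm * idf             # idf[i] multiplies every entry of row i

sum(tdm_toy != 0)  # 12 non-zero entries before weighting
sum(tfidf != 0)    # 7 non-zero entries after weighting ("the" is zeroed)
```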
I have experimented with lowering the sparse threshold, and that helps
a bit, for example:
> tdm2 <- removeSparseTerms(tdm, sparse=0.215)
> inspect(tdm2)
<<TermDocumentMatrix (terms: 869, documents: 5)>>
Non-/sparse entries: 3976/369
Sparsity : 8%
Maximal term length: 14
Weighting : term frequency (tf)
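
With only five documents, the sparse argument can have just a handful of distinct effects, because a term's sparsity can only be 0.8, 0.6, 0.4, 0.2 or 0. The sketch below assumes the rule described in the tm documentation, namely that a term is kept when the fraction of documents lacking it is strictly below sparse. The helper kept_doc_freqs is hypothetical and written in base R only. Under that assumption, sparse=0.215 keeps terms found in at least four documents, which agrees with the output above: 869 x 5 = 4345 = 3976 + 369.

```r
n_docs        <- 5
doc_freq      <- 1:5                      # a term can occur in 1 to 5 documents
term_sparsity <- 1 - doc_freq / n_docs    # 0.8, 0.6, 0.4, 0.2, 0.0

# Assumed rule: keep a term when its sparsity is strictly below `sparse`.
# Returns the document frequencies that survive a given threshold.
kept_doc_freqs <- function(sparse) doc_freq[term_sparsity < sparse]

kept_doc_freqs(0.99999)  # 1 2 3 4 5  (nothing is removed)
kept_doc_freqs(0.5)      # 3 4 5
kept_doc_freqs(0.215)    # 4 5
kept_doc_freqs(0.19)     # 5          (only terms present in every document)
```

Any two thresholds that fall between the same pair of sparsity values give the same result here, so fine-tuning sparse in small steps cannot change the term count.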
But, no matter what I do, the resulting plot is unreadable. The code for
plotting the cluster is:
> hc <- hclust(dist(tdm2, method = "euclidean"), method = "complete")
> plot(hc, yaxt = 'n', main = "Hierarchical clustering")
Can someone kindly advise me on what I am doing wrong, or point me to
some detailed information on how to fix this?
Many thanks in anticipation.
Andy