[R] How to reduce the sparseness in a TDM to make a cluster plot readable?

Abby Spurdle
Thu Sep 17 09:43:12 CEST 2020


I'm not familiar with these subjects, and hopefully someone who is
will offer some better suggestions.

But to get things started, maybe...
(1) What packages are you using (re: tdm)?
(2) Where does the problem happen, in dist, hclust, the plot method
for hclust, or in the package(s) you are using?
(3) Do you think you could produce a small reproducible example,
showing what is wrong, and explaining what you would like it to do instead?

Note that if the problem relates to hclust, or the plot method, then
you should be able to produce a much simpler example.
e.g.

    mycount.matrix <- matrix (rpois (25000, 20),, 5)
    head (mycount.matrix, 3)
    tail (mycount.matrix, 3)

    plot (hclust (dist (mycount.matrix) ) )
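
If that simulated example shows the same solid black band, then the issue is probably just the number of leaves (one per row), rather than sparsity as such. Here is a rough sketch of two ways to get a smaller tree from the same kind of matrix. It assumes the real TDM behaves like an ordinary matrix with terms as rows and documents as columns; the names `n.keep`, `small.matrix`, `hc.rows` and `hc.cols` are made up for illustration, and keeping the 30 largest rows is an arbitrary choice, not a recommendation.

```r
# Simulated stand-in for a term-document matrix:
# 5000 rows ("terms") and 5 columns ("documents").
set.seed(1)
mycount.matrix <- matrix(rpois(25000, 20), ncol = 5)
rownames(mycount.matrix) <- paste0("term", seq_len(nrow(mycount.matrix)))
colnames(mycount.matrix) <- paste0("doc", seq_len(ncol(mycount.matrix)))

# Option 1 (assumption: only the most frequent rows are of interest).
# Keep the n.keep rows with the largest totals, so the tree has n.keep leaves.
n.keep <- 30
top.rows <- order(rowSums(mycount.matrix), decreasing = TRUE)[seq_len(n.keep)]
small.matrix <- mycount.matrix[top.rows, ]
hc.rows <- hclust(dist(small.matrix), method = "complete")
plot(hc.rows, main = "Rows with the largest totals", xlab = "", sub = "")

# Option 2 (assumption: the columns are what should be compared).
# dist() works on rows, so transpose first; the tree then has one leaf per column.
hc.cols <- hclust(dist(t(mycount.matrix)), method = "complete")
plot(hc.cols, main = "Columns", xlab = "", sub = "")
```

Either way, the number of leaves in the plot equals the number of rows passed to dist, so that is the quantity to bring down.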

On Tue, Sep 15, 2020 at 6:54 AM Andrew <phaedrusv using gmail.com> wrote:
>
> Hello all
>
> I am doing some text mining on a set of five plain text files and have
> run into a snag when I run hclust in that there are just too many leaves
> for anything to be read. It returns a solid black line.
>
> The texts have been converted into a TDM which has a dim of 5,292 and 5
> (as per 5 docs).
>
> My code for removing sparsity is as follows:
>
>  > tdm2 <- removeSparseTerms(tdm, sparse=0.99999)
>
>  > inspect(tdm2)
>
> <<TermDocumentMatrix (terms: 5292, documents: 5)>>
> Non-/sparse entries: 10415/16045
> Sparsity           : 61%
> Maximal term length: 22
> Weighting          : term frequency (tf)
>
> While the tf-idf weighting returns this when 0.99999 sparseness is removed:
>
>  > inspect(tdm.tfidf)
> <<TermDocumentMatrix (terms: 5292, documents: 5)>>
> Non-/sparse entries: 7915/18545
> Sparsity           : 70%
> Maximal term length: 22
> Weighting          : term frequency - inverse document frequency
> (normalized) (tf-idf)
>
> I have experimented with lowering the value of the sparse argument,
> and that helps a bit, for example:
>
>  > tdm2 <- removeSparseTerms(tdm, sparse=0.215)
>  > inspect(tdm2)
> <<TermDocumentMatrix (terms: 869, documents: 5)>>
> Non-/sparse entries: 3976/369
> Sparsity           : 8%
> Maximal term length: 14
> Weighting          : term frequency (tf)
>
> But, no matter what I do, the resulting plot is unreadable. The code for
> plotting the cluster is:
>
>  > hc <- hclust(dist(tdm2, method = "euclidean"), method = "complete")
>  > plot(hc, yaxt = 'n', main = "Hierarchical clustering")
>
> Can someone kindly either advise me what I am doing wrong and/or
> signpost me to some detailed info on how to fix this?
>
> Many thanks in anticipation.
>
> Andy