[R] Hidden Problems with Clustering Algorithms
|eo@m@d@ @end|ng |rom @yon|c@eu
Tue Nov 22 04:28:07 CET 2022
I recently came across a presentation on hierarchical clustering.
Unfortunately, it exhibits a hidden problem of clustering algorithms.
The problem runs deeper, and I think it warrants closer inspection
by the statistical community.
The presentation is available online. Both the scaled & non-scaled
versions show the problem.
de.NBI course - Advanced analysis of quantitative proteomics data using
R: 03b Clustering Part2
[Note: it's more like introductory notes to basic statistics]
- at 6:15 - 6:28 & 6:29 - 7:10 [2 versions, both non-scaled]
- at 5:51 - 6:10 [the scaled version]
- same problem at 7:56;
Non-Scaled Version: (e.g. the one at 6:15)
- the upper 2 rows are split into various sub-clusters;
- the top tree: a cluster is formed by the right-right sub-tree (some 17
"genes" or similar "activities" / "expressions");
- the left-most 2 "genes" are actually over-expressed "genes" and
functionally really belong to the previous/right sub-cluster;
Scaled-Version: (at 5:52)
- the left-most 2 "genes" are over-expressed at the same time with the
right cluster, and not otherwise;
Unfortunately, the 2 over-expressed genes (outliers or extreme values) are
split off from the relevant cluster and inserted as a separate
main branch in the top dendrogram. Merely swapping the main left & right
branches in the top tree would only mask this problem. The 2
pseudo-outliers are probably just the highest values in the larger
cluster of over-expressed "genes" (all the dark genes should belong to
the same cluster).
The middle sub-cluster shows really NO activity (some 16 "genes"). The
main branches in the top tree should really split between this
*NO*-activity cluster and the cluster showing activity (including the 2
massively over-expressed genes). The problem is present in the scaled
version as well.
The hierarchical clustering algorithm fails here. I have not analysed the
data, but several factors may contribute to this:
- "gene expression" or "activity" may not be linear, but exponential or
follow some power rule: a logarithmic transformation (or some other
transformation) may have been useful;
- simple distances between clusters may be too inaccurate;
- the variance in the low-activity (middle) cluster may be very low
(almost 0!), while the variance in the high-activity cluster may be much
higher: the Mahalanobis distance, or joining the sub-clusters based on
some z/t-test that takes the different variances into account, may be more
appropriate.
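As a rough illustration of the first point in the list above (a transformation before clustering), here is a minimal sketch in base R. The matrix `expr`, the gene names and all values are invented for this example; they are not the data from the presentation. The values are chosen so that two "genes" are far more active than a moderately active group, next to a group with essentially no activity.

```r
# Invented toy data (an assumption, not the data from the presentation):
# rows are "genes", columns are samples.
expr <- rbind(
  none1 = c(1.0, 1.2, 0.9, 1.1),       # essentially no activity
  none2 = c(0.8, 1.0, 1.1, 0.9),
  none3 = c(1.1, 0.9, 1.0, 1.2),
  act1  = c(90, 110, 100, 95),         # moderately active
  act2  = c(105, 95, 100, 110),
  act3  = c(100, 100, 90, 105),
  ext1  = c(2000, 2100, 1900, 2050),   # massively over-expressed
  ext2  = c(1950, 2000, 2100, 1900)
)

# Raw scale: Euclidean distance and complete linkage (the defaults of
# dist() and hclust()), then cut the tree into its 2 main branches.
raw_groups <- cutree(hclust(dist(expr)), k = 2)

# Same procedure after a log2 transformation.
log_groups <- cutree(hclust(dist(log2(expr))), k = 2)

# Compare the membership of the 2 main branches side by side.
print(cbind(raw = raw_groups, log2 = log_groups))
```

With these invented values, the raw scale puts the two extreme genes alone in one main branch, while the log2 scale splits the no-activity genes from all active ones. This only shows that the effect can arise from the scale of the data; whether it explains the heatmaps in the presentation would have to be checked on the actual data.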
These questions should be addressed by more senior statisticians.
I hope that the presentation remains online as is, since the clustering
problem is really easy to see and to analyse there. It is impossible to detect
and visualise such anomalies in a heatmap with 1,000 gene expressions or
with 10,000 genes, or with 500-1000 samples. It is very obvious on this
small example.
I do not know if there are any robust tools to validate the generated
trees. Inspecting a dendrogram with > 1,000 genes and hundreds
of samples by eye is really futile.
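One numerical diagnostic that does not require looking at the tree is the cophenetic correlation: the correlation between the original pairwise distances and the heights at which the pairs are joined in the dendrogram. The sketch below uses `cophenetic()` from base R on invented toy values (the matrix `expr` is an assumption, not the data from the presentation). It is not a full validation method, only a starting point.

```r
# Invented toy data (an assumption, not the data from the presentation).
expr <- rbind(
  none1 = c(1.0, 1.2, 0.9, 1.1),
  none2 = c(0.8, 1.0, 1.1, 0.9),
  none3 = c(1.1, 0.9, 1.0, 1.2),
  act1  = c(90, 110, 100, 95),
  act2  = c(105, 95, 100, 110),
  act3  = c(100, 100, 90, 105),
  ext1  = c(2000, 2100, 1900, 2050),
  ext2  = c(1950, 2000, 2100, 1900)
)

# Correlation between the input distances and the cophenetic distances
# (merge heights) of the resulting tree.
cophenetic_cor <- function(x) {
  d  <- dist(x)      # Euclidean distances between rows
  hc <- hclust(d)    # complete linkage (the default)
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

coph_raw <- cophenetic_cor(expr)        # raw scale
coph_log <- cophenetic_cor(log2(expr))  # log2 scale

print(c(raw = coph_raw, log2 = coph_log))
```

Note the limitation: a high value only says that the tree is faithful to the chosen distances. It does not say whether the distance or the scale was appropriate in the first place, so a tree that splits off the extreme genes can still score well.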