[R] Hidden Problems with Clustering Algorithms

Tue Nov 22 04:28:07 CET 2022

Dear R-Users,

I recently came across a presentation on hierarchical clustering. 
Unfortunately, it illustrates a hidden problem of clustering algorithms. 
The problem runs deeper than this one example, and I think it warrants a 
closer inspection by the statistical community.

The presentation is available online. Both the scaled & non-scaled 
versions show the problem.

de.NBI course - Advanced analysis of quantitative proteomics data using 
R: 03b Clustering Part2
[Note: it's more like introductory notes to basic statistics]
https://www.youtube.com/watch?v=7e1uW_BhljA
times:
- at 6:15 - 6:28 & 6:29 - 7:10 [2 versions, both non-scaled]
- at 5:51 - 6:10 [the scaled version]
- same problem at 7:56;

PROBLEM

Non-Scaled Version: (e.g. the one at 6:15)
- the upper 2 rows are split into various sub-clusters;
- in the top tree, the right-most sub-tree (the right branch of the 
right branch) forms a cluster of some 17 "genes" (or similar 
"activities" / "expressions");
- the left-most 2 "genes" are actually over-expressed and functionally 
belong to that right sub-cluster;

Scaled Version: (at 5:52)
- the left-most 2 "genes" are over-expressed at the same time as the 
right cluster, and not otherwise;

Unfortunately, the 2 over-expressed "genes" (outliers or extreme values) 
are split off from the relevant cluster and placed on a separate main 
branch of the top dendrogram. Merely swapping the left & right main 
branches in the top tree would only mask the problem. The 2 
pseudo-outliers are probably just the highest values in the larger 
cluster of over-expressed "genes" (all the dark genes should belong to 
the same cluster).

The middle sub-cluster (some 16 "genes") shows NO activity at all. The 
main branches of the top tree should split between this *NO*-activity 
cluster and the cluster showing activity (including the 2 massively 
over-expressed genes). The problem is present in the scaled version as 
well.

The hierarchical clustering algorithm fails here. I have not analysed 
the data, but several factors may contribute:
- "gene expression" or "activity" may not be linear, but exponential or 
follow some power law: a logarithmic (or some other) transformation 
might have helped;
- simple distances between clusters may be too crude;
- the variance in the low-activity (middle) cluster may be very low 
(almost 0!), while the variance in the high-activity cluster may be much 
higher: the Mahalanobis distance, or joining the sub-clusters based on 
some z/t-test that takes the different variances into account, may be 
more robust;
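
The first point can be sketched in base R. This is only a toy example: 
the expression levels, the group sizes (16 / 15 / 2), the 4 samples and 
the helper top_split() are all assumptions made up for illustration, not 
the data from the presentation. It compares the top-level split of a 
complete-linkage tree on the raw scale and on the log2 scale.

```r
# Hypothetical expression levels (NOT the course data)
level <- c(seq(1, 1.3, length.out = 16),   # 16 "genes" with no activity
           seq(20, 40, length.out = 15),   # 15 over-expressed "genes"
           400, 450)                       # 2 massively over-expressed "genes"
group <- rep(c("silent", "active", "extreme"), times = c(16, 15, 2))

# 4 hypothetical samples that differ only by a multiplicative factor
expr <- outer(level, c(1, 1.1, 0.9, 1.05))
rownames(expr) <- paste0("g", seq_along(level))

# Helper (hypothetical name): top-level split (k = 2) of a
# complete-linkage tree built on Euclidean distances between rows
top_split <- function(m) {
  cutree(hclust(dist(m), method = "complete"), k = 2)
}

raw_split <- top_split(expr)        # clustering on the raw scale
log_split <- top_split(log2(expr))  # clustering after a log2 transform

# Cross-tabulate the known groups against the two main branches
table(group, raw_split)
table(group, log_split)
```

In this constructed example the 2 extreme "genes" sit alone on one main 
branch on the raw scale, while after the log2 transform the main split is 
between the silent "genes" and all over-expressed ones. Whether the same 
would hold for the real data is something I have not checked.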

These questions should be addressed by more senior statisticians.

I hope that the presentation remains online as is, because the 
clustering problem is easy to see and to analyse there. It is 
practically impossible to detect and visualise such anomalies in a 
heatmap with 1,000 or 10,000 genes, or with 500-1000 samples, yet it is 
very obvious on this small heatmap.

I do not know whether there are any robust tools to validate the 
generated trees. Inspecting by eye a dendrogram with > 1,000 genes and 
hundreds of samples is futile.
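
One rough numerical check that is at least available in base R is the 
cophenetic correlation: the correlation between the original pairwise 
distances and the heights at which each pair is joined in the tree. The 
sketch below uses made-up toy data (an assumption, not the course data) 
and compares three linkage methods. It only measures how faithfully the 
tree reproduces the distances, not whether the distances themselves are 
sensible, so it is a partial answer at best.

```r
# Hypothetical toy data: 16 silent, 15 active and 2 extreme "genes",
# measured in 4 samples and log2-transformed (NOT the course data)
level <- c(seq(1, 1.3, length.out = 16),
           seq(20, 40, length.out = 15),
           400, 450)
expr <- log2(outer(level, c(1, 1.1, 0.9, 1.05)))

# Euclidean distances between "genes" (rows)
d <- dist(expr)

# Cophenetic correlation of the tree for several linkage methods
methods <- c("single", "complete", "average")
coph <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  cor(as.vector(d), as.vector(cophenetic(hc)))
})

round(coph, 3)  # values closer to 1: tree heights track the distances better
```

A low value would flag a tree that distorts the distances, but a high 
value does not prove that the main branches are the meaningful ones.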

Sincerely,

Leonard