[R] Similarity matrix

Wed Apr 11 13:53:46 CEST 2001

Thanks very much to Brian Ripley, Kaspar Pflugshaupt, and Jari Oksanen
for addressing this issue.

The S-Plus online help sheds no light on the issue.  The S-Plus
statistics manual has a lot of information on clustering, but it
covers only distance measures, since only a minority of the
clustering functions accept similarity measures.

Brian Ripley ran the test that I should have done, showing
that S-Plus hclust applies the simple translation
distance = 1 - similarity.
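
For R, where hclust accepts only a dissimilarity object, the same
translation can be sketched as follows.  This is an illustration
under assumptions, not a tested recipe: it uses R's built-in longley
data frame as a stand-in (not the S-Plus longley.y object from the
quoted test) and assumes the similarities are scaled to [0, 1].

```r
# Stand-in data: R's built-in longley data frame (an assumption for
# illustration; it is not the S-Plus longley.y used in the quoted test).
d <- dist(longley)        # Euclidean distances between the 16 rows
d <- d / max(d)           # rescale distances to [0, 1]

# Build a similarity matrix the way S-Plus would see it: sim = 1 - dist
sim <- 1 - as.matrix(d)

# R's hclust takes only dissimilarities, so translate back with 1 - sim
d_from_sim <- as.dist(1 - sim)

hc_dist <- hclust(d, method = "average")
hc_sim  <- hclust(d_from_sim, method = "average")

# The two trees should agree up to floating-point rounding
all.equal(hc_dist$height, hc_sim$height)
```

Note that as.dist() reads only the lower triangle of the matrix, so
the diagonal of 1 - sim does not affect the result.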

The kinds of similarities I routinely use are:
- pairwise squared Spearman rank correlation coefficients
- pairwise proportion of the time that two variables are
  missing on the same observation
- Hoeffding D nonparametric dependence index 
  (the scaling of which may be more problematic than the other two)
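
A minimal sketch of how the first two similarities might be fed to
hclust in R via 1 - similarity.  The data (R's built-in longley data
frame) and the randomly injected missing values are hypothetical and
serve only to make the example run; Hoeffding's D is left out because
it needs a function outside base R.

```r
x <- longley   # stand-in numeric data frame (an assumption)

# 1. Pairwise squared Spearman rank correlations between variables
sim_rho2 <- cor(x, method = "spearman")^2
hc_rho2  <- hclust(as.dist(1 - sim_rho2), method = "average")

# 2. Pairwise proportion of observations on which two variables are
#    both missing.  The NAs are injected at random for illustration.
set.seed(1)
y <- as.matrix(x)
y[sample(length(y), 20)] <- NA
na_ind <- is.na(y) * 1                    # 0/1 missingness indicators
sim_na <- crossprod(na_ind) / nrow(na_ind)
hc_na  <- hclust(as.dist(1 - sim_na), method = "average")
```

The diagonal of sim_na holds each variable's own missing fraction
rather than 1, but as.dist() ignores the diagonal, so this does not
affect the clustering.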

Thank you all,

Frank Harrell

Prof Brian Ripley wrote:
> 
> On Tue, 10 Apr 2001, Frank E Harrell Jr wrote:
> 
> > I frequently use hclust on a similarity matrix.  In R only a
> > distance matrix is allowed.  Is there a simple reliable
> > transformation of a similarity matrix that will result
> > in a distance matrix making hclust work the same as
> > S-Plus with a similarity matrix?  Venables & Ripley 3rd
> > edition implies that a simple reversal of values
> > will suffice.  Thanks -Frank
> 
> Testing with Splus 6.0 shows that dist = 1 - sim is used there, so the
> simple assumption is correct.
> 
> d <- dist(longley.y)
> d <- d/max(d)
> hclust(d, "ave")
> $merge:
>       [,1] [,2]
>  [1,]   -2   -4
>  [2,]   -6   -8
>  [3,]   -1   -3
>  [4,]  -14  -15
>  [5,]  -10  -11
>  [6,]   -5    2
>  [7,]   -9  -12
>  [8,]  -13    5
>  [9,]    1    3
> [10,]  -16    4
> [11,]   -7    7
> [12,]    8   10
> [13,]    6   11
> [14,]    9   13
> [15,]   12   14
> 
> $height:
>  [1] 0.006262043 0.011753372 0.014643545 0.022447014 0.030057803 0.046146438
>  [7] 0.047591522 0.061849713 0.087427750 0.106310219 0.123025045 0.153018638
> [13] 0.221579969 0.384352922 0.570969820
> 
> $order:
>  [1] 13 10 11 16 14 15  2  4  1  3  5  6  8  7  9 12
> 
> hclust(sim=1-d, method="ave")
> $merge:
>       [,1] [,2]
>  [1,]   -2   -4
>  [2,]   -6   -8
>  [3,]   -1   -3
>  [4,]  -14  -15
>  [5,]  -10  -11
>  [6,]   -5    2
>  [7,]   -9  -12
>  [8,]  -13    5
>  [9,]    3    1
> [10,]  -16    4
> [11,]   -7    7
> [12,]   10    8
> [13,]   11    6
> [14,]   13    9
> [15,]   14   12
> 
> $height:
>  [1] 0.9937379 0.9882466 0.9853565 0.9775530 0.9699422 0.9538536 0.9524085
>  [8] 0.9381503 0.9125723 0.8936898 0.8769749 0.8469813 0.7784200 0.6156471
> [15] 0.4290302
> 
> $order:
>  [1]  7  9 12  5  6  8  1  3  2  4 16 14 15 13 10 11
> 
> which is the same but expressed in similarities.
> 
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272860 (secr)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat