[R] Comparing membership of clusters

Fri Nov 27 10:32:21 CET 2009

Hello,

I'm taking several physiological measurements on participants (e.g.,
skin conductivity, heart rate, etc.). I know that those participants
belong to one of three groups (from another measurement), and I'm
looking to find the physiological measurement that best describes
group membership. The measurements are taken over several days and I
computed an lm() for each participant for each measurement and used
the regression coefficient as input for hclust(). After cutree(x,
k=3), I have a matrix whose columns hold the cluster indices for each
measurement. Now I need to assess which column is most similar to my
gold standard.
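
To make the procedure concrete, here is a minimal sketch of what I do
for a single measurement. The data are simulated and the names (dat,
heart_rate, day, slopes, groups) are made up for this example. I also
assume here that the coefficient of interest is the slope over days and
that dist() and hclust() are used with their defaults.

```r
set.seed(1)

# Hypothetical long-format data: 9 participants, 5 days, one measurement.
dat <- expand.grid(participant = 1:9, day = 1:5)
dat$heart_rate <- 70 + (dat$participant %% 3) * dat$day + rnorm(nrow(dat))

# One lm() per participant; keep the slope of the measurement over days.
slopes <- sapply(split(dat, dat$participant),
                 function(d) coef(lm(heart_rate ~ day, data = d))[["day"]])

# Cluster participants on their slopes and cut the tree into 3 groups.
groups <- cutree(hclust(dist(slopes)), k = 3)
groups
```

Repeating this for every measurement and binding the resulting vectors
together with cbind() gives a matrix with one column per measurement.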

Q1: What is the easiest and best way to abstract away from the group
labels (which are arbitrary, see below)?
Q2: Is there a statistic for the level of similarity (other than tallying)?

So, after the cutree(), I have

res <- matrix(c(1,1,1,2,1,3,2,1,1,1,2,1,1,3,1,3,3,3,1,2,1,1,1,2,1,3,1,1,2,2,2,1,3,1,2,2,1,
1,1,2,2,3,1,1,1,1,2,1,2,1,2,3,2,1,1,2,3,2,2,1,2,2,1,1,1,1,2,1,3,1,1,2,1,2,
2,1,2,1,2,1,3,1,2,2,3,1,2,1,2,2,1,1,1,1,1,2,3,3,1,1,1,1,1,1,1,2,3,3,1,1,2,
1,1,3,2,2,2), nrow=9)
colnames(res) <- LETTERS[1:13]

which has the cluster assignments for each measurement in the columns.
My gold standard is

gold <- c(1, 1, 2, 2, 3, 3, 1, 2, 3)

Now for each column in res, I want to see how similar it is to gold.
Note that exact matching on the label values is not correct, because
the gold standard could equally be expressed as c(a, a, b, b, c, c, a, b,
c), or even c(3, 3, 1, 1, 2, 2, 3, 1, 2). So the key is the fact that
participants (indices) 1, 2, and 7 belong together.
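
To illustrate what I mean by the labels being arbitrary, here is a
small sketch. The helper comembership() and the vector relabeled are
names I made up for this example. The idea is to look only at whether
two participants share a cluster, so the label values themselves drop
out.

```r
# Gold standard and a relabeled copy (same partition, different labels).
gold      <- c(1, 1, 2, 2, 3, 3, 1, 2, 3)
relabeled <- c(3, 3, 1, 1, 2, 2, 3, 1, 2)

# TRUE where two participants are in the same cluster, FALSE otherwise.
comembership <- function(x) outer(x, x, "==")

# Both vectors describe the same grouping, so the matrices are identical.
identical(comembership(gold), comembership(relabeled))

# Participants that share a cluster with participant 1.
which(gold == gold[1])
```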

I am most puzzled about how to do the matching / find the similarity
between each column and gold standard.
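
One candidate I can think of, which may well not be the best choice, is
the proportion of participant pairs on which two clusterings agree
(both put the pair together, or both keep it apart). As far as I know
this is usually called the Rand index. The function name
pair_agreement() is made up for this sketch, and col_A is simply the
first column of res typed out by hand.

```r
# Proportion of participant pairs treated the same way by both clusterings.
pair_agreement <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")   # pairs that share a cluster in a
  same_b <- outer(b, b, "==")   # pairs that share a cluster in b
  pairs  <- upper.tri(same_a)   # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

gold  <- c(1, 1, 2, 2, 3, 3, 1, 2, 3)
col_A <- c(1, 1, 1, 2, 1, 3, 2, 1, 1)   # first column of res

# A relabeled copy of gold agrees on every pair, so the score is 1.
pair_agreement(gold, c(3, 3, 1, 1, 2, 2, 3, 1, 2))

# Score for one measurement; values lie between 0 and 1.
pair_agreement(gold, col_A)
```

With res and gold as defined above, apply(res, 2, pair_agreement, b = gold)
should give one such score per column. This raw proportion is not
corrected for chance agreement, so I am not sure how to judge whether a
given value is high.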

Thank you for your time!
best regards,
Paul Lemmens