[R] calculating similarity/distance among hierarchically classified items
friendly at yorku.ca
Thu Apr 6 18:16:50 CEST 2006
This is a question about how to calculate similarities/distances
among items that are classified by hierarchical attributes
for the purpose of visualizing the relations among items by means
of clustering, MDS, self-organizing maps, and so forth.
I have a set of ~260 items that have been classified using two sets of
hierarchically-organized codes on the basis of form and content. The
data look like the sample below, where the last two variables (ITEMFORM and
ITEMCONTENT) are each a ';'-separated list of codes assigned to each
item. The items are identified by the KEY variable. (Other fields are
The codes are hierarchical in the sense that, e.g.,
  C321 corresponds to the levels in a tree,
      Commerce (C3) > Internal (C32) > Labour (C321)
  F5G corresponds to
      Diagram (F5) > Nomogram (F5G)
so the number of characters in a code is the level in the tree.
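
To make the coding scheme concrete, here is a minimal sketch (in Python, standard library only) of how a code could be unpacked into its chain of parent codes. The function names `ancestors` and `level` and the `min_len` parameter are hypothetical. The value min_len=2 is an assumption read off the two examples, where the shortest code shown (C3, F5) has two characters.

```python
def ancestors(code, min_len=2):
    """Return the chain of codes from the top of the tree down to `code`.

    min_len=2 is an assumption taken from the examples in the post,
    where the shortest codes shown (C3, F5) have two characters.
    """
    return [code[:n] for n in range(min_len, len(code) + 1)]


def level(code):
    # Following the post: the number of characters in a code is its level.
    return len(code)


print(ancestors("C321"))  # ['C3', 'C32', 'C321']
print(ancestors("F5G"))   # ['F5', 'F5G']
```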
There are about 290 distinct codes, each used anywhere from once to about
40 times, so the data could be rearranged into a 260 x 290 incidence matrix
of items x codes. In computing similarities between items, however, all the
measures I know of for binary attribute data treat the attributes as nominal,
and so ignore the hierarchical nature of the codes.
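
A small sketch of the incidence-matrix idea, and of why a nominal measure loses the hierarchy, might look like the following. The item keys and the codes C322, C41 and F2 are invented for illustration (only C321 and F5G appear above), and the Jaccard coefficient is used simply as one familiar example of a binary similarity measure, not as a recommendation.

```python
# Hypothetical items: the keys and most of the codes are invented for
# illustration; only C321 and F5G come from the post itself.
items = {
    "item1": "C321;F5G",
    "item2": "C322;F5G",
    "item3": "C41;F2",
}


def code_set(field):
    """Split a ';'-separated field into a set of codes."""
    return {c.strip() for c in field.split(";") if c.strip()}


def incidence(items):
    """Return (keys, codes, matrix); matrix[i][j] is 1 if item i has code j."""
    keys = sorted(items)
    sets = {k: code_set(v) for k, v in items.items()}
    codes = sorted(set().union(*sets.values()))
    matrix = [[int(c in sets[k]) for c in codes] for k in keys]
    return keys, codes, matrix


def jaccard(row_a, row_b):
    """Jaccard similarity of two 0/1 rows: shared codes / codes in either."""
    both = sum(a and b for a, b in zip(row_a, row_b))
    either = sum(a or b for a, b in zip(row_a, row_b))
    return both / either if either else 0.0


keys, codes, m = incidence(items)
print(codes)
print(jaccard(m[0], m[1]))  # item1 vs item2: only F5G counts as shared
print(jaccard(m[0], m[2]))  # item1 vs item3: nothing shared
```

In this toy data, item1 and item2 differ only in C321 versus C322, which are siblings under C32, yet the measure treats those two codes as no more alike than C321 and F2.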
To take that into account, the 0/1 values could be replaced by the
tree level values (0=NA, 1..5) of the codes in each column. Then some
measure of similarity could be computed based on the profiles for each
pair of items.
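
As a rough sketch of that recoding, each row below holds the tree level of a code where the item has it and 0 otherwise. Two assumptions are made here: the level is taken to be the number of characters in the code, as stated above, and the NA value is stored as a numeric 0. The code C322, the helper names, and the use of Euclidean distance are for illustration only, since the choice of measure is exactly what is in question.

```python
def level_profile(field, codes):
    """Row of tree levels: len(code) where the item has the code, else 0."""
    present = {c.strip() for c in field.split(";") if c.strip()}
    return [len(c) if c in present else 0 for c in codes]


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


# C322 is a hypothetical sibling of C321, invented for this example.
codes = ["C321", "C322", "F5G"]
p1 = level_profile("C321;F5G", codes)  # [4, 0, 3]
p2 = level_profile("C322;F5G", codes)  # [0, 4, 3]
print(p1, p2, euclidean(p1, p2))
```

With this recoding a mismatch on a deeper code contributes more to the distance than a mismatch on a shallow one, which may or may not be the desired behaviour.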
But I don't know what measure (Gower's, Euclidean, etc.) would be most (or at
least arguably) appropriate here. Is this a situation that anyone recognizes?
Or, maybe there is another way to approach this. I'd appreciate any
Michael Friendly Email: friendly at yorku.ca
Professor, Psychology Dept.
York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT M3J 1P3 CANADA