[R] calculating similarity/distance among hierarchically classified items

Michael Friendly friendly at yorku.ca
Thu Apr 6 18:16:50 CEST 2006

This is a question about how to calculate similarities/distances
among items that are classified by hierarchical attributes
for the purpose of visualizing the relations among items by means
of clustering, MDS, self-organizing maps, and so forth.

I have a set of ~260 items that have been classified using two sets of
hierarchically-organized codes on the basis of form and content.  The
data looks like that below, where the last two variables (ITEMFORM and
ITEMCONTENT) are each a ';' separated list of codes assigned to each
item. The items are identified by the KEY variable. (Other fields are
ignored here.)


The codes are hierarchical in the sense that, e.g.,
C321 corresponds to the levels in a tree,
    Commerce (C3) > Internal (C32) > Labour (C321)
F5G corresponds to
    Diagram (F5) > Nomogram (F5G)
so the number of characters in a code is the level in the tree.
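
To make the prefix structure concrete, here is a minimal sketch (in Python, purely for illustration) of how a code could be expanded into its chain of ancestors. It assumes, based only on the two examples above, that the shortest meaningful code has two characters and that every longer prefix names a node on the path; the function name `ancestors` and the `min_len` parameter are hypothetical.

```python
def ancestors(code, min_len=2):
    """Return the path from the top-level code down to `code` itself.

    Assumption: every prefix of length >= min_len is a node in the tree,
    as in C3 > C32 > C321.
    """
    return [code[:n] for n in range(min_len, len(code) + 1)]


path_content = ancestors("C321")   # Commerce > Internal > Labour
path_form = ancestors("F5G")       # Diagram > Nomogram
print(path_content, path_form)
```
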

There are about 290 distinct codes, with frequency of use varying from 1 to
about 40, so the data could be rearranged into a 260 x 290 incidence matrix
of items x codes.  In computing similarities between items, all measures
I know of for binary attribute data treat the attributes as nominal, and
so ignore the hierarchical nature of the codes. 
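
As a rough sketch of that rearrangement, the following builds a small items x codes incidence matrix from ';'-separated code lists and computes a Jaccard similarity on it. The item keys and most of the codes (C322, C11, F2) are made up for illustration and are not taken from the real data; Jaccard is used only as one familiar example of a nominal binary measure.

```python
# Hypothetical items: KEY -> ';'-separated codes (form and content pooled).
items = {
    "item1": "C321;F5G",
    "item2": "C322;F5G",
    "item3": "C11;F2",
}


def code_set(field):
    """Split a ';'-separated field into a set of non-empty codes."""
    return {c.strip() for c in field.split(";") if c.strip()}


sets = {key: code_set(field) for key, field in items.items()}
codes = sorted(set().union(*sets.values()))  # column order of the matrix

# One 0/1 row per item, one column per distinct code.
incidence = {key: [1 if c in s else 0 for c in codes] for key, s in sets.items()}


def jaccard(a, b):
    """Jaccard similarity of two code sets, treating codes as nominal."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sim_12 = jaccard(sets["item1"], sets["item2"])
sim_13 = jaccard(sets["item1"], sets["item3"])
```

In this toy example item1 and item2 are credited only for the shared F5G; the fact that C321 and C322 sit under the same parent (C32) contributes nothing, which is the limitation described above.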

To take that into account, the 0/1 values could be replaced by the
tree level values (0=NA, 1..5) of the codes in each column.  Then some
measure of similarity could be computed based on the profiles for each
pair of items. 
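
A minimal sketch of that recoding, assuming the level is taken literally as the number of characters in the code (an offset may be needed to get the 1..5 range); the column order and the helper names `level` and `level_profile` are hypothetical.

```python
def level(code):
    """Tree level of a code, taken here as its number of characters."""
    return len(code)


def level_profile(field, codes):
    """Row of the recoded matrix: level of each code present, 0 if absent."""
    present = {c.strip() for c in field.split(";") if c.strip()}
    return [level(c) if c in present else 0 for c in codes]


codes = ["C321", "C32", "F5", "F5G"]          # hypothetical column order
profile = level_profile("C321;F5G", codes)
print(profile)
```

Note that under this recoding each column still takes only two values (0 or that column's level), so it amounts to weighting the columns of the incidence matrix by depth rather than encoding which codes share a parent.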

But I don't know what measure (Gower's, Euclidean, etc.) would be most
appropriate here, or at least arguably so. Is this a situation that anyone recognizes?
Or, maybe there is another way to approach this.  I'd appreciate any

Michael Friendly     Email: friendly at yorku.ca 
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA
