[R] calculating similarity/distance among hierarchically classified items

Thu Apr 6 18:16:50 CEST 2006

This is a question about how to calculate similarities/distances
among items that are classified by hierarchical attributes
for the purpose of visualizing the relations among items by means
of clustering, MDS, self-organizing maps, and so forth.

I have a set of ~260 items that have been classified using two sets of
hierarchically-organized codes on the basis of form and content.  The
data look like the sample below, where the last two variables (ITEMFORM and
ITEMCONTENT) are each a ';'-separated list of codes assigned to each
item. The items are identified by the KEY variable. (Other fields are
ignored here.)

KEY,YEAR,WHERE,CONTENT,FORM,ITEMFORM,ITEMCONTENT
1782Fourcroy,1782,Eur,Hdem,Stats,F5;F5K;F5N;F5N1,C8;C82
1785Crome,1785,Eur,Pdem,Stats,F5;F5N;F5N1,C7
1786Playfair,1786,Eur,Hdem,Stats,F6;F68;F69;F61;F62,C3;C32;C321;C323
1787Chladni,1787,Eur,Other,Other,F5;F55;FH;FD;FD3,C9;C95
1794Buxton,1794,Eur,Other,Tech,F3;F31;F7;F72;F722,C9;C9A
1795Pouchet,1795,Eur,Math,Stats,F5;F5G;FG;FG7,C2
1796Watt,1796,Eur,Pdem,Tech,FGB,C7;C9;C9A
1798Senefelder,1798,Eur,Other,Tech,FB;F5,C9;C97
 ...

The codes are hierarchical in the sense that, e.g.,
C321 corresponds to the levels in a tree,
    Commerce (C3) > Internal (C32) > Labour (C321)
F5G corresponds to
    Diagram (F5) > Nomogram (F5G)
so the number of characters in a code is the level in the tree.
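
To make the prefix structure concrete, here is a minimal sketch in Python
(my choice for illustration; nothing here depends on the language). The
helper names code_level and ancestors are hypothetical, and the sketch
assumes that every ancestor of a code is simply one of its prefixes,
starting from the two-character top-level code as in the examples above.

```python
def code_level(code: str) -> int:
    """Tree level of a code, taken as its number of characters."""
    return len(code)


def ancestors(code: str) -> list[str]:
    """Path from the two-character top-level code down to the code itself.

    Assumes every ancestor is a prefix of the code, as in
    C3 > C32 > C321.
    """
    return [code[:i] for i in range(2, len(code) + 1)]


print(ancestors("C321"))   # ['C3', 'C32', 'C321']
print(code_level("F5G"))   # 3
```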

There are about 290 distinct codes, with varying frequency of use,
from 1 to about 40, so the data could be rearranged into a 260x290 incidence matrix
of items x codes.  In computing similarities between items, all measures
I know of for binary attribute data treat the attributes as nominal, and
so ignore the hierarchical nature of the codes. 
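
As an illustration of that rearrangement, the following Python sketch builds
the items x codes incidence matrix from three of the sample rows and computes
a Jaccard similarity, one of the usual binary measures. The function names
(read_items, incidence, jaccard) are hypothetical, and pooling the form and
content codes into one set per item is an assumption made only for brevity.
Every code counts equally here, which is exactly the nominal treatment
described above.

```python
import csv
import io

# Three rows copied from the sample data in the post.
SAMPLE = """KEY,YEAR,WHERE,CONTENT,FORM,ITEMFORM,ITEMCONTENT
1782Fourcroy,1782,Eur,Hdem,Stats,F5;F5K;F5N;F5N1,C8;C82
1785Crome,1785,Eur,Pdem,Stats,F5;F5N;F5N1,C7
1796Watt,1796,Eur,Pdem,Tech,FGB,C7;C9;C9A
"""


def read_items(text):
    """Map each KEY to the set of form and content codes assigned to it."""
    items = {}
    for row in csv.DictReader(io.StringIO(text)):
        codes = row["ITEMFORM"].split(";") + row["ITEMCONTENT"].split(";")
        items[row["KEY"]] = {c for c in codes if c}
    return items


def incidence(items):
    """Return (sorted codes, {key: 0/1 row}) for the items x codes matrix."""
    codes = sorted(set().union(*items.values()))
    rows = {k: [int(c in v) for c in codes] for k, v in items.items()}
    return codes, rows


def jaccard(a, b):
    """Jaccard similarity of two code sets; the hierarchy is ignored."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


items = read_items(SAMPLE)
codes, rows = incidence(items)
print(codes)
print(rows["1785Crome"])
print(jaccard(items["1782Fourcroy"], items["1785Crome"]))
```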

To take that into account, the 0/1 values could be replaced by the
tree level values (0=NA, 1..5) of the codes in each column.  Then some
measure of similarity could be computed based on the profiles for each
pair of items. 
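
A rough sketch of that recoding, again in Python with hypothetical names
(level_profile, euclidean): each assigned code is replaced by its tree level,
taken as the number of characters in the code, and unassigned codes get 0.
Euclidean distance is used only as a placeholder to show the mechanics, not
as a recommendation. Note that it treats the 0 for "not assigned" as an
ordinary number rather than as NA, which is itself an assumption.

```python
import math

# Code sets for two items, copied from the sample rows in the post.
items = {
    "1782Fourcroy": {"F5", "F5K", "F5N", "F5N1", "C8", "C82"},
    "1785Crome": {"F5", "F5N", "F5N1", "C7"},
}

# Column order of the items x codes matrix.
codes = sorted(set().union(*items.values()))


def level_profile(assigned, codes):
    """Row of tree levels: len(code) where the code is assigned, else 0."""
    return [len(c) if c in assigned else 0 for c in codes]


def euclidean(p, q):
    """Plain Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


profiles = {k: level_profile(v, codes) for k, v in items.items()}
d = euclidean(profiles["1782Fourcroy"], profiles["1785Crome"])

print(codes)
print(profiles)
print(d)
```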

But I don't know which measure (Gower's, Euclidean, etc.) would be the most
appropriate here, or at least arguably so. Is this a situation that anyone recognizes?
Or, maybe there is another way to approach this.  I'd appreciate any
suggestions.

-- 
Michael Friendly     Email: friendly at yorku.ca 
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA