[R-sig-eco] Clustering estimated regression coefficients

Tue Oct 23 15:58:14 CEST 2012

Chris,

I think hclust will work fine; you just have to construct an appropriate distance matrix from the coefficients.  A major part of the beauty of distance-based analyses is that you have to choose a distance measure.  When you think hard about that choice, the analysis better represents what you want to know.

First, variables absent from a model have 0 values for their coefficients, not NA.  A model that omits X3 says there is no relationship between Y and X3, hence 0 for that beta.
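
As a minimal sketch of that coding, assuming the models are lm fits held in a list: the data, the model formulas and the names d, fits, vars and B below are all made up for illustration, not taken from your analysis.

```r
# Hypothetical data and models, for illustration only.
set.seed(1)
d <- data.frame(X1 = rnorm(30), X2 = rnorm(30), X3 = rnorm(30))
d$Y <- 1 + 2 * d$X1 - d$X2 + rnorm(30)

fits <- list(m1 = lm(Y ~ X1 + X2, data = d),
             m2 = lm(Y ~ X1 + X3, data = d),
             m3 = lm(Y ~ X2, data = d))

vars <- c("X1", "X2", "X3")

# One row per model, one column per candidate variable.  Starting from 0
# means a variable omitted from a model keeps a coefficient of 0, not NA.
B <- matrix(0, nrow = length(fits), ncol = length(vars),
            dimnames = list(names(fits), vars))
for (m in names(fits)) {
  b <- coef(fits[[m]])
  b <- b[names(b) %in% vars]   # drop the intercept
  B[m, names(b)] <- b
}
B
```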

What would be an appropriate distance matrix for your goals?  Those are: "I would like to cluster the data so that observations are grouped according to: the similarity in the direction of the relationship (i.e. +ve or -ve) and the presence/absence of variables."
I'm not sure whether these are two criteria leading to two clusterings or a single aggregate criterion.  It is easiest if they are two separate criteria.  For similarity in direction, use a simple matching coefficient (the proportion of coefficients with the same sign), ignoring any coefficient that is 0 in either model (temporarily recoding those zeros as NA may simplify the computation).  For presence/absence, again use a simple matching coefficient (the proportion of variables present in both models or absent from both).  Each of these is a similarity, so use 1 minus it as the distance.  You could easily elaborate on these distance measures, e.g. by incorporating the magnitude of the coefficients.  That's where the thinking becomes really important.
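
One way the two distances could be computed is sketched below.  The coefficient matrix B and the helper names pairwise_dist, dir_dist and pa_dist are hypothetical; both distances are 1 minus the simple matching coefficient, i.e. the proportion of mismatches.  What to do with a pair of models that share no variables is left as NA here, which is an assumption you would have to replace with a deliberate choice before clustering.

```r
# Hypothetical coefficient matrix: rows are models, columns are variables,
# and 0 means the variable is absent from that model.
B <- rbind(m1 = c(X1 =  1.2, X2 = -0.5, X3 = 0.0),
           m2 = c(X1 =  0.8, X2 =  0.3, X3 = 0.0),
           m3 = c(X1 = -0.4, X2 =  0.0, X3 = 2.1),
           m4 = c(X1 =  0.0, X2 = -0.9, X3 = 1.5))

# Apply a two-model distance function f to every pair of rows.
pairwise_dist <- function(B, f) {
  n <- nrow(B)
  D <- matrix(0, n, n, dimnames = list(rownames(B), rownames(B)))
  for (i in seq_len(n)) for (j in seq_len(n)) D[i, j] <- f(B[i, ], B[j, ])
  as.dist(D)
}

# Direction: among variables present in both models, the proportion
# whose coefficients have opposite signs.
dir_dist <- function(a, b) {
  both <- a != 0 & b != 0
  if (!any(both)) return(NA_real_)   # assumption: undefined, decide by hand
  mean(sign(a[both]) != sign(b[both]))
}

# Presence/absence: the proportion of variables present in exactly one
# of the two models.
pa_dist <- function(a, b) mean((a != 0) != (b != 0))

D_dir <- pairwise_dist(B, dir_dist)
D_pa  <- pairwise_dist(B, pa_dist)
```

Here m1 and m2 contain the same variables, so their presence/absence distance is 0, but they disagree on the sign of X2, so their direction distance is 0.5.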

Once you have an appropriate distance matrix (or matrices), then you turn the hclust crank.
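
A small sketch of that last step, using a made-up presence/absence pattern P.  Dividing the Manhattan distance by the number of variables gives the proportion of mismatches, i.e. 1 minus the simple matching coefficient.  Average linkage and cutting into two groups are arbitrary choices for the illustration, not recommendations.

```r
# Hypothetical presence/absence pattern (1 = variable is in the model).
P <- rbind(m1 = c(1, 1, 0),
           m2 = c(1, 1, 0),
           m3 = c(1, 0, 1),
           m4 = c(0, 0, 1))

# Proportion of variables on which two models disagree.
D <- dist(P, method = "manhattan") / ncol(P)

hc <- hclust(D, method = "average")   # linkage method is an assumption
groups <- cutree(hc, k = 2)           # so is the number of groups
groups
# plot(hc) would draw the dendrogram
```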

Philip Dixon