# [R] Re: cluster summary score

Jonathan Baron baron at cattell.psych.upenn.edu
Thu Aug 8 15:37:05 CEST 2002

On 08/08/02 13:23, Huan Huang wrote:
>Dear Prof. Harrell and R list,
>
>I have done the variable clustering and summary scores. Thanks a lot for
>your help.
>
>But it hasn't solved the collinearity problem in my dataset. After the
>clustering and transcan, there is still very strong collinearity between the
>summary scores. The objective of my project is to find out the influential
>variables. I believe any variable resuction is not appropriate when the
>collinearity exists. I am thinking about the principal component regression
>and variable reduction based on it (Rudolf J. Freund and William J. Wilson
>(1998), P215).
>
>Does anybody have suggestions on the variable resuction under this condition?
>I will appreciate any kind of information.

I'm not sure what you mean by "resuction," but when I and many
other psychologists face this kind of problem - reducing a set of
variables - we often use factor analysis.  A good program is
factanal() in the mva library.  Varimax rotation (the default)
usually picks out a sensible set of factors, although of course
you can sort the loadings if you want.  (Look at the various
options for factanal().)

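A minimal sketch of what I mean (in current R, factanal() lives in the stats package, which is loaded by default; in 2002-era R it was in the mva library).  The data frame here is just simulated stand-in data, and the number of factors is a guess you would adjust:

```r
## Simulated stand-in data: 200 observations on 6 candidate variables
set.seed(1)
mydata <- as.data.frame(matrix(rnorm(200 * 6), ncol = 6))

## Maximum-likelihood factor analysis; varimax rotation is the default
fa <- factanal(mydata, factors = 2, rotation = "varimax")

## Sorting the loadings groups variables by the factor they load on;
## cutoff suppresses small loadings so the structure is easier to read
print(fa$loadings, cutoff = 0.3, sort = TRUE)
```

With real (correlated) data, each column of the sorted loadings matrix suggests one group of variables.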
There are no fixed rules for this sort of thing.  Sometimes one
variable winds up in the wrong place by chance.  The strategy I
use is to figure out a sensible grouping of variables before I
use them to predict anything, so that I am not biased by knowing
the results.  So I feel free to move or remove variables that
don't make sense.  Some people may prefer a more rigid approach,
which further reduces the temptation to cheat.

Having found the grouping of variables, you can do three
different things:

1. Define "scores" by simply adding up the (standardized?) scores
of the variables in each group (with high loadings in the same
factor, perhaps).

2. Use the factor scores themselves as variables.

3. Use a single representative variable from each group.  This
seems to be what you were suggesting, but I'm having trouble
thinking of a situation where this would be better than #1 or
#2.

Whatever you do, you need to figure out how many groups, and
prcomp() or princomp() is often helpful here.  (And take a look
at biplot(), a really nice tool for looking at the first two
principal components.)  The factanal() function also reports a
chi-square fit statistic.  So in principle you could use that to
figure out how many factors there are.  However, that method
usually gives more factors than are meaningful, especially when
you have a large data set.
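A sketch of both approaches to choosing the number of groups, again on simulated stand-in data:

```r
set.seed(2)
mydata <- as.data.frame(matrix(rnorm(200 * 6), ncol = 6))

## Principal components: look at how variance falls off
pc <- prcomp(mydata, scale. = TRUE)
summary(pc)     # proportion of variance per component
biplot(pc)      # variables and observations on the first two components

## factanal() chi-square test of "2 factors are sufficient";
## a large p-value means 2 factors fit, but with big samples this
## test tends to favor more factors than are meaningful
fa <- factanal(mydata, factors = 2)
fa$PVAL
```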

--
Jonathan Baron, Professor of Psychology, University of Pennsylvania