[R] Re: cluster summary score

Huan Huang huang at stats.ox.ac.uk
Thu Aug 8 14:23:05 CEST 2002

```
Dear Prof. Harrell and R list,

I have done the variable clustering and summary scores. Thanks a lot for

But it hasn't solved the collinearity problem in my dataset. After the
clustering and transcan, there is still very strong collinearity between the
summary scores. The objective of my project is to find out the influential
variables. I believe any variable reduction is not appropriate when the
collinearity exists. I am thinking about principal component regression
and variable reduction based on it (Rudolf J. Freund and William J. Wilson
(1998), P215).

Does anybody have a suggestion for variable reduction under this condition?
I would appreciate any information.

Best

Huan
----- Original Message -----
From: "Frank E Harrell Jr" <fharrell at virginia.edu>
To: "Huan Huang" <huang at stats.ox.ac.uk>
Sent: Sunday, August 04, 2002 7:56 PM
Subject: Re: cluster summary score

> On Sun, 4 Aug 2002 19:48:22 +0100
> Huan Huang <huang at stats.ox.ac.uk> wrote:
>
> >
> >
> > > This was just done by
> > >
> > > f <- lrm(y ~ all cluster summary scores)
> > > fastbw(f, suitable stopping criteria)
> >
> > Thank you very much for your kind reply. But I don't know how to get the
> > cluster summary score.
> >
> > I did:
> > t <- transcan(x, transform = T)
> > t$transform
> >
> > I got a new matrix, with the transformed value for each variable. How can I
> > get the cluster summary scores?
>
> You see the little pc1 function I defined in Hmisc?  I just do things like
>
> p1 <- pc1(t$transform) or pc1(t$transform[,c(3,5,7)]) to use variables 3,5,7
>
> Frank
>
> >
> > Huan
> >
> > >
> > > Doing the fast backward stepdown is safer with cluster scores than with
> > > raw variables, especially if you use conservative stopping criteria (e.g.,
> > > large alpha).  I allowed "highly insignificant" cluster scores to be
> > > dropped, and did not ever look at their component variables again.
> > >
> > > Frank
> > >
> > > >
> > > > Actually I am doing my thesis project. My explanatory variables have
> > > > serious collinearity. I have used the functions transcan and varclus on
> > > > the variables and found some clusters. I am trying to use the method
> > > > introduced in this section to drop some variables. I want to know how
> > > > you carry out the cluster summary scores.
> > > >
> > > > Thanks a lot and looking forward to hearing from you.
> > > >
> > > > Huan
> > > > ----- Original Message -----
> > > > From: "Frank E Harrell Jr" <fharrell at virginia.edu>
> > > > To: <pmj at jciconsult.com>
> > > > Cc: <r-help at stat.math.ethz.ch>
> > > > Sent: Sunday, August 04, 2002 4:36 PM
> > > > Subject: Re: [R] Pseudo R^2 for logit - really naive question
> > > >
> > > >
> > > > > The Nagelkerke R^2 is commonly used.  The lrm function in the Design
> > > > > library computes this for logistic regression.  The numerator is 1 -
> > > > > exp(-LR/n) where LR is the likelihood ratio chi-square stat and n is the
> > > > > total sample size.  Divide it by the maximum attainable value of this if
> > > > > the model is perfect (which is a simple function of the -2 log likelihood
> > > > > with an intercept-only model) to get Nagelkerke's R^2.  The numerator is
> > > > > exactly the ordinary R^2 in OLS, as LR = -n log(1-R^2) there.  For a more
> > > > > interpretable index and one that measures purely discrimination ability,
> > > > > the ROC area or "C index" which is essentially a Mann-Whitney statistic
> > > > > based on concordance probability is recommended.  The lrm function also
> > > > > outputs this or you can get it from the somers2 or rcorr.cens functions
> > > > > in the Hmisc library.
> > > > >
> > > > > Frank Harrell
> > > > >
> > > > > On Sun, 4 Aug 2002 09:08:46 -0400
> > > > > "Paul M. Jacobson" <pmj at jciconsult.com> wrote:
> > > > >
> > > > > > I am using GLM to calculate logit models based on cross-sectional
> > > > > > data. I am now down to the hard work of making the results
> > > > > > intelligible to very average readers.  Is there any way to calculate
> > > > > > a pseudo analogue to the R^2 in standard linear regression for use as
> > > > > > a purely descriptive statistic of goodness of fit? Most of the readers
> > > > > > of my report will be vaguely familiar and more comfortable with R^2
> > > > > > than with any other regression diagnostics.
> > > > > >
> > > > > > Paul M. Jacobson
> > > > > >
> > > > >
> > > >
> >
> >
> >
> > >
> > >
> >
>
>
```
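
The cluster summary score workflow discussed in the thread (take the first principal component of each variable cluster, then model the outcome on those scores) can be sketched roughly as follows. This is only an illustration under stated assumptions: `first_pc` is a hypothetical base-R stand-in built on `prcomp`, not the `pc1` function from Hmisc; the data are simulated; the cluster membership is assumed known rather than derived from `varclus`; and `glm` is used in place of `lrm`, with no backward stepdown step.

```r
# Hypothetical helper (an assumption, not Hmisc's pc1): returns the first
# principal component of the columns of x after centering and scaling.
first_pc <- function(x) {
  x <- as.matrix(x)
  fit <- prcomp(x, center = TRUE, scale. = TRUE)
  fit$x[, 1]
}

# Simulated data: two latent factors, each measured by two noisy variables,
# so the columns within each pair are strongly collinear.
set.seed(1)
n  <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
x  <- cbind(a1 = z1 + rnorm(n, sd = 0.3),
            a2 = z1 + rnorm(n, sd = 0.3),
            b1 = z2 + rnorm(n, sd = 0.3),
            b2 = z2 + rnorm(n, sd = 0.3))
y  <- rbinom(n, 1, plogis(z1))

# One summary score per (assumed) cluster of variables.
scores <- data.frame(y     = y,
                     clusA = first_pc(x[, c("a1", "a2")]),
                     clusB = first_pc(x[, c("b1", "b2")]))

# Logistic model on the cluster scores instead of the raw variables.
fit <- glm(y ~ clusA + clusB, family = binomial, data = scores)
print(summary(fit))
```

The sign of a principal component is arbitrary, so only the magnitude of a score's association with the outcome is meaningful until the loadings are inspected.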
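
The verbal description of Nagelkerke's R^2 in the quoted reply can be written out as below. The symbols are assumptions introduced here for illustration: LR is the likelihood ratio chi-square statistic, n is the total sample size, and L_0 denotes the -2 log likelihood of the intercept-only model.

```latex
% Numerator: the generalized R^2 described in the thread
R^2_{\mathrm{LR}} = 1 - \exp\!\left(-\frac{\mathrm{LR}}{n}\right)

% Its maximum attainable value for a perfect model, with
% L_0 = -2 \log L(\text{intercept-only model})
R^2_{\max} = 1 - \exp\!\left(-\frac{L_0}{n}\right)

% Nagelkerke's index rescales the numerator to the range [0, 1]
R^2_{N} = \frac{1 - \exp(-\mathrm{LR}/n)}{1 - \exp(-L_0/n)}

% Consistency check for OLS, where LR = -n \log(1 - R^2):
1 - \exp\!\left(-\frac{-n \log(1 - R^2)}{n}\right) = 1 - (1 - R^2) = R^2
```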