[R] Logistic regression problem
Liaw, Andy
andy_liaw at merck.com
Wed Oct 1 18:01:19 CEST 2008
From: Frank E Harrell Jr
>
> Bernardo Rangel Tura wrote:
> > On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
> >> Bernardo Rangel Tura wrote:
> >>> On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
> >>>> I have a huge data set with thousands of variables and one binary
> >>>> variable. I know that most of the variables are correlated and are
> >>>> not good predictors... but...
> >>>>
> >>>> It is very hard to start modeling with such a huge dataset. What
> >>>> would be your suggestion? How to make a first cut... how to
> >>>> eliminate most of the variables but not ignore potential
> >>>> interactions... for example, maybe variable A is not a good
> >>>> predictor and variable B is not a good predictor either, but maybe
> >>>> A and B together are a good predictor...
> >>>>
> >>>> Any suggestion is welcome.
> >>>
> >>> milicic.marko,
> >>>
> >>> I think you could start with rpart(binary_variable ~ ., data = yourdata).
> >>> This shows you a set of variables with which to start a model, and
> >>> starting cutoff values for the continuous variables.
> >> I cannot imagine a worse way to formulate a regression model.
> >> Reasons include:
> >>
> >> 1. Results of recursive partitioning are not trustworthy unless the
> >> sample size exceeds 50,000 or the signal-to-noise ratio is extremely
> >> high.
> >>
> >> 2. The type I error of tests from the final regression model will be
> >> extraordinarily inflated.
> >>
> >> 3. False interactions will appear in the model.
> >>
> >> 4. The cutoffs so chosen will not replicate and in effect assume that
> >> covariate effects are discontinuous and piecewise flat. The use of
> >> cutoffs results in a huge loss of information and power and makes the
> >> analysis arbitrary and impossible to interpret (e.g., a high
> >> covariate value : low covariate value odds ratio or mean difference
> >> is a complex function of all the covariate values in the sample).
> >>
> >> 5. The model will not validate in new data.
> >
> > Professor Frank,
> >
> > Thank you for your explanation.
> >
> > Well, if my first idea is wrong, what is your opinion on the
> > following approach?
> >
> > 1- Do PCA on the data, excluding the binary variable
> > 2- Put the principal components in a logistic model
> > 3- Afterwards, map the principal components back to the original
> > variables (only if that is of interest to milicic.marko)
> >
> > If this approach is wrong too, what would your approach be?
>
>
> Hi Bernardo,
>
> If there is a large number of potential predictors and no previous
> knowledge to guide the modeling, principal components (PC) is often an
> excellent way to proceed. The first few PCs can be put into the model.
> The result is not always very interpretable, but you can "decode" the
> PCs by using stepwise regression or recursive partitioning (which are
> safer in this context because the stepwise methods are not exposed to
> the Y variable). You can also add PCs in a stepwise fashion in the
> pre-specified order of variance explained.
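>
> For instance, a rough sketch along these lines (untested; X is a
> placeholder data frame of numeric predictors, y the binary outcome):
>
> ## PCA on the predictors only; note that y is never used here
> pc  <- prcomp(X, scale. = TRUE)
> k   <- 5                               # number of PCs to keep
> fit <- glm(y ~ pc$x[, 1:k], family = binomial)
> ## "decode" the first PC in terms of the original predictors;
> ## safe in this context because y is not involved
> library(rpart)
> decode <- rpart(pc$x[, 1] ~ ., data = X)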
>
> There are many variations on this theme, including nonlinear principal
> components (e.g., the transcan function in the Hmisc package), which
> may explain more variance of the predictors.
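>
> Roughly (untested sketch; x1, x2, x3 stand in for the real predictor
> names in the placeholder data frame X above):
>
> library(Hmisc)
> ## transform each predictor nonlinearly, then do ordinary PCA on the
> ## transformed values
> tx <- transcan(~ x1 + x2 + x3, data = X, transformed = TRUE)
> pc.nl <- prcomp(tx$transformed, scale. = TRUE)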
While I agree with much of what Frank has said, I'd like to add some
points.

Variable selection is a treacherous business whether one is interested
in prediction or inference. If the goal is inference, Frank's book is a
must-read, IMHO. (It's great for predictive model building, too.)

If interaction is of high interest, principal components are not going
to give you that.
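
If you suspect a specific interaction, you can of course test it
directly; e.g. (untested; A, B, and y are placeholders):

fit1 <- glm(y ~ A + B, family = binomial)
fit2 <- glm(y ~ A * B, family = binomial)  # adds the A:B interaction
anova(fit1, fit2, test = "Chisq")          # likelihood ratio test
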
Regarding cutpoint selection: the machine learners have found that the
`optimal' split point for a continuous predictor in tree algorithms is
extremely variable, so much so that interpreting it would be risky at
best. Breiman essentially gave up on interpretation of a single tree
when he went to random forests.
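
One can see this with a small simulation (untested sketch): fit rpart()
to bootstrap resamples of the same data and watch the first split point
move around, even though the true cutpoint is fixed at 0.5:

library(rpart)
set.seed(1)
n <- 200
x <- runif(n)
y <- factor(rbinom(n, 1, ifelse(x > 0.5, 0.7, 0.3)))
splits <- replicate(100, {
    i   <- sample(n, replace = TRUE)
    fit <- rpart(y[i] ~ x[i], method = "class")
    if (is.null(fit$splits)) NA else fit$splits[1, "index"]
})
summary(splits)  # the `optimal' cutpoint wanders widely around 0.5
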
Best,
Andy
> Frank
> --
> Frank E Harrell Jr
> Professor and Chair, Department of Biostatistics
> School of Medicine, Vanderbilt University
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>