[R] Logistic regression problem

Bernardo Rangel Tura tura at centroin.com.br
Wed Oct 1 11:34:19 CEST 2008


On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
> Bernardo Rangel Tura wrote:
> > On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
> >> I have a huge data set with thousands of variables and one binary
> >> variable. I know that most of the variables are correlated and are not
> >> good predictors... but...
> >>
> >> It is very hard to start modeling with such a huge dataset. What would
> >> be your suggestion? How to make a first cut... how to eliminate most
> >> of the variables without ignoring potential interactions... for
> >> example, maybe variable A is not a good predictor and variable B is not
> >> a good predictor either, but maybe A and B together are a good
> >> predictor...
> >>
> >> Any suggestion is welcome
> > 
> > 
> > milicic.marko
> > 
> > I think you could start with rpart("binary variable" ~ .)
> > This shows you a set of variables to start a model with, and a starting
> > set of cutoffs for continuous variables
> 
> I cannot imagine a worse way to formulate a regression model.  Reasons 
> include
> 
> 1. Results of recursive partitioning are not trustworthy unless the 
> sample size exceeds 50,000 or the signal to noise ratio is extremely high.
> 
> 2. The type I error of tests from the final regression model will be 
> extraordinarily inflated.
> 
> 3. False interactions will appear in the model.
> 
> 4. The cutoffs so chosen will not replicate and in effect assume that 
> covariate effects are discontinuous and piecewise flat.  The use of 
> cutoffs results in a huge loss of information and power and makes the 
> analysis arbitrary and impossible to interpret (e.g., a high covariate 
> value:low covariate value odds ratio or mean difference is a complex 
> function of all the covariate values in the sample).
> 
> 5. The model will not validate in new data.

Professor Frank,

Thank you for your explanation.

Well, if my first idea is wrong, what is your opinion of the following
approach?

1- Run a PCA on the data, excluding the binary variable
2- Use the principal components as predictors in a logistic model
3- Afterwards, convert the principal components back to the original
variables (only if this is of interest to milicic.marko)
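
The three steps above could look roughly like the sketch below. It is only
an illustration on simulated data: the data, the names X, y, k and the
choice k = 5 are assumptions made for the example, not part of the original
problem, and the number of components would need to be fixed without
looking at the outcome.

```r
# Simulated stand-in data (assumption: the real data set is not available here)
set.seed(1)
n <- 500
p <- 50
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", seq_len(p))
y <- rbinom(n, 1, plogis(0.8 * X[, 1] - 0.5 * X[, 2]))

# 1- PCA on the predictors only; the binary outcome is not used
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep the first k components (k = 5 is an arbitrary illustrative choice)
k <- 5
dat <- data.frame(y = y, pca$x[, 1:k])

# 2- Logistic model with the component scores as predictors
fit <- glm(y ~ ., data = dat, family = binomial)

# 3- Express the fitted model in terms of the original variables
beta_pc   <- coef(fit)[-1]                             # coefficients of PC1..PCk
beta_std  <- drop(pca$rotation[, 1:k] %*% beta_pc)     # per SD of each variable
beta_orig <- beta_std / pca$scale                      # per original unit
int_orig  <- unname(coef(fit)[1]) - sum(beta_orig * pca$center)
```

With these quantities, int_orig + X %*% beta_orig reproduces the linear
predictor of the model fitted on the components, so step 3 only re-expresses
the same model; it does not select variables.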

If this approach is wrong too, what approach would you suggest?
-- 
Bernardo Rangel Tura, M.D,MPH,Ph.D
National Institute of Cardiology
Brazil


