[R] Logistic regression problem

Wed Oct 1 14:19:41 CEST 2008

Bernardo Rangel Tura wrote:
> On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
>> Bernardo Rangel Tura wrote:
>>> On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
>>>> I have a huge data set with thousands of variables and one binary
>>>> variable. I know that most of the variables are correlated and are not
>>>> good predictors... but...
>>>>
>>>> It is very hard to start modeling with such a huge dataset. What would
>>>> you suggest? How do I make a first cut, eliminating most of the
>>>> variables without ignoring potential interactions? For example, maybe
>>>> variable A is not a good predictor and variable B is not a good
>>>> predictor either, but A and B together are a good predictor.
>>>>
>>>> Any suggestions are welcome.
>>>
>>> milicic.marko
>>>
>>> I think you could start with rpart("binary variable" ~ .).
>>> This shows you a set of variables to start a model with, and starting
>>> cutoffs for the continuous variables.
>> I cannot imagine a worse way to formulate a regression model.  Reasons 
>> include
>>
>> 1. Results of recursive partitioning are not trustworthy unless the 
>> sample size exceeds 50,000 or the signal to noise ratio is extremely high.
>>
>> 2. The type I error of tests from the final regression model will be 
>> extraordinarily inflated.
>>
>> 3. False interactions will appear in the model.
>>
>> 4. The cutoffs so chosen will not replicate and in effect assume that 
>> covariate effects are discontinuous and piecewise flat.  The use of 
>> cutoffs results in a huge loss of information and power and makes the 
>> analysis arbitrary and impossible to interpret (e.g., a high covariate 
>> value:low covariate value odds ratio or mean difference is a complex 
>> function of all the covariate values in the sample).
>>
>> 5. The model will not validate in new data.
> 
> Professor Frank,
> 
> Thank you for your explanation.
> 
> Well, if my first idea is wrong what is your opinion on the following
> approach?
> 
> 1- Run PCA on the data, excluding the binary variable
> 2- Put the principal components into a logistic model
> 3- Afterwards, map the principal components back to the original
> variables (only if that is of interest to milicic.marko)
> 
> If this approach is wrong too, what would your approach be?

Hi Bernardo,

If there is a large number of potential predictors and no previous 
knowledge to guide the modeling, principal components (PC) is often an 
excellent way to proceed.  The first few PCs can be put into the model.
The result is not always very interpretable, but you can "decode" the
PCs by using stepwise regression or recursive partitioning (which are 
safer in this context because the stepwise methods are not exposed to 
the Y variable).  You can also add PCs in a stepwise fashion in the 
pre-specified order of variance explained.
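
As a rough illustration of the steps above, here is a minimal sketch in
base R on simulated data.  Everything in it is an assumption made for the
example and not taken from the original poster's problem: the data, the
names (X, y, k, pcs, approx1), the choice of k = 5 components, and the use
of AIC to compare the nested models.

```r
# Simulated stand-in for the real data (hypothetical): 40 correlated
# predictors driven by two latent factors, plus one binary response.
set.seed(1)
n <- 500
p <- 40
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- sapply(seq_len(p), function(j) {
  if (j <= p / 2) f1 + rnorm(n, sd = 0.5) else f2 + rnorm(n, sd = 0.5)
})
colnames(X) <- paste0("x", seq_len(p))
y <- rbinom(n, 1, plogis(0.8 * f1 - 0.5 * f2))

# 1. Principal components of the predictors only; y is never used here.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# 2. Put the first k PCs into a logistic model (k = 5 is arbitrary).
k <- 5
pcs <- data.frame(y = y, pca$x[, 1:k])
fit <- glm(y ~ ., data = pcs, family = binomial)

# 3. Add PCs in the pre-specified order of variance explained:
#    nested models with PC1, then PC1 + PC2, and so on, compared by AIC.
aic_by_k <- sapply(seq_len(k), function(m) {
  d <- data.frame(y = y, pca$x[, 1:m, drop = FALSE])
  AIC(glm(y ~ ., data = d, family = binomial))
})

# 4. "Decode" PC1 by forward stepwise regression of PC1 on the original
#    variables.  The response here is PC1, not y, so the selection is not
#    exposed to the outcome.  steps = 5 caps the approximation at five
#    variables (again an arbitrary choice).
d1 <- data.frame(pc1 = pca$x[, 1], X)
null1 <- lm(pc1 ~ 1, data = d1)
approx1 <- step(null1,
                scope = list(lower = ~ 1, upper = reformulate(colnames(X))),
                direction = "forward", steps = 5, trace = 0)

print(summary(fit)$coefficients)
print(aic_by_k)
print(names(coef(approx1))[-1])    # variables chosen to approximate PC1
print(summary(approx1)$r.squared)  # how well they reproduce PC1
```

The R-squared from the last step indicates how much of PC1 a handful of
original variables can reproduce, which is one way to judge whether the
component has a simple interpretation.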

There are many variations on this theme including nonlinear principal 
components (e.g., the transcan function in the Hmisc package) which may 
explain more variance of the predictors.

Frank
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University