[R] Extreme AIC or BIC values in glm(), logistic regression

Thu Mar 19 08:57:49 CET 2009

On Thu, 19 Mar 2009, Maggie Wang wrote:

> Dear Thomas,
>
> Thank you very much for your answer!
>
> But why does this happen only for some models and not for all of them?
> That is, why can the selection drop some variables for the other models
> but not for this one?

Presumably the other models don't have perfect separation.  If you don't have enough data for reliable estimation you will get many models that predict poorly and a few that predict extremely well, just by chance.
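
For readers who want to see what perfect separation looks like in a fit, here is a minimal sketch in R. The data are a made-up toy example, not the poster's, and the names x, y, fit, p_hat and se are only for illustration. In the output quoted below the fit went further and diverged (residual deviance above the null deviance), which this toy case does not reproduce.

```r
# Toy data (hypothetical): y is 0 for x <= 5 and 1 for x > 5, so any cut
# point between 5 and 6 separates the two classes perfectly.
x <- 1:10
y <- as.numeric(x > 5)

# glm() usually warns on data like this (non-convergence and/or fitted
# probabilities numerically 0 or 1); suppressWarnings() keeps the example quiet.
fit <- suppressWarnings(glm(y ~ x, family = binomial))

p_hat <- fitted(fit)              # fitted probabilities
se    <- sqrt(diag(vcov(fit)))    # standard errors of the coefficients

# Symptoms of separation:
all(pmin(p_hat, 1 - p_hat) < 1e-4)  # every fitted probability is essentially 0 or 1
deviance(fit) < 1e-3                # residual deviance collapses towards 0
fit$iter                            # number of Fisher scoring iterations used
rbind(estimate = coef(fit), se = se)  # estimates with far larger standard errors
```

If the first two checks are TRUE and the standard errors dwarf the estimates, the p-values and AIC from that fit are not telling you much about the variables.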

      -thomas

> Thanks!!
>
> Best regards,
> Maggie
>
>
>
> On Wed, Mar 18, 2009 at 3:38 PM, Thomas Lumley <tlumley at u.washington.edu> wrote:
>>
>> With 30 variables and only 55 residual degrees of freedom you probably have
>> perfect separation due to not having enough data.  Look at the coefficients
>> -- they are effectively infinite (around 1e14 to 1e16 in size), implying
>> perfect overfitting.
>>
>>      -thomas
>>
>> On Wed, 18 Mar 2009, Maggie Wang wrote:
>>
>>> Dear R-users,
>>>
>>> I use glm() to do logistic regression and stepAIC() to do stepwise model
>>> selection.
>>>
>>> The AIC is usually about 100, and a good fit gets as low as about 70. But
>>> for some models the AIC goes to extreme values like 1000. When I check the
>>> p-values, all the independent variables (about 30 of them) included in the
>>> equation are highly significant, which is implausible, because we expect
>>> some to be dropped.  This situation is not uncommon.
>>>
>>> A summary output like this:
>>>
>>> Coefficients:
>>>                             Estimate Std. Error   z value Pr(>|z|)
>>> (Intercept)                   4.883e+14  1.671e+07  29217415   <2e-16 ***
>>> g761                         -5.383e+14  9.897e+07  -5438529   <2e-16 ***
>>> g2809                        -1.945e+15  1.082e+08 -17977871   <2e-16 ***
>>> g3106                        -2.803e+15  9.351e+07 -29976674   <2e-16 ***
>>> g4373                        -9.272e+14  6.534e+07 -14190077   <2e-16 ***
>>> g4583                        -2.279e+15  1.223e+08 -18640563   <2e-16 ***
>>> g761:g2809                   -5.101e+14  4.693e+08  -1086931   <2e-16 ***
>>> g761:g3106                   -3.399e+16  6.923e+08 -49093218   <2e-16 ***
>>> g2809:g3106                   3.016e+15  6.860e+08   4397188   <2e-16 ***
>>> g761:g4373                    3.180e+15  4.595e+08   6920270   <2e-16 ***
>>> g2809:g4373                  -5.184e+15  4.436e+08 -11685382   <2e-16 ***
>>> g3106:g4373                   1.589e+16  2.572e+08  61788148   <2e-16 ***
>>> g761:g4583                   -1.419e+16  8.199e+08 -17303033   <2e-16 ***
>>> g2809:g4583                  -2.540e+16  8.151e+08 -31156781   <2e-16 ***
>>> ........
>>> (omit)
>>> ........
>>>
>>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>
>>> (Dispersion parameter for binomial family taken to be 1)
>>>
>>>  Null deviance:  120.32  on 86  degrees of freedom
>>> Residual deviance: 1009.22  on 55  degrees of freedom
>>> AIC: 1073.2
>>>
>>> Number of Fisher Scoring iterations: 25
>>>
>>> Could anyone suggest what this means?  How can I perform a reliable
>>> logistic regression?
>>>
>>> Thank you so much for the help!
>>>
>>> Best Regards,
>>> Maggie
>>>
>>
>> Thomas Lumley                   Assoc. Professor, Biostatistics
>> tlumley at u.washington.edu        University of Washington, Seattle
>>
>>
>>
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
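
A quick arithmetic check on the summary quoted above, as a sketch only. The first part uses nothing but the reported numbers. The second part rests on an assumption that is not stated anywhere in the thread: that R keeps fitted probabilities away from exact 0 and 1 by about machine epsilon, so that each observation fitted on the wrong side adds roughly -2 * log(eps) to the deviance. The variable names are just for illustration.

```r
# Numbers copied from the quoted summary output.
null_df   <- 86
resid_df  <- 55
resid_dev <- 1009.22

# Coefficients in the model, including the intercept.
n_par <- null_df - resid_df + 1
n_par                          # 32

# For a 0/1 response the AIC is the residual deviance plus 2 per coefficient.
aic <- resid_dev + 2 * n_par
aic                            # 1073.22, in line with the reported AIC of 1073.2

# ASSUMPTION: a fitted probability on the wrong side is clipped near
# .Machine$double.eps, contributing about -2 * log(eps) to the deviance.
per_obs <- -2 * log(.Machine$double.eps)
per_obs                        # about 72.09
resid_dev / per_obs            # very close to 14
```

Under that assumption, the deviance of about 1000 would correspond to roughly 14 of the 87 observations being fitted with probability essentially 0 or 1 on the wrong side, which fits a divergent fit rather than a genuinely bad but finite one.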