[R] Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")

Marc Schwartz marc_schwartz at me.com
Wed Jun 6 17:43:43 CEST 2012


On Jun 6, 2012, at 9:36 AM, peter dalgaard wrote:

> 
> On Jun 6, 2012, at 10:59 , lincoln wrote:
> 
>> 
>> David Winsemius wrote
>>> 
>>> This is making me think you really have multiple observations on the
>>> same individuals (and that persons make transitions from one state to
>>> another as a result of the passage of time). That needs a more complex
>>> analysis than "simple" logistic regression. You might consider posting
>>> a more complete description of the study on the SIG Mixed Effects
>>> mailing list.
>>> 
>>> -- 
>>> David.
>>> 
>> 
>> No, I haven't. Individuals are birds marked with a unique alphanumeric code
>> that gives me information on their gender (sometimes I have this data,
>> sometimes I don't) and their birth date (and, as a consequence, their age).
>> There are no multiple observations of the same individual.
>> 
>> Anyway, I believe my main question has not been answered: when using
>> anova with test="Chisq" to compare two models, is the difference in deviance
>> between the two models interpretable as the Chi-square value, and the
>> difference in df interpretable as the df of the Chi-square test?
>> 
>> For instance, given:
>> 
>>> anova(mod4,update(mod4,~.-cohort),test="Chisq")
>> Analysis of Deviance Table
>> 
>> Model 1: site ~ cohort
>> Model 2: site ~ 1
>> Resid. Df Resid. Dev Df Deviance P(>|Chi|)    
>> 1       993     1283.7                          
>> 2      1002     1368.2 -9  -84.554 2.002e-14 ***
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
>> 
>> Is 84.554 taken as the Chi square value, 9 as the df of the test and the
>> p-value depending on these two values?
> 
> That's the general mechanism, yes. (Whether the chi-square distribution holds after variable selection is a more difficult issue. Frank Harrell might chime in and remind us that there are books on that subject.)
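That mechanism is easy to verify by hand: the p-value printed in the table above can be recomputed directly from the deviance difference (84.554) and the df difference (9) with pchisq(). The numbers below are taken from the output quoted above.

```r
## Recompute the p-value from the printed deviance difference and df.
chisq_val <- 84.554   # difference in deviance between the two models
df_val    <- 9        # difference in residual df
p_val <- pchisq(chisq_val, df = df_val, lower.tail = FALSE)
p_val   # matches the P(>|Chi|) of 2.002e-14 printed in the table above
```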



Frank might be busy with useR preparations for next week...

Quoting from Frank's book "Regression Modeling Strategies", page 58, in the context of variable selection, stepwise methods and stopping rules:

"The residual $\chi^2$ can be tested for significance (if one is able to forget that because of variable selection this statistic does not have a $\chi^2$ distribution), or the stopping rule can be based on Akaike's information criterion (AIC), here residual $\chi^2$ - 2 x d.f. Of course, use of more insight from knowledge of the subject matter will generally improve the modeling process substantially. It must be remembered that no currently available stopping rule was developed for data driven variable selection. Stopping rules such as AIC or Mallows' $C_p$ are intended for comparing only two \emph{prespecified} models."
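For GLMs with a Bernoulli response, the saturated log-likelihood is zero, so the deviance equals -2 log-likelihood and AIC reduces to deviance + 2 * (number of parameters). A minimal sketch with simulated data (not from this thread) illustrating an AIC comparison of two prespecified models:

```r
## Simulated data, for illustration only.
set.seed(2)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- rbinom(200, 1, plogis(0.8 * x1))

m_full    <- glm(y ~ x1 + x2, family = binomial)
m_reduced <- glm(y ~ x1,      family = binomial)

AIC(m_reduced, m_full)   # lower AIC is preferred

## Equivalent by hand: deviance + 2 * number of estimated parameters
c(reduced = deviance(m_reduced) + 2 * 2,
  full    = deviance(m_full)    + 2 * 3)
```

As the quote stresses, this comparison is only justified when the two models are prespecified, not the survivors of a data-driven selection procedure.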


The entire chapter (4) discusses these issues in more detail and as Peter notes there are other books and papers that focus on the underlying issue of variable selection. As Frank is oft-quoted as saying:

"Variable selection is hazardous both to inference and to prediction. There is no free lunch; we are torturing data to confess its own sins."


Going back to Lincoln's prior post in the thread, and presuming both that there is sufficient data to use the original pre-specified model and that the full model itself was not derived from prior variable selection or univariate pre-screening:

  mod1 <- glm(site ~ sex + birth + cohort + sex:birth, data = datasex, family = binomial)

I would recommend reviewing the likelihood ratio test for that model versus the null model:

  anova(mod1, test = "Chisq")

and determining whether or not 'cohort' was significant at some level there, rather than in the final reduced model. You might also want to consider using some of the tools in Frank's rms package on CRAN to further evaluate/validate that model.
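Since 'datasex' is not available here, the sketch below simulates data with the same structure as in the thread (binary 'site', 'sex', 'birth', and a 10-level 'cohort' factor) to show the recommended workflow: fit the pre-specified full model, look at the sequential likelihood ratio tests, and test 'cohort' directly within the full model.

```r
## Hypothetical data standing in for 'datasex'; variable names follow the thread.
set.seed(1)
n <- 1000
datasex <- data.frame(
  sex    = factor(sample(c("F", "M"), n, replace = TRUE)),
  birth  = rnorm(n),
  cohort = factor(sample(1:10, n, replace = TRUE))
)
datasex$site <- rbinom(n, 1, plogis(0.3 * datasex$birth))

## Pre-specified full model, as in Lincoln's post
mod1 <- glm(site ~ sex + birth + cohort + sex:birth,
            data = datasex, family = binomial)

## Sequential (Type I) likelihood ratio tests, term by term
anova(mod1, test = "Chisq")

## Direct LRT for 'cohort' within the full model
mod_nocohort <- update(mod1, . ~ . - cohort)
anova(mod_nocohort, mod1, test = "Chisq")
```

The df for the 'cohort' comparison is 9 here (10 levels minus 1), mirroring the comparison shown earlier in the thread.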

Regards,

Marc Schwartz


