[R-sig-ME] Model validation for Presence / Absence, (binomial) GLMs

Fri Jul 5 13:34:39 CEST 2013

On 05/07/2013 12:48 PM, Ken Knoblauch wrote:
> Ben Bolker <bbolker at ...> writes:
>> Highland Statistics Ltd <highstat <at> ...> writes:
>>>>> This is something I always battle with given the plethora of great model
>>>>> fitting methods available for other models.
>>>>>
>>>>> I always use a variant of Hugh's suggestion and look at the % of correct
>>>>> predictions between models as a quick model fitting statistic.
>>>>> And for overdispersion I believe one way is to fit individual level random
>>>>> effects and see if this is a substantively better model. There is more on
>>>>> this in the wiki http://glmm.wikidot.com/faq
>>>>     Yes, but this is unidentifiable for Bernoulli responses (as also
>>>> explained there).
>>> The statement on 'unidentifiable for Bernoulli
>>> responses'....well...apparently this is not that trivial.
>>> See:  http://www.highstat.com/BGGLM.htm
>>>
>>> Follow the link to: the Discussion Board....
>>> Go to: Chapter 1 Introduction to generalized linear models
>>> And see the topic: Can binary logistic models be overdispersed?
>>> Alain
>>    That's an interesting document: I think the bottom line is:
>>
>>    * if the Bernoulli data can be grouped, i.e. if there are
>> in general multiple observations with the same set of covariates,
>> then overdispersion can be identified, because the data are
>> really equivalent to a binomial response within the groups.
>>
>>    For example, the trivial example
>>
>> grp  resp
>> A    1
>> A    0
>> A    1
>> B    0
>> B    0
>> B    1
>>
>> is equivalent to:
>>
>> grp  successes total
>> A    2         3
>> B    1         3
>>
> Agreed that this is very interesting but still a bit mysterious
> as everything looks the same on the surface.
> The likelihoods only differ by the log of the binomial coefficients
> as can easily be verified on Ben's example above and as expected
> from the likelihood equations:
>
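> # Binary (ungrouped) data: one row per Bernoulli trial
> UnGrpd <- read.table(header = TRUE, text = "grp  resp
> A    1
> A    0
> A    1
> B    0
> B    0
> B    1")
>
> # Grouped data: number of successes out of total trials per group
> Grpd <- read.table(header = TRUE, text = "grp  successes total
> A    2         3
> B    1         3")
>
> # Difference in log-likelihood between the grouped and the binary fit
> -logLik(glm(resp ~ grp, family = binomial, data = UnGrpd)) +
>   logLik(glm(cbind(successes, total - successes) ~ grp,
>              family = binomial, data = Grpd))
>
> # Sum of the log binomial coefficients, for comparison with the difference
> with(Grpd, sum(log(choose(total, successes))))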
>
> However, looking at the outputs of the glm, the degrees of freedom
> differ, being 4 on the binary responses and 0 for the binomial response.
> Should degrees of freedom really be computed differently in the two cases
> since it is easy to transform the two cases back and forth?
> And, if so, what does that mean?
>
> Ken
>
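
The degrees-of-freedom difference can be reproduced directly. The following is only a minimal sketch built on Ben's toy data; the object names (binary, grouped, fit_binary, fit_grouped) are illustrative and not taken from the code above. Both fits give the same coefficients, but the residual degrees of freedom are counted from the number of rows passed to glm, not from the number of underlying Bernoulli trials.

```r
# Toy data from the thread, one row per Bernoulli trial
binary <- data.frame(
  grp  = rep(c("A", "B"), each = 3),
  resp = c(1, 0, 1, 0, 0, 1)
)

# Collapse to one row per group (successes out of total)
grouped <- data.frame(
  grp       = c("A", "B"),
  successes = as.vector(tapply(binary$resp, binary$grp, sum)),
  total     = as.vector(tapply(binary$resp, binary$grp, length))
)

fit_binary  <- glm(resp ~ grp, family = binomial, data = binary)
fit_grouped <- glm(cbind(successes, total - successes) ~ grp,
                   family = binomial, data = grouped)

# Same point estimates from both data layouts
coef(fit_binary)
coef(fit_grouped)

# Residual df = number of rows - number of coefficients
df.residual(fit_binary)   # 6 rows - 2 coefficients = 4
df.residual(fit_grouped)  # 2 rows - 2 coefficients = 0

# The residual deviances differ as well: the grouped model is saturated
deviance(fit_binary)
deviance(fit_grouped)
```

In the grouped layout the model with one parameter per group is saturated, so its residual deviance is essentially zero and carries no information about fit; the binary layout reports 4 residual df for the same estimates.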

The degrees-of-freedom issue looks similar to the one that arises when
assessing goodness of fit for logistic regression, which can be done with
grouped data (or by defining bins, as in the Hosmer-Lemeshow test, which is
based on the same idea of pooling data).

I would have to look back at the references below to see how these
goodness-of-fit tests are affected by overdispersion.
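
To make the binning idea concrete, here is a rough sketch of a Hosmer-Lemeshow-type statistic computed by hand in R. The data are simulated purely for illustration (the sample size, coefficients, and all variable names are assumptions, not from this thread), and the choice of 10 bins with g - 2 degrees of freedom follows the usual convention rather than anything specific to the models discussed here.

```r
set.seed(1)

# Simulated binary data (hypothetical, for illustration only)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit <- glm(y ~ x, family = binomial)
p   <- fitted(fit)

# Pool observations into g bins defined by quantiles of the fitted probabilities
g    <- 10
bins <- cut(p,
            breaks = quantile(p, probs = seq(0, 1, length.out = g + 1)),
            include.lowest = TRUE)

obs      <- tapply(y, bins, sum)     # observed successes per bin
expected <- tapply(p, bins, sum)     # expected successes per bin
n_g      <- tapply(y, bins, length)  # bin sizes

# Pearson-type statistic on the pooled (binomial-like) counts
hl <- sum((obs - expected)^2 / (expected * (1 - expected / n_g)))

# Conventional reference distribution: chi-square with g - 2 df
p_value <- pchisq(hl, df = g - 2, lower.tail = FALSE)

hl
p_value
```

The degrees of freedom here come from the number of bins, not from the number of binary observations, which is the same grouping issue as in the binary versus binomial fits.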

Hosmer et al. The importance of assessing the fit of logistic regression
models: a case study. Am J Public Health. 1991;81(12).

Hosmer DW, Lemeshow S. A goodness-of-fit test for the multiple logistic
regression model. Commun Stat. 1980;A10:1043-1069.

Lemeshow S, Hosmer DW. A review of goodness-of-fit statistics for use in
the development of logistic regression models. Am J Epidemiol.
1982;115:92-106.

Hosmer DW, Lemeshow S, Klar J. Goodness-of-fit testing for multiple
logistic regression analysis when the estimated probabilities are small.
Biometrical J. 1988;30(7):1-14.

Gabriel

-- 
---------------------------------------------------------------------
Gabriel Baud-Bovy               tel.: (+39) 02 2643 4839 (office)
UHSR University                       (+39) 02 2643 3429 (laboratory)
via Olgettina, 58                     (+39) 02 2643 4891 (secretary)
20132 Milan, Italy               fax: (+39) 02 2643 4892