[Rd] Apparant bug in binomial model in GLM (PR#13434)

Wed Jan 7 17:40:45 CET 2009

On Wed, Jan 07, 2009 at 04:48:03PM +0100, Peter Dalgaard wrote:
> Charles Geyer wrote:
> ...
> > BTW the particular example given doesn't make clear WHAT question cannot
> > be answered correctly.  Some questions can be without fuss, for example
> > 
> >> y <- c(0,0,0,0,0,1,1,1,1,1)
> >> x <- seq(along = y)
> >> out1 <- glm(y ~ x, family = binomial)
> > Warning messages:
> > 1: In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,  :
> >   algorithm did not converge
> > 2: In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart,  :
> >   fitted probabilities numerically 0 or 1 occurred
> >> out0 <- glm(y ~ 1, family = binomial)
> >> anova(out0, out1, test = "Chisq")
> > Analysis of Deviance Table
> > 
> > Model 1: y ~ 1
> > Model 2: y ~ x
> >   Resid. Df Resid. Dev Df Deviance P(>|Chi|)
> > 1         9    13.8629
> > 2         8  7.865e-10  1  13.8629    0.0002
> > 
> > This P-value (P = 0.0002) is valid, because the MLE does exist for the null
> > hypothesis.  Hence we see that we have to use the model y ~ x for which
> > the MLE does not exist in the conventional sense.
> 
> 
> It may be  valid in some senses, but I can't help notice that it is off
> by a factor of at least 10, since the experiment has only 1024 outcomes,
> two of which are as extreme as the one observed, and where all outcomes
> are equally likely under the corresponding y~1 model.

It's valid in the sense that all of the P-values R produces for GLM are valid.
It's an asymptotic approximation.  As with all asymptotic approximations,
at best only the absolute error is small, not the relative error.  This
is no worse than any other P-value produced by anova.glm.  And the conclusion
that P < 0.05 or P < 0.01 is correct.

You already know all this, so why the e-mail?

Worst case, a P-value produced by anova.glm can be very questionable when
"n" is too small.  With n = 10 here, it's amazing that it does as well as it
does.  That's because the MLE in the null hypothesis says all of the
response variables have the symmetric binomial distribution, and the CLT does
work, more or less, down to n = 10 for the symmetric binomial distribution.

If you don't trust the asymptotics, then do a parametric bootstrap.  That's
trivial in R.

My point wasn't about the validity of asymptotics.  My point was that either
the asymptotic test done by anova.glm or the parametric bootstrap makes sense
only when the MLE exists for the null hypothesis.  Otherwise one has to follow
the procedures in my tech report.

-- 
Charles Geyer
Professor, School of Statistics
University of Minnesota
charlie at stat.umn.edu