[R] Negative Binomial: glm.nb

Wed Aug 15 18:28:19 CEST 2007

Hi Folks,

I'm playing with glm.nb() in MASS.

Reference: the negative binomial distribution

  P(y) = (Gamma(theta+y)/(Gamma(theta)*y!))*(p^theta)*(1-p)^y

  y = 0,1,2,...

in the notation of the MASS book (section 7.4), where

  p = theta/(mu + theta)  so  (1-p) = mu/(mu + theta)

where mu is the expected value of Y. It seems from ?glm.nb that
an initial value of theta is either supplied, or obtained by
a method of moments after an initial Poisson fit to the data;
then it casts back and forth:

A: given theta, glm is applied to fit the linear model, with
log link, for mu. Then theta is re-estimed, then the linear model,
and so on.

I hope I've understood right!

However, this procedure if based on the assumption that the same
value of theta applies across the board, regardless of the values
of any covariates or factors.

The MASS book (Section 7.4) illustrates the use of glm.nb() on
the Quine data (for numbers of days absence from school in a
school year by Australian children). There are four factors in
addition to the  "outcome" Days:

Eth: Ethnicity (Aboriginal "A"/Non-Aboriginal "N")
Lrn: Learning Ability (Slow learner"SL", Average Learner "AL")
Age: Age Group (Primary "F0", First/Second/Third Forms "F1/2/3")
Sex: Gender ("M"/"F")

This dataset is earlier analysed thoroughly in the MASS book
(Section 6.8), primarily with reference to transforming Days
by a Box-Cox transformation. The negative binomial does not
play a role in that part.

When it comes to the glm.nb(), however, the assumption of "constant
theta" is implicit in Section 7.4.

But, if you look at the data, this does seem all that reasonable.

Preamble: Get the variables out of 'quine' into convenient names.

library(MASS)
D<-quine$Days ; E<-quine$Eth; L<-quine$Lrn; A<-quine$Age; S<-quine$Sex

D.A<-D[E=="A"]; L.A<-L[E=="A"]; A.A<-A[E=="A"]; S.A<-S[E=="A"]
D.N<-D[E=="N"]; L.N<-L[E=="N"]; A.N<-A[E=="N"]; S.N<-S[E=="N"]

Now, if you look at the histograms of D.A and D.N separately:

  hist(D.A,breaks=-0.5+4*(0:22))
  hist(D.N,breaks=-0.5+4*(0:22))

you can see that there is a major difference in shape, such as
would arise if theta<1 for Eth="A", and theta>1 for Eth="B".
This is confirmed by a separate covariate-free fit of the mean
to each group, from which the fitted theta is returned:

summary(glm.nb(D.A~1))
-->           Theta:  1.499
          Std. Err.:  0.260

summary(glm.nb(D.A~1))
-->           Theta:  0.919
          Std. Err.:  0.159

(1.499-0.919)/sqrt(0.260^2 + 0.159^2)
##[1] 1.903113

1-pnorm(1.903113)
##[1] 0.0285129

so we have a 1-sided P-value of 0.029, a 2-sided of 0.057

That's a fairly strong indication that theta is not the same
for the two groups; and the shapes of the histograms show that
the difference has an important effect on the distribution.

Of course, this dataset is notoriously unbalanced, so maybe this
is a reflection of the mixtures of categories. A simple linear
model on each subgroup eases tha above situation a bit:

summary(glm.nb(D.A ~ S.A+A.A+L.A))
-->           Theta:  1.767
          Std. Err.:  0.318
summary(glm.nb(D.N ~ S.N+A.N+L.N))
              Theta:  1.107
          Std. Err.:  0.201
1-pnorm((1.767-1.107)/sqrt(0.318^2+0.201^2)) = 0.0397

and throwing everything into the pot eases it a bit more:

summary(glm.nb(D.A ~ S.A*A.A*L.A))
-->           Theta:  2.317
          Std. Err.:  0.438
summary(glm.nb(D.N ~ S.N*A.N*L.N))
-->           Theta:  1.581
          Std. Err.:  0.318
1-pnorm((2.317-1.581)/sqrt(0.438^2+0.318^2)) = 0.0870

But, while the theta-values have now moved well up into the ">1"
range (where they have less effect on the shape of the distribution),
nevertheless the differences of the "A" and "N" estimates have
not changed much: 0.580 --> 0.660 --> 0.736
even though their P-values have eased off somwehat (and, with 69
and 77 data for the "A" and "N" groups respectively, we don't have
huge power here, I think).

I've gone through the above in detail, since it's an illustration
which you cam all work through yourselves in R, to illustrate the
issue about variation of theta in a negative binomial model, when
you want to get at the effects of covariates on the outcome Y (the
counts of events). In real life I'm considering other data where
the "theta" effect is more marked, and probably more important.
Which finally leads on to my

QUERY: Is there anything implement{ed|able} in R which can address
the issue that theta may vary according to factors or covariates,
as well as the level of mu given theta?

Or is one reduced to running glm.nb() (or the like) on subsets?
(Not that this would be much use if theta depended on a continuous
covariate).

With thanks, and best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 15-Aug-07                                       Time: 17:28:14
------------------------------ XFMail ------------------------------