[Rd] nobs() with glm(family="poisson")

Wed Feb 27 21:55:57 CET 2013

Thanks for the (critical, indeed) answer!

Le mercredi 27 février 2013 à 20:48 +0100, peter dalgaard a écrit :
> On Feb 27, 2013, at 19:46 , Milan Bouchet-Valat wrote:
> 
> > I cannot believes nobody cares about this -- or I'm completely wrong and
> > in that case everybody should rush to put the shame on me... :-p
> 
> Well, nobs() is the number of observations. If you have 5 Poisson
> distributed counts, you have 5 observations.
Well, say that to the statistical offices that spend millions to survey
thousands of people with correct (but complex) sampling designs, they'll
be happy to know that the collected data only provides an information
equivalent to 5 independent outcomes. ;-)

> If the number of observations is not the right thing to use in some
> context, use the right thing instead. Changing the definition of
> nobs() surely leads to madness. 
It is common usage in the literature using log-linear models to report
the sum of counts as the number of observations. I think this indeed
makes sense, but I'm not particularly attached to the choice of words --
let's call it as you please.

The root issue is that nobs() was precisely introduced to be the basis
for the BIC() function, as ?nobs states explicitly:
>      Extract the number of ‘observations’ from a model fit.  This is
>      principally intended to be used in computing BIC (see ‘AIC’)

So it's OK to say that the number of observations is the number of cells
(even if I think this is not very user-friendly), but then the
documentation is misleading, and the BIC() function returns incorrect
values for the very first example provided in ?glm.

> (I suppose that the fact that n is so obviously the wrong thing for
> one particularly well-digested family of distribution functions could
> be taken to indicate a generic weakness with the BIC.)
I'm sure we can agree on the fact that BIC has its weaknesses (and I'm
not the best person able to judge), but the point at stake is IMHO not
one of them. After all, usual statistics for the Poisson family, such as
deviance or residuals, are based on the sum of counts, not on the number
of cells, and nobody objects.

The fact that the AIC is perfectly integrated into S/R and BIC seems to
be merely an historical detail to me. Computing the AIC itself requires
glm.fit() to add twice the equivalent degrees of freedom to the value
returned by the family function, so why would an equivalent
special-casing of BIC be the sign of an intrinsic statistical
deficiency? Maybe the BIC is not a good indicator, but technical
arguments are somewhat secondary in that debate.

Of course, if BIC is a bad indicator, BIC() should probably be
discouraged in the documentation, and print a warning when the returned
value is known to be incorrect.

Regards

> > In the meantime, I have come up with an alternative way of fixing this:
> > when modeling count data, glm() could allow users to pass a table as the
> > data argument, and convert it to a data frame using
> > as.data.frame.table() instead of requiring the user to do it
> > beforehand[1]. This would become the recommended way of fitting models
> > for count data, and the fact that a table is passed could be used as the
> > sign that nobs() should return the sum of cell counts instead of the
> > number of rows in the data.frame.
> > 
> > 
> > Regards
> > 
> > 
> > 1: gnm already supports this pattern, with the additional advantage that
> > e.g. fitted(), predict(), residuals() and weights() return an object of
> > the same dimensions and dimnames as the original table.
> > 
> > 
> > Le lundi 18 février 2013 à 12:22 +0100, Milan Bouchet-Valat a écrit :
> >> Hi!
> >> 
> >> The nobs() method for glm objects always returns the number of cases
> >> with non-null weights in the data, which does not correspond to the
> >> number of observations for Poisson regression/log-linear models, i.e.
> >> when family="poisson" or family="quasipoisson".
> >> 
> >> This sounds dangerous since nobs() is, as the documentation states,
> >> primarily aimed at computing the Bayesian information criterion. Raftery
> >> (1995:20) warned against this:
> >>> What should n be? Once again, it is best to use the actual number of
> >>> individuals, i.e. the sum of the cell counts, and not the number of
> >>> cells (Raftery, 1986a).
> >> 
> >> Is there a reason why this should not/cannot be done that way?
> >> 
> >> This behavior can be reproduced with with R 3.0.0 from SVN, using the
> >> example from ?glm:
> >> counts <- c(18,17,15,20,10,20,25,13,12)
> >> outcome <- gl(3,1,9)
> >> treatment <- gl(3,3)
> >> glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
> >> nobs(glm.D93)
> >> # 9 == length(counts)
> >> # Should be 150 == sum(counts)
> >> 
> >> FWIW, stats:::nobs.glm is currently defined as:
> >> nobs.glm <- function (object, ...) 
> >>    if (!is.null(w <- object$prior.weights)) sum(w != 0) else length(object$residuals)
> >> 
> >> 
> >> Thanks!
> >> 
> >> 
> >> Raftery, Adrian E. 1995. “Bayesian Model Selection in Social Research.”
> >> Sociological methodology 25:111–96.
> >> 
> >> ______________________________________________
> >> R-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > 
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>