[Rd] nobs() with glm(family="poisson")

peter dalgaard pdalgd at gmail.com
Wed Feb 27 20:48:26 CET 2013

On Feb 27, 2013, at 19:46 , Milan Bouchet-Valat wrote:

> I cannot believes nobody cares about this -- or I'm completely wrong and
> in that case everybody should rush to put the shame on me... :-p

Well, nobs() is the number of observations. If you have 5 Poisson distributed counts, you have 5 observations. If the number of observations is not the right thing to use in some context, use the right thing instead. Changing the definition of nobs() surely leads to madness. 

(I suppose that the fact that n is so obviously the wrong thing for one particularly well-digested family of distribution functions could be taken to indicate a generic weakness with the BIC.)

> In the meantime, I have come up with an alternative way of fixing this:
> when modeling count data, glm() could allow users to pass a table as the
> data argument, and convert it to a data frame using
> as.data.frame.table() instead of requiring the user to do it
> beforehand[1]. This would become the recommended way of fitting models
> for count data, and the fact that a table is passed could be used as the
> sign that nobs() should return the sum of cell counts instead of the
> number of rows in the data.frame.
> Regards
> 1: gnm already supports this pattern, with the additional advantage that
> e.g. fitted(), predict(), residuals() and weights() return an object of
> the same dimensions and dimnames as the original table.
> Le lundi 18 février 2013 à 12:22 +0100, Milan Bouchet-Valat a écrit :
>> Hi!
>> The nobs() method for glm objects always returns the number of cases
>> with non-null weights in the data, which does not correspond to the
>> number of observations for Poisson regression/log-linear models, i.e.
>> when family="poisson" or family="quasipoisson".
>> This sounds dangerous since nobs() is, as the documentation states,
>> primarily aimed at computing the Bayesian information criterion. Raftery
>> (1995:20) warned against this:
>>> What should n be? Once again, it is best to use the actual number of
>>> individuals, i.e. the sum of the cell counts, and not the number of
>>> cells (Raftery, 1986a).
>> Is there a reason why this should not/cannot be done that way?
>> This behavior can be reproduced with with R 3.0.0 from SVN, using the
>> example from ?glm:
>> counts <- c(18,17,15,20,10,20,25,13,12)
>> outcome <- gl(3,1,9)
>> treatment <- gl(3,3)
>> glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
>> nobs(glm.D93)
>> # 9 == length(counts)
>> # Should be 150 == sum(counts)
>> FWIW, stats:::nobs.glm is currently defined as:
>> nobs.glm <- function (object, ...) 
>>    if (!is.null(w <- object$prior.weights)) sum(w != 0) else length(object$residuals)
>> Thanks!
>> Raftery, Adrian E. 1995. “Bayesian Model Selection in Social Research.”
>> Sociological methodology 25:111–96.
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

More information about the R-devel mailing list