[Rd] nobs() with glm(family="poisson")

Wed Feb 27 23:26:10 CET 2013

> -----Original Message-----
> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org]
> On Behalf Of Milan Bouchet-Valat
> Sent: February-27-13 12:56 PM
> To: peter dalgaard
> Cc: r-devel
> Subject: Re: [Rd] nobs() with glm(family="poisson")
> 
> Thanks for the (critical, indeed) answer!
> 
> Le mercredi 27 février 2013 à 20:48 +0100, peter dalgaard a écrit :
> > On Feb 27, 2013, at 19:46 , Milan Bouchet-Valat wrote:
> >
> > > I cannot believes nobody cares about this -- or I'm completely wrong
> and
> > > in that case everybody should rush to put the shame on me... :-p
> >
> > Well, nobs() is the number of observations. If you have 5 Poisson
> > distributed counts, you have 5 observations.
> Well, say that to the statistical offices that spend millions to survey
> thousands of people with correct (but complex) sampling designs, they'll
> be happy to know that the collected data only provides an information
> equivalent to 5 independent outcomes. ;-)

Milan:

It seems to me you are mixing up Binomial and Poisson situations,
and not assessing independence appropriately.

The above example discusses Bernoulli outcomes which are sometimes
aggregated into Binomial "cases" depending on the study design.
Now if the survey samples people in the same household or even
neighbourhood, those Bernoulli outcomes will not be independent
(hence clustered survey techniques) and summing the Binomial
denominators would not be appropriate, for the survey analysis or
for BIC calculations.  The "n" in the BIC calculation should
reflect independent observations.  If you knock on the same
door 1000 times and ask the person who they will vote for,
you do not have 1000 independent observations, even though
your Binomial denominator is 1000.

The example you show from ?glm is a Poisson example showing
9 independent Poisson counts.  If I count the number of cars
passing through an intersection during non-overlapping
one minute intervals (say 9 such intervals), then the number 
of observations I have is the number of non-overlapping 
one minute interval car count totals (e.g. the nine counts
c(18, 17, 15, 20, 10, 20, 25, 13, 12)), not the number of 
cars I saw in total.

A piece of software that adds things up can not know the
context from which the numbers were derived, so you have to
figure out the level of independence appropriate to your
study design and work out the BIC count accordingly.

Raftery alludes to this in a preceding section:

"When the data have been collected using a complex survey 
design with resulting weights, it is not yet clear what n 
should be, and this issue awaits further study.  However, 
it seems reasonable that if the model is based on an 
assumption of simple random sampling but the sampling 
design is less efficient, then n should be reduced to 
reflect the efficiency of the sampling design relative to 
simple random sampling."

Steven McKinney
Statistician
Molecular Oncology and Breast Cancer Program
British Columbia Cancer Research Centre

> 
> > If the number of observations is not the right thing to use in some
> > context, use the right thing instead. Changing the definition of
> > nobs() surely leads to madness.
> It is common usage in the literature using log-linear models to report
> the sum of counts as the number of observations. I think this indeed
> makes sense, but I'm not particularly attached to the choice of words --
> let's call it as you please.
> 
> The root issue is that nobs() was precisely introduced to be the basis
> for the BIC() function, as ?nobs states explicitly:
> >      Extract the number of ‘observations’ from a model fit.  This is
> >      principally intended to be used in computing BIC (see ‘AIC’)
> 
> So it's OK to say that the number of observations is the number of cells
> (even if I think this is not very user-friendly), but then the
> documentation is misleading, and the BIC() function returns incorrect
> values for the very first example provided in ?glm.
> 
> > (I suppose that the fact that n is so obviously the wrong thing for
> > one particularly well-digested family of distribution functions could
> > be taken to indicate a generic weakness with the BIC.)
> I'm sure we can agree on the fact that BIC has its weaknesses (and I'm
> not the best person able to judge), but the point at stake is IMHO not
> one of them. After all, usual statistics for the Poisson family, such as
> deviance or residuals, are based on the sum of counts, not on the number
> of cells, and nobody objects.
> 
> The fact that the AIC is perfectly integrated into S/R and BIC seems to
> be merely an historical detail to me. Computing the AIC itself requires
> glm.fit() to add twice the equivalent degrees of freedom to the value
> returned by the family function, so why would an equivalent
> special-casing of BIC be the sign of an intrinsic statistical
> deficiency? Maybe the BIC is not a good indicator, but technical
> arguments are somewhat secondary in that debate.
> 
> 
> Of course, if BIC is a bad indicator, BIC() should probably be
> discouraged in the documentation, and print a warning when the returned
> value is known to be incorrect.
> 
> 
> Regards
> 
> > > In the meantime, I have come up with an alternative way of fixing this:
> > > when modeling count data, glm() could allow users to pass a table as
> the
> > > data argument, and convert it to a data frame using
> > > as.data.frame.table() instead of requiring the user to do it
> > > beforehand[1]. This would become the recommended way of fitting models
> > > for count data, and the fact that a table is passed could be used as
> the
> > > sign that nobs() should return the sum of cell counts instead of the
> > > number of rows in the data.frame.
> > >
> > >
> > > Regards
> > >
> > >
> > > 1: gnm already supports this pattern, with the additional advantage
> that
> > > e.g. fitted(), predict(), residuals() and weights() return an object of
> > > the same dimensions and dimnames as the original table.
> > >
> > >
> > > Le lundi 18 février 2013 à 12:22 +0100, Milan Bouchet-Valat a écrit :
> > >> Hi!
> > >>
> > >> The nobs() method for glm objects always returns the number of cases
> > >> with non-null weights in the data, which does not correspond to the
> > >> number of observations for Poisson regression/log-linear models, i.e.
> > >> when family="poisson" or family="quasipoisson".
> > >>
> > >> This sounds dangerous since nobs() is, as the documentation states,
> > >> primarily aimed at computing the Bayesian information criterion.
> Raftery
> > >> (1995:20) warned against this:
> > >>> What should n be? Once again, it is best to use the actual number of
> > >>> individuals, i.e. the sum of the cell counts, and not the number of
> > >>> cells (Raftery, 1986a).
> > >>
> > >> Is there a reason why this should not/cannot be done that way?
> > >>
> > >> This behavior can be reproduced with with R 3.0.0 from SVN, using the
> > >> example from ?glm:
> > >> counts <- c(18,17,15,20,10,20,25,13,12)
> > >> outcome <- gl(3,1,9)
> > >> treatment <- gl(3,3)
> > >> glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())
> > >> nobs(glm.D93)
> > >> # 9 == length(counts)
> > >> # Should be 150 == sum(counts)
> > >>
> > >> FWIW, stats:::nobs.glm is currently defined as:
> > >> nobs.glm <- function (object, ...)
> > >>    if (!is.null(w <- object$prior.weights)) sum(w != 0) else
> length(object$residuals)
> > >>
> > >>
> > >> Thanks!
> > >>
> > >>
> > >> Raftery, Adrian E. 1995. “Bayesian Model Selection in Social
> Research.”
> > >> Sociological methodology 25:111–96.
> > >>
> > >> ______________________________________________
> > >> R-devel at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >
> > > ______________________________________________
> > > R-devel at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel