[Rd] nobs() with glm(family="poisson")

Milan Bouchet-Valat nalimilan at club.fr
Thu Feb 28 22:59:43 CET 2013


On Wednesday, February 27, 2013 at 18:59 -0500, John Fox wrote:
> Dear Milan and Steven,
> 
> At the risk of muddying the water further, I think that the potential
> confusion here is that Poisson GLMs are applied in two formally
> equivalent but substantively different situations: (1) where the
> counts are cells in a contingency table, in which case the Poisson GLM
> is used to fit an equivalent loglinear association model to the table;
> and (2) where the counts are observations on a non-negative integer
> response -- what's often called "Poisson regression." In the first
> case, but not the second, it makes sense to think of the sum of the
> counts as the natural sample size. I don't think that one can expect
> GLM software to distinguish these cases.
Thanks, that sounds like a good summary of the situation. There are two
legitimate use cases and two definitions of the number of observations,
yet only one nobs() and one BIC() function.
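
To make the two cases concrete, here is a minimal sketch (not from the
original discussion), using the built-in UCBAdmissions table for case (1)
and the Dobson counts from the ?glm example for case (2):

## (1) Log-linear model: the counts are the cells of a contingency table
tab <- as.data.frame(UCBAdmissions)
fit1 <- glm(Freq ~ Admit + Gender + Dept, family = poisson, data = tab)
nobs(fit1)     # 24, the number of cells
sum(tab$Freq)  # 4526, arguably the natural sample size here

## BIC() is computed from nobs(), i.e. with log(24) rather than log(4526)
BIC(fit1)
-2 * as.numeric(logLik(fit1)) + attr(logLik(fit1), "df") * log(sum(tab$Freq))

## (2) Poisson regression: each count is one observation
counts <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome <- gl(3, 1, 9)
treatment <- gl(3, 3)
fit2 <- glm(counts ~ outcome + treatment, family = poisson)
nobs(fit2)     # 9, which is the right answer in this setting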

Indeed, software can hardly tell which situation applies, except perhaps
by assuming (as I suggested above) that log-linear models are always
fitted to tabular data (several observations per cell), while Poisson
regressions are fitted to data frames (one observation per row). If this
is right, then a possible solution would be to define nobs.glm() like
this:

nobs.glm <- function(object, ...) {
    w <- object$prior.weights

    if (is.matrix(object$data)) {
        ## Tabular data (log-linear model): the number of observations is
        ## the total count, skipping cells with zero weight
        if (!is.null(w)) sum(object$data[w != 0], na.rm = TRUE)
        else sum(object$data, na.rm = TRUE)
    } else {
        ## Data frame (Poisson regression): one observation per row,
        ## skipping rows with zero weight
        if (!is.null(w)) sum(w != 0) else length(object$residuals)
    }
}

This would just require glm() to call as.data.frame(data) when passed a
table.


(loglin() could be considered the most natural way of fitting log-linear
models, but glm() is very useful too since it supports the quasipoisson
family, and the negative binomial via glm.nb(); finally, glm() handles
structural zeros better than loglin().)
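
For instance, the conditional independence model [Admit, Dept][Gender, Dept]
for UCBAdmissions can be fitted both ways (a rough sketch, assuming MASS is
installed; not from the original discussion):

library(MASS)  # for glm.nb()

## With loglin(), directly on the table (iterative proportional fitting);
## dimensions 1, 2, 3 are Admit, Gender, Dept
fit.ipf <- loglin(UCBAdmissions, margin = list(c(1, 3), c(2, 3)))

## The same model with glm(), which also allows the quasipoisson family
## and, via glm.nb(), the negative binomial
tab <- as.data.frame(UCBAdmissions)
fit.pois  <- glm(Freq ~ Admit*Dept + Gender*Dept, family = poisson, data = tab)
fit.quasi <- glm(Freq ~ Admit*Dept + Gender*Dept, family = quasipoisson, data = tab)
fit.nb    <- glm.nb(Freq ~ Admit*Dept + Gender*Dept, data = tab)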


Regards

> Best,
>  John
> 
> > -----Original Message-----
> > From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Milan Bouchet-Valat
> > Sent: Wednesday, February 27, 2013 5:59 PM
> > To: Steven McKinney
> > Cc: r-devel
> > Subject: Re: [Rd] nobs() with glm(family="poisson")
> > 
> > On Wednesday, February 27, 2013 at 14:26 -0800, Steven McKinney wrote:
> > >
> > > > -----Original Message-----
> > > > From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Milan Bouchet-Valat
> > > > Sent: February-27-13 12:56 PM
> > > > To: peter dalgaard
> > > > Cc: r-devel
> > > > Subject: Re: [Rd] nobs() with glm(family="poisson")
> > > >
> > > > Thanks for the (critical, indeed) answer!
> > > >
> > > > On Wednesday, February 27, 2013 at 20:48 +0100, peter dalgaard wrote:
> > > > > On Feb 27, 2013, at 19:46, Milan Bouchet-Valat wrote:
> > > > >
> > > > > > I cannot believe nobody cares about this -- or I'm completely
> > > > > > wrong and in that case everybody should rush to put the shame
> > > > > > on me... :-p
> > > > >
> > > > > Well, nobs() is the number of observations. If you have 5 Poisson
> > > > > distributed counts, you have 5 observations.
> > > > Well, say that to the statistical offices that spend millions to
> > > > survey thousands of people with correct (but complex) sampling
> > > > designs, they'll be happy to know that the collected data only
> > > > provides information equivalent to 5 independent outcomes. ;-)
> > >
> > > Milan:
> > >
> > > It seems to me you are mixing up Binomial and Poisson situations,
> > > and not assessing independence appropriately.
> > >
> > > The above example discusses Bernoulli outcomes which are sometimes
> > > aggregated into Binomial "cases" depending on the study design.
> > > Now if the survey samples people in the same household or even
> > > neighbourhood, those Bernoulli outcomes will not be independent
> > > (hence clustered survey techniques) and summing the Binomial
> > > denominators would not be appropriate, for the survey analysis or
> > > for BIC calculations.  The "n" in the BIC calculation should
> > > reflect independent observations.  If you knock on the same
> > > door 1000 times and ask the person who they will vote for,
> > > you do not have 1000 independent observations, even though
> > > your Binomial denominator is 1000.
> > My intention was not to introduce the issue of survey designs into the
> > discussion, but merely to make the point that in surveys, counts are
> > usually *to some extent at least* independent observations, even when
> > clustering is present, and that the fact that different people are asked
> > and that each answer costs money is the best indication of that. Anyway,
> > BIC does not apply if we are not assuming that the data comes from a
> > simple random sample, so let's leave this complication aside.
> > 
> > > The example you show from ?glm is a Poisson example showing
> > > 9 independent Poisson counts.  If I count the number of cars
> > > passing through an intersection during non-overlapping
> > > one minute intervals (say 9 such intervals), then the number
> > > of observations I have is the number of non-overlapping
> > > one minute interval car count totals (e.g. the nine counts
> > > c(18, 17, 15, 20, 10, 20, 25, 13, 12)), not the number of
> > > cars I saw in total.
> > Interesting. Indeed, in the observation setting you describe, 9 is
> > AFAICT the correct number of observations. Is this kind of data
> > commonly fitted using glm()?
> > 
> > Do you happen to possess a copy of the book where the ?glm example
> > comes from? There are not many of them here in France so I cannot
> > consult it easily. It seems to me that in the context of a randomized
> > controlled trial, the number of independent observations is the number
> > of subjects, not the number of groups. And thus, BIC() would still
> > return a wrong value for the ?glm example.
> > 
> > > A piece of software that adds things up can not know the
> > > context from which the numbers were derived, so you have to
> > > figure out the level of independence appropriate to your
> > > study design and work out the BIC count accordingly.
> > This is a strong argument indeed. It would mean that BIC() is at best a
> > function of very limited use, or even a dangerous one, unless one can
> > safely assume that the case where the number of observations equals the
> > number of rows in the data is by far the most common one. I am biased
> > due to my use of log-linear models, but I doubt this is the case. Is it
> > (I might very well be wrong)?
> > 
> > > Raftery alludes to this in a preceding section:
> > >
> > > "When the data have been collected using a complex survey
> > > design with resulting weights, it is not yet clear what n
> > > should be, and this issue awaits further study.  However,
> > > it seems reasonable that if the model is based on an
> > > assumption of simple random sampling but the sampling
> > > design is less efficient, then n should be reduced to
> > > reflect the efficiency of the sampling design relative to
> > > simple random sampling."
> > I think Raftery had in mind surveys in which the assumption of
> > independence between observations (counts, not rows) does not hold, but
> > where it is still the reference from which the sample deviates (lower
> > "efficiency"). In this case, the number of cells/rows is by no means a
> > good measure of the number of observations either -- but as I said,
> > BIC is usually considered undefined in this case.
> > 
> > 
> > Thanks for sharing your remarks.
> > 


