[Rd] nobs() with glm(family="poisson")

Wed Feb 27 23:58:40 CET 2013

On Wednesday 27 February 2013 at 14:26 -0800, Steven McKinney wrote:
> 
> > -----Original Message-----
> > From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org]
> > On Behalf Of Milan Bouchet-Valat
> > Sent: February-27-13 12:56 PM
> > To: peter dalgaard
> > Cc: r-devel
> > Subject: Re: [Rd] nobs() with glm(family="poisson")
> > 
> > Thanks for the (critical, indeed) answer!
> > 
> > On Wednesday 27 February 2013 at 20:48 +0100, peter dalgaard wrote:
> > > On Feb 27, 2013, at 19:46 , Milan Bouchet-Valat wrote:
> > >
> > > > I cannot believe nobody cares about this -- or I'm completely wrong and
> > > > in that case everybody should rush to put the shame on me... :-p
> > >
> > > Well, nobs() is the number of observations. If you have 5 Poisson
> > > distributed counts, you have 5 observations.
> > Well, say that to the statistical offices that spend millions to survey
> > thousands of people with correct (but complex) sampling designs, they'll
> > be happy to know that the collected data only provide information
> > equivalent to 5 independent outcomes. ;-)
> 
> Milan:
> 
> It seems to me you are mixing up Binomial and Poisson situations,
> and not assessing independence appropriately.
> 
> The above example discusses Bernoulli outcomes which are sometimes
> aggregated into Binomial "cases" depending on the study design.
> Now if the survey samples people in the same household or even
> neighbourhood, those Bernoulli outcomes will not be independent
> (hence clustered survey techniques) and summing the Binomial
> denominators would not be appropriate, for the survey analysis or
> for BIC calculations.  The "n" in the BIC calculation should
> reflect independent observations.  If you knock on the same
> door 1000 times and ask the person who they will vote for,
> you do not have 1000 independent observations, even though
> your Binomial denominator is 1000.
My intention was not to bring survey designs into the discussion, but
merely to point out that in surveys, counts are usually independent
observations *to some extent at least*, even when clustering is present.
The fact that different people are asked, and that each answer costs
money, is the best indication of that. Anyway, BIC does not apply unless
we assume that the data come from a simple random sample, so let's leave
this complication aside.

> The example you show from ?glm is a Poisson example showing
> 9 independent Poisson counts.  If I count the number of cars
> passing through an intersection during non-overlapping
> one minute intervals (say 9 such intervals), then the number 
> of observations I have is the number of non-overlapping 
> one minute interval car count totals (e.g. the nine counts
> c(18, 17, 15, 20, 10, 20, 25, 13, 12)), not the number of 
> cars I saw in total.
Interesting. Indeed in the observation setting you describe, 9 is AFAICT
the correct number of observations. Is this kind of data commonly fitted
using glm()?

Do you happen to possess a copy of the book where the ?glm example comes
from? There are not many of them here in France so I cannot consult it
easily. It seems to me that in the context of a randomized controlled
trial, the number of independent observations is the number of subjects,
not the number of groups. And thus, BIC() would still return a wrong
value for the ?glm example.
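
To make concrete what is at stake, here is a minimal sketch using the
counts quoted above. The factor layout follows the ?glm example as I
remember it, and the names n_rows and n_events are only illustrative.
It compares the penalty BIC() applies (n = number of counts) with the
one obtained if n is taken to be the total number of units counted,
which is an assumption about the study design and not something the
software can know:

```r
# Poisson example from ?glm, using the nine counts quoted above
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)
fit <- glm(counts ~ outcome + treatment, family = poisson())

ll <- logLik(fit)
k  <- attr(ll, "df")        # number of estimated parameters

n_rows   <- nobs(fit)       # what BIC() uses: one observation per count
n_events <- sum(counts)     # alternative reading: one observation per unit counted

# BIC = -2 * logLik + k * log(n), computed by hand for both choices of n
bic_rows   <- -2 * as.numeric(ll) + k * log(n_rows)
bic_events <- -2 * as.numeric(ll) + k * log(n_events)

print(c(n_rows = n_rows, n_events = n_events))
print(c(BIC = BIC(fit), bic_rows = bic_rows, bic_events = bic_events))
```

The two hand-computed values differ only through the penalty term, by
k * log(n_events / n_rows), so the choice of n matters most when
comparing models with different numbers of parameters.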

> A piece of software that adds things up cannot know the
> context from which the numbers were derived, so you have to
> figure out the level of independence appropriate to your
> study design and work out the BIC count accordingly.
This is a strong argument indeed. It would mean that BIC() is at best a
function of very limited use, or even a dangerous one, unless one can
safely assume that the case where the number of observations equals the
number of rows in the data is by far the most common one. I am biased
by my use of log-linear models, but I doubt this is the case. Is it?
(I may well be wrong.)

> Raftery alludes to this in a preceding section:
> 
> "When the data have been collected using a complex survey 
> design with resulting weights, it is not yet clear what n 
> should be, and this issue awaits further study.  However, 
> it seems reasonable that if the model is based on an 
> assumption of simple random sampling but the sampling 
> design is less efficient, then n should be reduced to 
> reflect the efficiency of the sampling design relative to 
> simple random sampling."
I think Raftery had in mind surveys in which the assumption of
independence between observations (counts, not rows) does not hold, but
where it is still the reference from which the sample deviates (lower
"efficiency"). In this case, the number of cells/rows is by no means a
good measure of the number of observations either -- but as I said, BIC
is usually considered undefined in this case.
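
If one nevertheless wanted to experiment with Raftery's suggestion, one
possible reading is to pass the number of independent observations
explicitly instead of relying on nobs(). The sketch below is only that:
bic_with_n() is a hypothetical helper, not part of R, and both the
design effect of 2 and the idea of dividing n by it are my assumptions,
not something Raftery specifies.

```r
# Hypothetical helper: BIC with a user-supplied number of independent
# observations n, instead of the nobs() value used by BIC().
bic_with_n <- function(fit, n) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + attr(ll, "df") * log(n)
}

# Same Poisson example as in ?glm
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)
treatment <- gl(3, 3)
fit <- glm(counts ~ outcome + treatment, family = poisson())

deff  <- 2                    # assumed design effect, purely illustrative
n_eff <- sum(counts) / deff   # n reduced to reflect a less efficient design

print(bic_with_n(fit, nobs(fit)))   # same value as BIC(fit)
print(bic_with_n(fit, n_eff))       # penalty based on the reduced n
```

The point of the explicit argument is that the analyst, who knows the
study design, chooses n; the function makes no attempt to guess it.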

Thanks for sharing your remarks.