[R] Unexpected behavior with weights in binomial glm()
Ben Bolker
bbolker at gmail.com
Mon Oct 1 00:15:08 CEST 2012
Bert Gunter <gunter.berton <at> gene.com> writes:
>
> I haven't followed this thread closely, but if perfect separation in a
> binomial glm is the problem, google it. e.g.
>
> http://www.ats.ucla.edu/stat/mult_pkg/faq/general/complete_separation_logit_models.htm
>
> This presumably explains your concerns about coefficient agreement.
>
Agreed. The rest of my answer is below.
Josh Browning <rockclimber112358 <at> gmail.com> writes:
> Yes, I agree that the results are "very similar" but I don't
> understand why they are not exactly equal given that the data sets are
> identical.
>
> And yes, this 1% numerical difference is hugely important to me. I
> have another data set (much larger than this toy example) that works
> on the aggregated data (returning a coefficient of about 1) but
> returns the warning about perfect separation on the non-aggregated
> data (and a coefficient of about 1e15). So, I'd at least like to be
> able to understand where this numerical difference is coming from and,
> preferably, a way to tweak my glm() runs (possibly adjusting the
> numerical precision somehow???) so that this doesn't happen.
>
> Josh
I played around with this a bit, and I think the problem is so
numerically unstable that you really can't just tweak the settings on
glm() to make it work. (When a problem is numerically unstable,
nearly trivial differences like the order of operations or even the
compiler used can make big differences in the results.)
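
To make the instability concrete, here is a minimal sketch on made-up data (not Josh's), assuming complete separation at x = 0. The data, the object names, and the tightened glm.control() settings are all illustrative assumptions. It fits the same data in aggregated (weighted) and disaggregated form, then refits with a stricter convergence tolerance:

```r
# Hypothetical toy data: every response is 0 for x < 0 and 1 for x > 0,
# i.e. complete separation. This is NOT the data from the thread.
agg <- data.frame(x    = c(-2, -1, 1, 2),
                  n    = c(5, 5, 5, 5),
                  succ = c(0, 0, 5, 5))

# Aggregated form: proportion as response, group size as weights.
fit_agg <- suppressWarnings(
  glm(succ / n ~ x, family = binomial, weights = n, data = agg))

# Disaggregated form: one 0/1 row per trial.
dis <- data.frame(x = rep(agg$x, agg$n),
                  y = rep(agg$succ / agg$n, agg$n))
fit_dis <- suppressWarnings(glm(y ~ x, family = binomial, data = dis))

# Same disaggregated model, but with a stricter tolerance and more iterations.
fit_tight <- suppressWarnings(
  glm(y ~ x, family = binomial, data = dis,
      control = glm.control(epsilon = 1e-12, maxit = 100)))

slopes <- c(aggregated    = unname(coef(fit_agg)["x"]),
            disaggregated = unname(coef(fit_dis)["x"]),
            tight         = unname(coef(fit_tight)["x"]))
print(slopes)
```

The maximum-likelihood slope is infinite for data like these, so whatever finite numbers are printed reflect where the iterations happened to stop rather than a property of the data. The suppressWarnings() calls are only there to keep the output short; in real use the warnings are the useful part.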
There's a very nice blog post about the numerics of GLM here:

  http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/

One of the conclusions is:

  And most practitioners are unfamiliar with this situation
  [numerical instability of GLMs in some cases] because:

  * They rightly do not concern themselves with the implementation
    details, as these are best left to the software implementors.
  * They are very likely to encounter issues [that] arise from
    separation, which will mask other issues.

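You appear to have a (near- or complete-) separation problem. I would
strongly recommend the logistf package, which fits logistic regression
by Firth's penalized likelihood and so gives finite estimates under
separation (when I tried it, I got near-identical results from the
aggregated and disaggregated data).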
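
A rough sketch of that comparison, again on made-up separated data rather than the original poster's. It assumes the logistf package is installed (the code skips the fit otherwise) and that its `weights` argument acts as case weights; the object names are hypothetical:

```r
# Hypothetical separated data in aggregated form: one row per x value,
# with w giving the number of identical trials.
agg <- data.frame(x = c(-2, -1, 1, 2),
                  y = c(0, 0, 1, 1),
                  w = c(5, 5, 5, 5))

# The same data with one row per trial.
dis <- agg[rep(seq_len(nrow(agg)), agg$w), c("x", "y")]

firth_slopes <- c(aggregated = NA_real_, disaggregated = NA_real_)

if (requireNamespace("logistf", quietly = TRUE)) {
  # Firth-penalized fits; the estimates stay finite under separation.
  f_agg <- logistf::logistf(y ~ x, data = agg, weights = w)
  f_dis <- logistf::logistf(y ~ x, data = dis)
  firth_slopes <- c(aggregated    = unname(f_agg$coefficients["x"]),
                    disaggregated = unname(f_dis$coefficients["x"]))
}

print(firth_slopes)
```

If the two slopes agree to several digits, the aggregated/disaggregated discrepancy seen with glm() was an artifact of the separation, not of the weights.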
I would also argue that if a 1% difference in the estimate of a
parameter whose confidence interval is essentially undefined (try
confint() on your fitted models, with MASS loaded) is concerning you,
then you have some bigger problems to wrestle with ...
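
For illustration, a small sketch of what "essentially undefined" looks like, using hypothetical separated data (not the data from this thread). The confint() call is wrapped in tryCatch() because profiling can fail outright on a fit like this:

```r
# Hypothetical completely separated data: y is 1 exactly when x > 0.
x <- rep(c(-2, -1, 1, 2), each = 5)
y <- as.integer(x > 0)

fit <- suppressWarnings(glm(y ~ x, family = binomial))

# Wald summary: the standard error dwarfs the estimate.
est <- unname(coef(fit)["x"])
se  <- summary(fit)$coefficients["x", "Std. Error"]
wald_ci <- est + c(-1, 1) * qnorm(0.975) * se
print(c(estimate = est, std.error = se))
print(wald_ci)

# Profile-likelihood interval; may error or return NA under separation.
prof_ci <- tryCatch(
  suppressWarnings(suppressMessages(confint(fit))),
  error = function(e) NULL)
print(prof_ci)
```

With a standard error that large, a 1% shift in the point estimate is far inside the noise.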
good luck
Ben Bolker