[R] problem with glm(family=binomial) when some levels have only 0 proportion values

Robert A LaBudde ral at lcfltd.com
Wed Mar 2 16:08:23 CET 2011


The algorithm is not converging. Your iterations are at the maximum.

It won't do any good to add a fractional number 
to all data, as the result will depend on the 
number added (try 1.0, 0.5 and 0.1 to see this).

The root problem is that your data are 
degenerate. Firstly, your types '2' and '3' are 
indistinguishable in your data. Secondly, 
consider the case without 'type'. If you have all 
zero data for 10 trials, you cannot discriminate 
among mu = 0, 0.00001, 0.0001, 0.001 or 0.01. 
This leads to numerical instability. Thirdly, the 
variance estimate in the IRLS will start at 0.0, which gives a singularity.

Fundamentally, the algorithm is failing because 
you are at the boundary of possibilities  for a 
parameter, so special techniques are needed to do 
maximum likelihood estimation.

The simple solution is to deal with the data for 
your types separately. Another is to do more 
batches for '2' and '3' to get an observed failure.



At 05:01 AM 3/2/2011, Jürg Schulze wrote:
>Hello everybody
>
>I want to compare the proportions of germinated seeds (seed batches of
>size 10) of three plant types (1,2,3) with a glm with binomial data
>(following the method in Crawley: Statistics,an introduction using R,
>p.247).
>The problem seems to be that in two plant types (2,3) all plants have
>proportions = 0.
>I give you my data and the model I'm running:
>
>   success failure type
>  [1,]   0   10    3
>  [2,]   0   10    2
>  [3,]   0   10    2
>  [4,]   0   10    2
>  [5,]   0   10    2
>  [6,]   0   10    2
>  [7,]   0   10    2
>  [8,]   4    6    1
>  [9,]   4    6    1
>[10,]   3    7    1
>[11,]   5    5    1
>[12,]   7    3    1
>[13,]   4    6    1
>[14,]   0   10    3
>[15,]   0   10    3
>[16,]   0   10    3
>[17,]   0   10    3
>[18,]   0   10    3
>[19,]   0   10    3
>[20,]   0   10    2
>[21,]   0   10    2
>[22,]   0   10    2
>[23,]   9    1    1
>[24,]   6    4    1
>[25,]   4    6    1
>[26,]   0   10    3
>[27,]   0   10    3
>
>  y<- cbind(success, failure)
>
>  Call:
>glm(formula = y ~ type, family = binomial)
>
>Deviance Residuals:
>        Min          1Q      Median          3Q
>-1.3521849  -0.0000427  -0.0000427  -0.0000427
>        Max
>  2.6477556
>
>Coefficients:
>               Estimate Std. Error z value Pr(>|z|)
>(Intercept)    0.04445    0.21087   0.211    0.833
>typeFxC      -23.16283 6696.13233  -0.003    0.997
>typeFxD      -23.16283 6696.13233  -0.003    0.997
>
>(Dispersion parameter for binomial family taken to be 1)
>
>     Null deviance: 134.395  on 26  degrees of freedom
>Residual deviance:  12.622  on 24  degrees of freedom
>AIC: 42.437
>
>Number of Fisher Scoring iterations: 20
>
>
>Huge standard errors are calculated and there is no difference between
>plant type 1 and 2 or between plant type 1 and 3.
>If I add 1 to all successes, so that all the 0 values disappear, the
>standard error becomes lower and I find highly significant differences
>between the plant types.
>
>suc<- success + 1
>fail<- 11 - suc
>Y<- cbind(suc,fail)
>
>Call:
>glm(formula = Y ~ type, family = binomial)
>
>Deviance Residuals:
>        Min          1Q      Median          3Q
>-1.279e+00  -4.712e-08  -4.712e-08   0.000e+00
>        Max
>  2.584e+00
>
>Coefficients:
>             Estimate Std. Error z value Pr(>|z|)
>(Intercept)   0.2231     0.2023   1.103     0.27
>typeFxC      -2.5257     0.4039  -6.253 4.02e-10 ***
>typeFxD      -2.5257     0.4039  -6.253 4.02e-10 ***
>---
>Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>
>(Dispersion parameter for binomial family taken to be 1)
>
>     Null deviance: 86.391  on 26  degrees of freedom
>Residual deviance: 11.793  on 24  degrees of freedom
>AIC: 76.77
>
>Number of Fisher Scoring iterations: 4
>
>
>So I think the 0 values of all plants of group 2 and 3 are the
>problem, do you agree?
>I don't know why this is a problem, or how I can explain to a reviewer
>why a data transformation (+ 1) is necessary with such a dataset.
>
>I would greatly appreciate any comments.
>Juerg
>______________________________________
>
>Jürg Schulze
>Department of Environmental Sciences
>Section of Conservation Biology
>University of Basel
>St. Johanns-Vorstadt 10
>4056 Basel, Switzerland
>Tel.: ++41/61/267 08 47
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"
================================================================



More information about the R-help mailing list