[R] problem with glm(family=binomial) when some levels have only 0 proportion values

Wed Mar 2 11:01:42 CET 2011

Hello everybody

I want to compare the proportions of germinated seeds (seed batches of
size 10) among three plant types (1, 2, 3) using a glm with binomial
errors (following the method in Crawley: Statistics: An Introduction
Using R, p. 247).
The problem seems to be that in two plant types (2, 3) all plants have
a proportion of 0.
Here are my data and the model I'm running:

      success failure type
 [1,]       0      10    3
 [2,]       0      10    2
 [3,]       0      10    2
 [4,]       0      10    2
 [5,]       0      10    2
 [6,]       0      10    2
 [7,]       0      10    2
 [8,]       4       6    1
 [9,]       4       6    1
[10,]       3       7    1
[11,]       5       5    1
[12,]       7       3    1
[13,]       4       6    1
[14,]       0      10    3
[15,]       0      10    3
[16,]       0      10    3
[17,]       0      10    3
[18,]       0      10    3
[19,]       0      10    3
[20,]       0      10    2
[21,]       0      10    2
[22,]       0      10    2
[23,]       9       1    1
[24,]       6       4    1
[25,]       4       6    1
[26,]       0      10    3
[27,]       0      10    3

y <- cbind(success, failure)

Call:
glm(formula = y ~ type, family = binomial)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.3521849  -0.0000427  -0.0000427  -0.0000427   2.6477556

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.04445    0.21087   0.211    0.833
typeFxC      -23.16283 6696.13233  -0.003    0.997
typeFxD      -23.16283 6696.13233  -0.003    0.997

(Dispersion parameter for binomial family taken to be 1)

     Null deviance: 134.395  on 26  degrees of freedom
Residual deviance:  12.622  on 24  degrees of freedom
AIC: 42.437

Number of Fisher Scoring iterations: 20
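
For anyone who wants to reproduce this, here is a minimal sketch of the fit from the data listed above. It assumes that type is a factor with levels 1, 2 and 3, so the coefficients come out labelled type2 and type3 rather than typeFxC and typeFxD as in my output. The names s1, f1, b0_by_hand and p2 are only illustrative. The sketch also checks the intercept by hand and shows the germination probability that the reported coefficients imply for type 2.

```r
# Data copied from the listing above. 'type' is ASSUMED to be a factor with
# levels 1, 2, 3 (the original output labels the contrasts typeFxC / typeFxD).
success <- c(0, 0, 0, 0, 0, 0, 0, 4, 4, 3, 5, 7, 4,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 6, 4, 0, 0)
failure <- 10 - success                      # every batch has 10 seeds
type <- factor(c(3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
                 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 3, 3))

y <- cbind(success, failure)
fit <- glm(y ~ type, family = binomial)
print(coef(summary(fit)))

# Pooled germination counts per plant type
succ_by_type <- tapply(success, type, sum)   # 46, 0, 0
print(succ_by_type)

# The intercept is the log odds of germination in type 1
s1 <- sum(success[type == "1"])              # 46
f1 <- sum(failure[type == "1"])              # 44
b0_by_hand <- log(s1 / f1)                   # log(46/44), about 0.04445
print(c(by_hand = b0_by_hand, glm = unname(coef(fit)[1])))

# Germination probability implied for type 2 by the coefficients reported above
p2 <- plogis(0.04445 - 23.16283)             # on the order of 1e-10
print(p2)
```

The hand calculation agrees with the reported intercept of 0.04445, and the implied probability for types 2 and 3 is practically 0, which matches the observed 0 out of 90 seeds in each of these types.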

The standard errors are huge, and the z tests show no significant
difference between plant types 1 and 2 or between plant types 1 and 3.
If I add 1 to all successes, so that all the 0 values disappear, the
standard errors become much smaller and I find highly significant
differences between the plant types.

suc <- success + 1
fail <- 11 - suc   # same as the original failure count; each batch now totals 11
Y <- cbind(suc, fail)

Call:
glm(formula = Y ~ type, family = binomial)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.279e+00  -4.712e-08  -4.712e-08   0.000e+00   2.584e+00

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.2231     0.2023   1.103     0.27
typeFxC      -2.5257     0.4039  -6.253 4.02e-10 ***
typeFxD      -2.5257     0.4039  -6.253 4.02e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

     Null deviance: 86.391  on 26  degrees of freedom
Residual deviance: 11.793  on 24  degrees of freedom
AIC: 76.77

Number of Fisher Scoring iterations: 4
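
To see what the + 1 does to the data, here is a small sketch that recomputes the second fit by hand from the pooled counts. As before, it assumes type is a factor with levels 1, 2, 3, and the names s1, f1, s2, f2, b2_by_hand and se2_by_hand are only illustrative. The standard error uses the usual formula for a log odds ratio from pooled counts, which I believe applies here because the model has one parameter per plant type.

```r
# Data copied from the listing above; 'type' is ASSUMED to be a factor
# with levels 1, 2, 3.
success <- c(0, 0, 0, 0, 0, 0, 0, 4, 4, 3, 5, 7, 4,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 6, 4, 0, 0)
type <- factor(c(3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
                 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 3, 3))

suc <- success + 1          # one extra germinated seed per batch
fail <- 11 - suc            # failures unchanged, batch size becomes 11
Y <- cbind(suc, fail)
fit1 <- glm(Y ~ type, family = binomial)

# Pooled counts after the transformation
s1 <- sum(suc[type == "1"]);  f1 <- sum(fail[type == "1"])   # 55 and 44
s2 <- sum(suc[type == "2"]);  f2 <- sum(fail[type == "2"])   # 9 and 90

# Coefficient for type 2 = difference in log odds between type 2 and type 1
b2_by_hand <- log(s2 / f2) - log(s1 / f1)            # about -2.5257
# Standard error of a log odds ratio from the four pooled counts
se2_by_hand <- sqrt(1/s1 + 1/f1 + 1/s2 + 1/f2)       # about 0.4039

print(c(by_hand = b2_by_hand, glm = unname(coef(fit1)[2])))
print(c(by_hand = se2_by_hand, glm = unname(coef(summary(fit1))[2, "Std. Error"])))
```

So after the transformation, types 2 and 3 each have 9 germinated seeds out of 99 instead of 0 out of 90, and the estimates and standard errors for these types are determined entirely by the added seeds.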

So I think the problem is that all plants in groups 2 and 3 have a
proportion of 0. Do you agree?
I don't understand why this is a problem, or how I could justify to a
reviewer that a data transformation (+ 1) is necessary with such a
dataset.

I would greatly appreciate any comments.
Juerg
______________________________________

Jürg Schulze
Department of Environmental Sciences
Section of Conservation Biology
University of Basel
St. Johanns-Vorstadt 10
4056 Basel, Switzerland
Tel.: ++41/61/267 08 47