[R-sig-Geo] Problem with categorical variable coefficients and se in glm

Mon Nov 5 02:41:54 CET 2012

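Will, if your UM category is all zeros, there is no variance within it, so I would say that you cannot make any statistical inference about it. With a log link the fitted mean for UM is pushed toward zero, so the coefficient heads toward minus infinity and its standard error blows up, which is what the -18.3 and 4215.7 in your output reflect. Presumably there is no option of collecting more data? Changing one of the data points from zero to one certainly gives a cleaner-looking estimate of the standard error, but it is an incorrect one.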

I would consider leaving the UM category out of the formal data analysis. You can still report and discuss the findings for the UM category, but by keeping the made-up variance out of the statistical analysis you avoid giving the impression that the statistical results are contaminated in some way.
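
A minimal sketch of what leaving UM out could look like, in case it helps. The data frame is retyped from the table in the original message; the names dat, dat_noUM and mod_noUM are my own, and the factor level order (D as the reference level) is an assumption inferred from the order of the coefficients in the posted output.

```r
library(MASS)  # provides glm.nb()

# Data retyped from the table in the original message.
# The level order is assumed from the coefficient order in the posted summary.
dat <- data.frame(
  Resp = c(0, 0, 0, 0, 3, 0, 0, 0,        # D  (rows 1-8)
           11, 11, 3, 14, 19, 41,         # F  (rows 9-14)
           12, 55, 3, 0, 0,               # S  (rows 15-19)
           30, 4, 10,                     # F  (rows 20-22)
           99, 3, 1, 7, 4, 0, 2, 1,       # DS (rows 23-30)
           0, 0, 0, 0, 1,                 # UB (rows 31-35)
           0, 0, 0, 0, 0),                # UM (rows 36-40)
  Cat = factor(
    rep(c("D", "F", "S", "F", "DS", "UB", "UM"),
        times = c(8, 6, 5, 3, 8, 5, 5)),
    levels = c("D", "F", "S", "DS", "UB", "UM")
  )
)

# Drop the UM rows and remove the now-empty factor level before fitting.
dat_noUM <- droplevels(subset(dat, Cat != "UM"))
mod_noUM <- glm.nb(Resp ~ Cat, data = dat_noUM)
summary(mod_noUM)

# UM is then reported descriptively rather than through a coefficient.
table(dat$Resp[dat$Cat == "UM"])
```

Writing the formula as Resp ~ Cat with data = dat_noUM, rather than data$Resp ~ data$Cat, is only a convenience so that the model picks up the subsetted data frame.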

--Seth

-----Original Message-----
From: r-sig-geo-bounces at r-project.org [mailto:r-sig-geo-bounces at r-project.org] On Behalf Of WillM
Sent: Saturday, November 03, 2012 7:16 PM
To: r-sig-geo at r-project.org
Subject: [R-sig-Geo] Problem with categorical variable coefficients and se in glm

Hi all

I'm hoping that this is something that people deal with regularly and you
can help me out quickly even though it is a bit more of a stats question
than R.

I have a dataset where data$Resp is a count response variable (lots of 0s)
so I used a negative binomial glm with a categorical explanatory variable. The
categories are the types of vegetation that I stratified my sampling by - so
they are not an arbitrary post hoc decision.

The UM category only has 0's for a response and produces a large coefficient
and large standard error (see output below). So I added a small number (1)
to one row of the UM category to explore what was happening and get a better
result. With a continuous response variable you can add a very small number
(say 0.001) so that it is still representative of 0, but with this count
data, 1 is the minimum.

I get a better estimate, but is there some better way of dealing with this
type of situation? I could possibly combine UM and UB categories, but I did
want to keep them separate.

Thanks a lot :)

>data

   Resp Cat
1     0   D
2     0   D
3     0   D
4     0   D
5     3   D
6     0   D
7     0   D
8     0   D
9    11   F
10   11   F
11    3   F
12   14   F
13   19   F
14   41   F
15   12   S
16   55   S
17    3   S
18    0   S
19    0   S
20   30   F
21    4   F
22   10   F
23   99  DS
24    3  DS
25    1  DS
26    7  DS
27    4  DS
28    0  DS
29    2  DS
30    1  DS
31    0  UB
32    0  UB
33    0  UB
34    0  UB
35    1  UB
36    0  UM
37    0  UM
38    0  UM
39    0  UM
40    0  UM

> mod.nb <- glm.nb(data$Resp ~ data$Cat, data=data) 

> summary(mod.nb)

Call:
glm.nb(formula = data$Resp ~ data$Cat, data = data, init.theta = 0.5087557508, 
    link = log)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.85799  -0.75714  -0.58082  -0.00009   1.95946  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -0.9808     0.7609  -1.289 0.197409    
data$CatF      3.7464     0.8969   4.177 2.95e-05 ***
data$CatS      3.6199     0.9932   3.645 0.000268 ***
data$CatDS     3.6636     0.9128   4.013 5.99e-05 ***
data$CatUB    -0.6286     1.4043  -0.448 0.654427    
data$CatUM   -18.3218  4215.7113  -0.004 0.996532    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Negative Binomial(0.5088) family taken to be 1)

    Null deviance: 77.163  on 39  degrees of freedom
Residual deviance: 33.700  on 34  degrees of freedom
AIC: 191.12

Number of Fisher Scoring iterations: 1

              Theta:  0.509 
          Std. Err.:  0.152 

 2 x log-likelihood:  -177.120 

> data1<-data
> data1[40,1]<- 1 #add a small value to one of the UM categories
> mod.nb1 <- glm.nb(data1$Resp ~ data1$Cat, data=data1)     #run model again

> summary(mod.nb1)

Call:
glm.nb(formula = data1$Resp ~ data1$Cat, data = data1, init.theta = 0.515774723, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8671  -0.7593  -0.5814  -0.2098   1.9726  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9808     0.7587  -1.293 0.196112    
data1$CatF    3.7464     0.8934   4.194 2.75e-05 ***
data1$CatS    3.6199     0.9888   3.661 0.000251 ***
data1$CatDS   3.6636     0.9092   4.030 5.59e-05 ***
data1$CatUB  -0.6286     1.4012  -0.449 0.653712    
data1$CatUM  -0.6286     1.4012  -0.449 0.653712    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for Negative Binomial(0.5158) family taken to be 1)

    Null deviance: 76.133  on 39  degrees of freedom
Residual deviance: 36.328  on 34  degrees of freedom
AIC: 196.69

Number of Fisher Scoring iterations: 1

              Theta:  0.516 
          Std. Err.:  0.152 

 2 x log-likelihood:  -182.686 

Thanks! 
