[R-sig-Geo] Problem with categorical variable coefficients and se in glm
Seth W. Bigelow
seth at swbigelow.net
Mon Nov 5 02:41:54 CET 2012
Will, if your category UM is all zeros, there can be no variance to it, so I would say that you cannot make any statistical inference about it. Presumably there is no option of collecting more data? Certainly changing one of the data points from zero to one gives a cleaner-looking estimate of standard error, but it's an incorrect one.
I would consider leaving the UM category out of the formal data analysis. You can still report & discuss the findings regarding the UM category, but by keeping the made-up variance out of the statistical analysis you can avoid giving the impression that the statistical results are contaminated in some way.
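A minimal sketch of that suggestion, assuming the data from the original message are rebuilt into a data frame called `dat` (a name chosen here so it does not mask R's `data()` function) and that the factor levels are ordered as in the coefficient table, with D as the reference. It refits the negative binomial model without UM and summarises UM descriptively instead:

```r
library(MASS)  # provides glm.nb()

# Rebuild the 40 observations from the listing in the original message.
# 'dat' and the explicit level order are assumptions made for this sketch.
dat <- data.frame(
  Resp = c(0, 0, 0, 0, 3, 0, 0, 0,        # D  (rows 1-8)
           11, 11, 3, 14, 19, 41,         # F  (rows 9-14)
           12, 55, 3, 0, 0,               # S  (rows 15-19)
           30, 4, 10,                     # F  (rows 20-22)
           99, 3, 1, 7, 4, 0, 2, 1,       # DS (rows 23-30)
           0, 0, 0, 0, 1,                 # UB (rows 31-35)
           0, 0, 0, 0, 0),                # UM (rows 36-40)
  Cat = factor(rep(c("D", "F", "S", "F", "DS", "UB", "UM"),
                   times = c(8, 6, 5, 3, 8, 5, 5)),
               levels = c("D", "F", "S", "DS", "UB", "UM"))
)

# Drop the all-zero UM stratum and remove its now-empty factor level
dat_noUM <- droplevels(subset(dat, Cat != "UM"))

# Refit on the remaining five categories
mod_noUM <- glm.nb(Resp ~ Cat, data = dat_noUM)
summary(mod_noUM)

# Report UM descriptively rather than through the model
table(dat$Resp[dat$Cat == "UM"])
```

Because the all-zero UM rows contribute essentially nothing to the likelihood, the coefficients for the other categories should come out close to those in the full fit.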
--Seth
-----Original Message-----
From: r-sig-geo-bounces at r-project.org [mailto:r-sig-geo-bounces at r-project.org] On Behalf Of WillM
Sent: Saturday, November 03, 2012 7:16 PM
To: r-sig-geo at r-project.org
Subject: [R-sig-Geo] Problem with categorical variable coefficients and se in glm
Hi all
I'm hoping that this is something that people deal with regularly and you
can help me out quickly even though it is a bit more of a stats question
than R.
I have a dataset where data$Resp is a count response variable (lots of 0s)
so I used a negative binomial glm with a categorical explanatory variable. The
categories are the types of vegetation that I stratified my sampling by - so
they are not an arbitrary post hoc decision.
The UM category only has 0's for a response and produces a large coefficient
and large standard error (see output below). So I added a small number (1)
to one row of the UM category to explore what was happening and get a better
result. With a continuous response variable you can add a very small number
(say 0.001) so that it is still representative of 0, but with this count
data, 1 is the minimum.
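To see where the large UM coefficient comes from, it may help to compare the raw group means on the log scale. The sketch below assumes the listed data have been rebuilt into a data frame named `dat` (a hypothetical name) with D as the reference level. With a log link and a single categorical predictor, each coefficient corresponds to the log of a ratio of group means, so a group whose mean is exactly 0 has no finite estimate:

```r
# Rebuild the data from the listing in this message ('dat' is an assumed name)
dat <- data.frame(
  Resp = c(0, 0, 0, 0, 3, 0, 0, 0,
           11, 11, 3, 14, 19, 41,
           12, 55, 3, 0, 0,
           30, 4, 10,
           99, 3, 1, 7, 4, 0, 2, 1,
           0, 0, 0, 0, 1,
           0, 0, 0, 0, 0),
  Cat = factor(rep(c("D", "F", "S", "F", "DS", "UB", "UM"),
                   times = c(8, 6, 5, 3, 8, 5, 5)),
               levels = c("D", "F", "S", "DS", "UB", "UM"))
)

# Per-category sample mean and variance of the counts
grp_mean <- tapply(dat$Resp, dat$Cat, mean)
grp_var  <- tapply(dat$Resp, dat$Cat, var)
print(cbind(mean = grp_mean, var = grp_var))

# Log ratio of each group mean to the reference (D) mean.
# These line up with the reported coefficients; UM gives -Inf.
log_ratio <- log(grp_mean) - log(grp_mean[["D"]])
print(round(log_ratio, 4))

# Fitted UM mean implied by the reported estimates (intercept + UM coefficient)
exp(-0.9808 + -18.3218)
```

The reported -18.3218 is just where the fitting algorithm stopped on its way towards minus infinity; the implied fitted mean is effectively zero, and the likelihood is nearly flat in that direction, which is why the standard error is so large.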
I get a better estimate, but is there some better way of dealing with this
type of situation? I could possibly combine UM and UB categories, but I did
want to keep them separate.
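For reference, combining UM and UB could be sketched as follows. This assumes the listed data are held in a data frame called `dat` (a hypothetical name) and uses `UB_UM` as an illustrative label for the merged level; assigning the same label to two factor levels merges them:

```r
library(MASS)  # provides glm.nb()

# Rebuild the data from the listing in this message ('dat' is an assumed name)
dat <- data.frame(
  Resp = c(0, 0, 0, 0, 3, 0, 0, 0,
           11, 11, 3, 14, 19, 41,
           12, 55, 3, 0, 0,
           30, 4, 10,
           99, 3, 1, 7, 4, 0, 2, 1,
           0, 0, 0, 0, 1,
           0, 0, 0, 0, 0),
  Cat = factor(rep(c("D", "F", "S", "F", "DS", "UB", "UM"),
                   times = c(8, 6, 5, 3, 8, 5, 5)),
               levels = c("D", "F", "S", "DS", "UB", "UM"))
)

# Copy the factor and give UB and UM the same label, which merges them
dat$Cat2 <- dat$Cat
levels(dat$Cat2)[levels(dat$Cat2) %in% c("UB", "UM")] <- "UB_UM"
table(dat$Cat2)

# Refit with the merged category; it now contains one non-zero count
mod_merged <- glm.nb(Resp ~ Cat2, data = dat)
summary(mod_merged)
```

The merged level has a finite estimate only because it borrows the single non-zero count from UB, so this changes the question being asked rather than recovering information about UM on its own.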
Thanks a lot :)
> data
   Resp Cat
1     0   D
2     0   D
3     0   D
4     0   D
5     3   D
6     0   D
7     0   D
8     0   D
9    11   F
10   11   F
11    3   F
12   14   F
13   19   F
14   41   F
15   12   S
16   55   S
17    3   S
18    0   S
19    0   S
20   30   F
21    4   F
22   10   F
23   99  DS
24    3  DS
25    1  DS
26    7  DS
27    4  DS
28    0  DS
29    2  DS
30    1  DS
31    0  UB
32    0  UB
33    0  UB
34    0  UB
35    1  UB
36    0  UM
37    0  UM
38    0  UM
39    0  UM
40    0  UM
> mod.nb <- glm.nb(data$Resp ~ data$Cat, data=data)
> summary(mod.nb)
Call:
glm.nb(formula = data$Resp ~ data$Cat, data = data, init.theta =
0.5087557508,
link = log)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.85799  -0.75714  -0.58082  -0.00009   1.95946
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.9808     0.7609  -1.289 0.197409
data$CatF      3.7464     0.8969   4.177 2.95e-05 ***
data$CatS      3.6199     0.9932   3.645 0.000268 ***
data$CatDS     3.6636     0.9128   4.013 5.99e-05 ***
data$CatUB    -0.6286     1.4043  -0.448 0.654427
data$CatUM   -18.3218  4215.7113  -0.004 0.996532
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.5088) family taken to be 1)
Null deviance: 77.163 on 39 degrees of freedom
Residual deviance: 33.700 on 34 degrees of freedom
AIC: 191.12
Number of Fisher Scoring iterations: 1
Theta: 0.509
Std. Err.: 0.152
2 x log-likelihood: -177.120
> data1<-data
> data1[40,1]<- 1 #add a small value to one of the UM categories
> mod.nb1 <- glm.nb(data1$Resp ~ data1$Cat, data=data1) #run model again
> summary(mod.nb1)
Call:
glm.nb(formula = data1$Resp ~ data1$Cat, data = data1, init.theta =
0.515774723,
link = log)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8671  -0.7593  -0.5814  -0.2098   1.9726
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.9808     0.7587  -1.293 0.196112
data1$CatF     3.7464     0.8934   4.194 2.75e-05 ***
data1$CatS     3.6199     0.9888   3.661 0.000251 ***
data1$CatDS    3.6636     0.9092   4.030 5.59e-05 ***
data1$CatUB   -0.6286     1.4012  -0.449 0.653712
data1$CatUM   -0.6286     1.4012  -0.449 0.653712
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(0.5158) family taken to be 1)
Null deviance: 76.133 on 39 degrees of freedom
Residual deviance: 36.328 on 34 degrees of freedom
AIC: 196.69
Number of Fisher Scoring iterations: 1
Theta: 0.516
Std. Err.: 0.152
2 x log-likelihood: -182.686
Thanks!