[R] Systematically biased count data regression model
Matthew and Kim Bowser
matthewnkim at gmail.com
Thu Aug 9 17:43:14 CEST 2007
Dear all,
I am attempting to explain patterns of arthropod family richness
(count data) using a regression model. It seems to be able to do a
pretty good job as an explanatory model (i.e. demonstrating
relationships between dependent and independent variables), but it has
systematic problems as a predictive model: It is biased high at low
observed values of family richness and biased low at high observed
values of family richness (see attached pdf). I have tried diverse
kinds of reasonable regression models mostly as in Zeileis, et al.
(2007), as well as transforming my variables, both with only small
improvements.
Do you have suggestions for making a model that would perform better
as a predictive model?
Thank you for your time.
Sincerely,
Matthew Bowser
STEP student
USFWS Kenai National Wildlife Refuge
Soldotna, Alaska, USA
M.Sc. student
University of Alaska Fairbanks
Fairbankse, Alaska, USA
Reference
Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for
count data in R. Technical Report 53, Department of Statistics and
Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL
http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
Code
`data` <-
structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4,
9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15,
12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13,
1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12,
5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8,
5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8,
7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8,
10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16,
3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12,
16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6,
4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3,
6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3,
0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15,
2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159,
159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161,
175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161,
161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176,
165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165,
165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167,
175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164,
167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173,
178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162,
173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166,
170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164,
170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172,
162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174,
166, 172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174,
172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170,
171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172,
171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160,
160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176,
176, 171, 176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166,
168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166,
166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253,
0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26,
0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194,
0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354,
0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312,
0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122,
0.289, 0.475, 0.048, 0.416, 0.481, 0.159, 0.238, 0.183, 0.28,
0.32, 0.288, 0.24, 0.287, 0.363, 0.367, 0.24, 0.55, 0.441, 0.34,
0.295, 0.23, 0.32, 0.184, 0.306, 0.232, 0.289, 0.341, 0.221,
0.333, 0.17, 0.139, 0.2, 0.204, 0.301, 0.253, -0.08, 0.309, 0.232,
0.23, 0.239, -0.12, 0.26, 0.285, 0.45, 0.348, 0.396, 0.311, 0.318,
0.31, 0.261, 0.441, 0.147, 0.283, 0.339, 0.224, 0.5, 0.265, 0.2,
0.287, 0.398, 0.116, 0.292, 0.045, 0.137, 0.542, 0.171, 0.38,
0.469, 0.325, 0.139, 0.166, 0.247, 0.253, 0.466, 0.26, 0.288,
0.34, 0.288, 0.26, 0.178, 0.274, 0.358, 0.285, 0.225, 0.162,
0.223, 0.301, -0.398, -0.2, 0.239, 0.228, 0.255, 0.166, 0.306,
0.28, 0.279, 0.208, 0.377, 0.413, 0.489, 0.417, 0.333, 0.208,
0.232, 0.431, 0.283, 0.241, 0.105, 0.18, -0.172, -0.374, 0.25,
0.043, 0.215, 0.204, 0.19, 0.177, -0.106, -0.143, 0.062, 0.462,
0.256, 0.229, 0.314, 0.415, 0.307, 0.238, -0.35, 0.34, 0.275,
0.097, 0.353, 0.214, 0.435, 0.055, -0.289, 0.239, 0.186, 0.135,
0.259, 0.268, 0.258, 0.032, 0.489, 0.389, 0.298, 0.164, 0.325,
0.254, -0.059, 0.524, 0.539, 0.25, 0.175, 0.326, 0.302, -0.047,
-0.301, -0.149, 0.358, 0.495, 0.311, 0.235, 0.558, -0.156, 0,
0.146, 0.329, -0.069, -0.352, -0.356, -0.206, -0.179, 0.467,
-0.325, 0.39, -0.399, -0.165, 0.267, -0.334, -0.17, 0.58, 0.228,
0.234, 0.351, 0.3, -0.018, 0.125, 0.176, 0.322, 0.246, 0.376,
-0.185, 0.342, 0.142, -0.075, 0.186, 0.333, 0.112, 0.272, 0.277,
0.203, 0.37, 0.465, 0.425), VegS = c(14, 11, 18, 21, 31, 20,
11, 17, 10, 15, 27, 8, 17, 13, 16, 9, 10, 16, 10, 15, 11, 9,
14, 11, 11, 10, 24, 18, 12, 6, 25, 21, 25, 8, 14, 18, 11, 16,
20, 16, 10, 16, 18, 14, 13, 11, 15, 23, 11, 28, 12, 17, 12, 18,
10, 15, 7, 15, 9, 16, 18, 16, 20, 18, 12, 19, 16, 18, 20, 15,
24, 9, 15, 9, 16, 14, 17, 14, 7, 9, 9, 12, 13, 15, 14, 11, 17,
8, 14, 15, 12, 8, 10, 12, 8, 16, 15, 22, 16, 21, 10, 15, 20,
14, 27, 21, 19, 22, 21, 11, 13, 10, 13, 14, 9, 22, 22, 20, 12,
16, 20, 19, 26, 14, 13, 23, 14, 22, 19, 15, 28, 16, 20, 25, 10,
19, 0, 10, 8, 11, 17, 13, 17, 23, 37, 32, 19, 26, 12, 11, 24,
11, 21, 25, 8, 15, 21, 31, 17, 0, 12, 6, 23, 19, 29, 14, 9, 0,
18, 23, 20, 15, 15, 17, 27, 17, 2, 24, 17, 16, 26, 11, 23, 24,
10, 26, 21, 12, 20, 12, 29, 22, 20, 16, 41, 19, 27, 28, 10, 35,
28, 23, 14, 5, 23, 15, 17, 12, 11, 24, 11, 14, 7, 7, 27, 17,
15, 10, 16, 2, 11, 21, 18, 15, 8, 17, 10, 18, 15, 18, 31, 9,
14, 11, 19, 22, 8, 19, 17, 18, 25, 11, 17, 32, 25, 18, 21, 19,
35, 14, 29, 9, 28, 14), T = c(13, 15.4, 15.6, 12.3, 12.7, 13.3,
6, 13, 9.2, 17.8, 9.4, 14.2, 8.7, 14, 8.6, 13.9, 9.2, 15.1, 9.4,
16.5, 14, 11.5, 15.5, 13.3, 12.7, 14, 10.5, 14, 10.1, 16.7, 15.2,
11.2, 11.7, 17.9, 13.3, 13.9, 8.7, 16.7, 7.8, 7.9, 10.9, 15.5,
14, 14.1, 14.3, 13.3, 11.6, 16.5, 12.7, 12.6, 8.3, 9, 12.4, 15,
11.8, 14.1, 10.5, 12.4, 10.5, 17.5, 9.2, 16.3, 5.3, 5.9, 11.9,
8.9, 7.7, 15.2, 8.6, 13.2, 9.5, 15.8, 12.5, 13.8, 10.7, 10.5,
7.7, 11, 9.3, 14.6, 12, 15.4, 12.3, 14, 8.3, 19.8, 15.5, 14.3,
9, 7.6, 15.3, 12.8, 14.4, 14, 10.5, 8.9, 13.4, 12.8, 12.9, 11.2,
13.1, 10, 12.4, 15.4, 7.6, 14.9, 13.1, 11.1, 8.6, 13.6, 8.4,
11.5, 12.5, 15.6, 8.3, 9.6, 8.7, 9.7, 10.5, 12.8, 8.6, 12.7,
6.7, 8.5, 9.9, 8.3, 12.8, 9.8, 10.5, 10.7, 9.6, 11.1, 13.7, 9.5,
8.8, 3.4, 10.2, 3.5, 8.4, 11.9, 12.3, 10.2, 13.1, 8.4, 3.1, 7.2,
13.2, 7.6, 11.3, 13.5, 7.8, 5.5, 10.7, 4.8, 7.9, 13.5, 13.5,
2.1, 7.2, 9.7, 12, 9.2, 13.2, 12, 17.1, 7.9, 12.7, 11.8, 16,
6.4, 12.9, 8.1, 12.6, 10.2, 13.3, 8.5, 9.7, 9.9, 18.2, 11, 8.4,
1.5, 7.3, 10.6, 13.6, 8.4, 7.2, 13, 15.5, 9.7, 13.2, 5.9, 9.5,
10.4, 12.9, 2.6, 17.2, 15.4, 10.5, 6.7, 6.6, 7.6, 10.5, 15.6,
10.4, 5.1, 11, 9.7, 4.2, 3.6, 8.5, 11.5, 8.4, 6.9, 11, 10.4,
3.4, 3.2, 5.5, 2.4, 11.2, 2.6, 15.1, 16, 13.7, 10.5, 3.5, 13.4,
11.5, 12.3, 13.9, 14.5, 12.8, 16.8, 16.9, 13.5, 17.2, 12.5, 12.4,
11.8, 12, 10.9, 6.7, 10.9, 2.3, 5.2, 13.1, 12.1, 13.9, 12.9,
7.2, 12.5, 16, 11.7), Hemlock = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0), Alpine = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0),
Snow = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("D", "Day", "NDVI",
"VegS", "T", "Hemlock", "Alpine", "Snow"), row.names = as.integer(c(NA,
254)), class = "data.frame")
#Running a regression.
library(MASS)
fit <- glm.nb(D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow, data = data)
summary(fit, correlation = FALSE)
Call:
glm.nb(formula = D ~ Day + NDVI + VegS + T + Hemlock + Alpine +
Snow, data = data, init.theta = 11.3494468596771, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7451 -0.7196 -0.1958 0.5389 2.7096
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.882684 0.929598 -3.101 0.001929 **
Day 0.020325 0.005540 3.669 0.000244 ***
NDVI 1.353361 0.221471 6.111 9.91e-10 ***
VegS 0.016731 0.004931 3.393 0.000691 ***
T 0.074189 0.009491 7.817 5.42e-15 ***
Hemlock -0.588858 0.174980 -3.365 0.000765 ***
Alpine -0.452199 0.099296 -4.554 5.26e-06 ***
Snow -1.902610 0.735708 -2.586 0.009707 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(11.3494) family taken to be 1)
Null deviance: 515.16 on 253 degrees of freedom
Residual deviance: 278.89 on 246 degrees of freedom
AIC: 1300.1
Number of Fisher Scoring iterations: 1
Theta: 11.35
Std. Err.: 2.71
2 x log-likelihood: -1282.075
#Plotting observed versus predicted values.
pdf(file="ObsVPred.pdf", width=4, height=4, family="Times", pointsize=11)
par(mar = c(5,5,1,1), pch=1)
plot(data$D, fit$fitted.values, main="",
ylab=expression(italic(D)[predicted]),
xlab=expression(italic(D)[observed]))
abline(a=0,b=1, lty=2)
lines(lowess(data$D, fit$fitted.values))
dev.off()
#This appears to be a decent explanatory model, but as a predictive
model it is systematically biased. It is biased high at low observed
values of D and biased low at high values observed values of D.
More information about the R-help
mailing list