# [R] Systematically biased count data regression model

Matthew and Kim Bowser matthewnkim at gmail.com
Fri Aug 10 00:43:48 CEST 2007

```Dear all,

anonymity, but to whom I am grateful.

PLEASE do not quote my name or email (I am trying to stay off spam lists)

Matthew:  I think this is just a reflection of
the fact the model does not fit perfectly.  The
example below is a simple linear regression that
is highly significant but has R-square of
16%.  This model as well is "biased high at low
observed values of y and biased low at high values observed values of y"

set.seed(1)
n <- 200
m <- data.frame(x=rnorm(n,mean=10,sd=2))
m\$y <- m\$x + rnorm(n,sd=4)    # simulate using intercept 0, slope 1
f <- lm(y ~ x,data=m)
print(summary(f))
#
# Call:
# lm(formula = y ~ x, data = m)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -11.7310  -2.1709  -0.1009   2.6733  10.3446
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.6274     1.5830   0.396    0.692
# x             0.9538     0.1546   6.170 3.77e-09 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 4.052 on 198 degrees of freedom
# Multiple R-Squared: 0.1613,     Adjusted R-squared: 0.157
# F-statistic: 38.07 on 1 and 198 DF,  p-value: 3.773e-09
#
plot(m\$y,f\$fitted.values,xlab="Observed",ylab="Predicted")
lines(lowess(m\$y,f\$fitted.values),col="red",lty=2)
abline(c(0,1))
legend("topleft",lty=c(2,1),col=c("red","black"),legend=c("Loess","45-degree"))
- Show quoted text -

At 2007-08-09  08:43, Matthew and Kim Bowser wrote:
>Dear all,
>
>I am attempting to explain patterns of arthropod family richness
>(count data) using a regression model.  It seems to be able to do a
>pretty good job as an explanatory model (i.e. demonstrating
>relationships between dependent and independent variables), but it has
>systematic problems as a predictive model:  It is biased high at low
>observed values of family richness and biased low at high observed
>values of family richness (see attached pdf).  I have tried diverse
>kinds of reasonable regression models mostly as in Zeileis, et al.
>(2007), as well as transforming my variables, both with only small
>improvements.
>
>Do you have suggestions for making a model that would perform better
>as a predictive model?
>
>
>Sincerely,
>
>Matthew Bowser
>
>STEP student
>USFWS Kenai National Wildlife Refuge
>
>M.Sc. student
>
>Reference
>
>Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for
>count data in R. Technical Report 53, Department of Statistics and
>Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL
>http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
>
>[snip]
>
>
>#This appears to be a decent explanatory model, but as a predictive
>model it is systematically biased.  It is biased high at low observed
>values of D and biased low at high values observed values of D.
>
- Show quoted text -

On 8/9/07, Matthew and Kim Bowser <matthewnkim at gmail.com> wrote:
> Dear all,
>
> I am attempting to explain patterns of arthropod family richness
> (count data) using a regression model.  It seems to be able to do a
> pretty good job as an explanatory model (i.e. demonstrating
> relationships between dependent and independent variables), but it has
> systematic problems as a predictive model:  It is biased high at low
> observed values of family richness and biased low at high observed
> values of family richness (see attached pdf).  I have tried diverse
> kinds of reasonable regression models mostly as in Zeileis, et al.
> (2007), as well as transforming my variables, both with only small
> improvements.
>
> Do you have suggestions for making a model that would perform better
> as a predictive model?
>
> Thank you for your time.
>
> Sincerely,
>
> Matthew Bowser
>
> STEP student
> USFWS Kenai National Wildlife Refuge
>
> M.Sc. student
>
> Reference
>
> Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for
> count data in R. Technical Report 53, Department of Statistics and
> Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL
> http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
>
> Code
>
> `data` <-
> structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4,
> 9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15,
> 12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13,
> 1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12,
> 5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8,
> 5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8,
> 7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8,
> 10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16,
> 3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12,
> 16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6,
> 4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3,
> 6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3,
> 0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15,
> 2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159,
> 159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161,
> 175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161,
> 161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176,
> 165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165,
> 165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167,
> 175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164,
> 167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173,
> 178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162,
> 173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166,
> 170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164,
> 170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172,
> 162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174,
> 166, 172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174,
> 172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170,
> 171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172,
> 171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160,
> 160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176,
> 176, 171, 176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166,
> 168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166,
> 166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253,
> 0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26,
> 0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194,
> 0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354,
> 0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312,
> 0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122,
> 0.289, 0.475, 0.048, 0.416, 0.481, 0.159, 0.238, 0.183, 0.28,
> 0.32, 0.288, 0.24, 0.287, 0.363, 0.367, 0.24, 0.55, 0.441, 0.34,
> 0.295, 0.23, 0.32, 0.184, 0.306, 0.232, 0.289, 0.341, 0.221,
> 0.333, 0.17, 0.139, 0.2, 0.204, 0.301, 0.253, -0.08, 0.309, 0.232,
> 0.23, 0.239, -0.12, 0.26, 0.285, 0.45, 0.348, 0.396, 0.311, 0.318,
> 0.31, 0.261, 0.441, 0.147, 0.283, 0.339, 0.224, 0.5, 0.265, 0.2,
> 0.287, 0.398, 0.116, 0.292, 0.045, 0.137, 0.542, 0.171, 0.38,
> 0.469, 0.325, 0.139, 0.166, 0.247, 0.253, 0.466, 0.26, 0.288,
> 0.34, 0.288, 0.26, 0.178, 0.274, 0.358, 0.285, 0.225, 0.162,
> 0.223, 0.301, -0.398, -0.2, 0.239, 0.228, 0.255, 0.166, 0.306,
> 0.28, 0.279, 0.208, 0.377, 0.413, 0.489, 0.417, 0.333, 0.208,
> 0.232, 0.431, 0.283, 0.241, 0.105, 0.18, -0.172, -0.374, 0.25,
> 0.043, 0.215, 0.204, 0.19, 0.177, -0.106, -0.143, 0.062, 0.462,
> 0.256, 0.229, 0.314, 0.415, 0.307, 0.238, -0.35, 0.34, 0.275,
> 0.097, 0.353, 0.214, 0.435, 0.055, -0.289, 0.239, 0.186, 0.135,
> 0.259, 0.268, 0.258, 0.032, 0.489, 0.389, 0.298, 0.164, 0.325,
> 0.254, -0.059, 0.524, 0.539, 0.25, 0.175, 0.326, 0.302, -0.047,
> -0.301, -0.149, 0.358, 0.495, 0.311, 0.235, 0.558, -0.156, 0,
> 0.146, 0.329, -0.069, -0.352, -0.356, -0.206, -0.179, 0.467,
> -0.325, 0.39, -0.399, -0.165, 0.267, -0.334, -0.17, 0.58, 0.228,
> 0.234, 0.351, 0.3, -0.018, 0.125, 0.176, 0.322, 0.246, 0.376,
> -0.185, 0.342, 0.142, -0.075, 0.186, 0.333, 0.112, 0.272, 0.277,
> 0.203, 0.37, 0.465, 0.425), VegS = c(14, 11, 18, 21, 31, 20,
> 11, 17, 10, 15, 27, 8, 17, 13, 16, 9, 10, 16, 10, 15, 11, 9,
> 14, 11, 11, 10, 24, 18, 12, 6, 25, 21, 25, 8, 14, 18, 11, 16,
> 20, 16, 10, 16, 18, 14, 13, 11, 15, 23, 11, 28, 12, 17, 12, 18,
> 10, 15, 7, 15, 9, 16, 18, 16, 20, 18, 12, 19, 16, 18, 20, 15,
> 24, 9, 15, 9, 16, 14, 17, 14, 7, 9, 9, 12, 13, 15, 14, 11, 17,
> 8, 14, 15, 12, 8, 10, 12, 8, 16, 15, 22, 16, 21, 10, 15, 20,
> 14, 27, 21, 19, 22, 21, 11, 13, 10, 13, 14, 9, 22, 22, 20, 12,
> 16, 20, 19, 26, 14, 13, 23, 14, 22, 19, 15, 28, 16, 20, 25, 10,
> 19, 0, 10, 8, 11, 17, 13, 17, 23, 37, 32, 19, 26, 12, 11, 24,
> 11, 21, 25, 8, 15, 21, 31, 17, 0, 12, 6, 23, 19, 29, 14, 9, 0,
> 18, 23, 20, 15, 15, 17, 27, 17, 2, 24, 17, 16, 26, 11, 23, 24,
> 10, 26, 21, 12, 20, 12, 29, 22, 20, 16, 41, 19, 27, 28, 10, 35,
> 28, 23, 14, 5, 23, 15, 17, 12, 11, 24, 11, 14, 7, 7, 27, 17,
> 15, 10, 16, 2, 11, 21, 18, 15, 8, 17, 10, 18, 15, 18, 31, 9,
> 14, 11, 19, 22, 8, 19, 17, 18, 25, 11, 17, 32, 25, 18, 21, 19,
> 35, 14, 29, 9, 28, 14), T = c(13, 15.4, 15.6, 12.3, 12.7, 13.3,
> 6, 13, 9.2, 17.8, 9.4, 14.2, 8.7, 14, 8.6, 13.9, 9.2, 15.1, 9.4,
> 16.5, 14, 11.5, 15.5, 13.3, 12.7, 14, 10.5, 14, 10.1, 16.7, 15.2,
> 11.2, 11.7, 17.9, 13.3, 13.9, 8.7, 16.7, 7.8, 7.9, 10.9, 15.5,
> 14, 14.1, 14.3, 13.3, 11.6, 16.5, 12.7, 12.6, 8.3, 9, 12.4, 15,
> 11.8, 14.1, 10.5, 12.4, 10.5, 17.5, 9.2, 16.3, 5.3, 5.9, 11.9,
> 8.9, 7.7, 15.2, 8.6, 13.2, 9.5, 15.8, 12.5, 13.8, 10.7, 10.5,
> 7.7, 11, 9.3, 14.6, 12, 15.4, 12.3, 14, 8.3, 19.8, 15.5, 14.3,
> 9, 7.6, 15.3, 12.8, 14.4, 14, 10.5, 8.9, 13.4, 12.8, 12.9, 11.2,
> 13.1, 10, 12.4, 15.4, 7.6, 14.9, 13.1, 11.1, 8.6, 13.6, 8.4,
> 11.5, 12.5, 15.6, 8.3, 9.6, 8.7, 9.7, 10.5, 12.8, 8.6, 12.7,
> 6.7, 8.5, 9.9, 8.3, 12.8, 9.8, 10.5, 10.7, 9.6, 11.1, 13.7, 9.5,
> 8.8, 3.4, 10.2, 3.5, 8.4, 11.9, 12.3, 10.2, 13.1, 8.4, 3.1, 7.2,
> 13.2, 7.6, 11.3, 13.5, 7.8, 5.5, 10.7, 4.8, 7.9, 13.5, 13.5,
> 2.1, 7.2, 9.7, 12, 9.2, 13.2, 12, 17.1, 7.9, 12.7, 11.8, 16,
> 6.4, 12.9, 8.1, 12.6, 10.2, 13.3, 8.5, 9.7, 9.9, 18.2, 11, 8.4,
> 1.5, 7.3, 10.6, 13.6, 8.4, 7.2, 13, 15.5, 9.7, 13.2, 5.9, 9.5,
> 10.4, 12.9, 2.6, 17.2, 15.4, 10.5, 6.7, 6.6, 7.6, 10.5, 15.6,
> 10.4, 5.1, 11, 9.7, 4.2, 3.6, 8.5, 11.5, 8.4, 6.9, 11, 10.4,
> 3.4, 3.2, 5.5, 2.4, 11.2, 2.6, 15.1, 16, 13.7, 10.5, 3.5, 13.4,
> 11.5, 12.3, 13.9, 14.5, 12.8, 16.8, 16.9, 13.5, 17.2, 12.5, 12.4,
> 11.8, 12, 10.9, 6.7, 10.9, 2.3, 5.2, 13.1, 12.1, 13.9, 12.9,
> 7.2, 12.5, 16, 11.7), Hemlock = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0), Alpine = c(0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0),
>    Snow = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
>    1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>    0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("D", "Day", "NDVI",
> "VegS", "T", "Hemlock", "Alpine", "Snow"), row.names = as.integer(c(NA,
> 254)), class = "data.frame")
>
>
> #Running a regression.
> library(MASS)
> fit <- glm.nb(D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow, data = data)
> summary(fit, correlation = FALSE)
>
> Call:
> glm.nb(formula = D ~ Day + NDVI + VegS + T + Hemlock + Alpine +
>    Snow, data = data, init.theta = 11.3494468596771, link = log)
>
> Deviance Residuals:
>    Min       1Q   Median       3Q      Max
> -3.7451  -0.7196  -0.1958   0.5389   2.7096
>
> Coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept) -2.882684   0.929598  -3.101 0.001929 **
> Day          0.020325   0.005540   3.669 0.000244 ***
> NDVI         1.353361   0.221471   6.111 9.91e-10 ***
> VegS         0.016731   0.004931   3.393 0.000691 ***
> T            0.074189   0.009491   7.817 5.42e-15 ***
> Hemlock     -0.588858   0.174980  -3.365 0.000765 ***
> Alpine      -0.452199   0.099296  -4.554 5.26e-06 ***
> Snow        -1.902610   0.735708  -2.586 0.009707 **
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for Negative Binomial(11.3494) family taken to be 1)
>
>    Null deviance: 515.16  on 253  degrees of freedom
> Residual deviance: 278.89  on 246  degrees of freedom
> AIC: 1300.1
>
> Number of Fisher Scoring iterations: 1
>
>
>              Theta:  11.35
>          Std. Err.:  2.71
>
>  2 x log-likelihood:  -1282.075
>
>
> #Plotting observed versus predicted values.
> pdf(file="ObsVPred.pdf", width=4, height=4, family="Times", pointsize=11)
> par(mar = c(5,5,1,1), pch=1)
> plot(data\$D, fit\$fitted.values, main="",
> ylab=expression(italic(D)[predicted]),
> xlab=expression(italic(D)[observed]))
> abline(a=0,b=1, lty=2)
> lines(lowess(data\$D, fit\$fitted.values))
> dev.off()
>
>
> #This appears to be a decent explanatory model, but as a predictive
> model it is systematically biased.  It is biased high at low observed
> values of D and biased low at high values observed values of D.
>

```