# [R] Systematically biased count data regression model

Gabor Grothendieck ggrothendieck at gmail.com
Fri Aug 10 04:10:08 CEST 2007

```I guess I should not have been so quick to make that conclusion since
it seems that 74% of the values in the holdout set are FALSE so simply
guessing FALSE for each one would give us 74% accuracy:

> table(DD[201:254])

FALSE  TRUE
40    14

> 40/54
[1] 0.7407407

On 8/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Perhaps you don't really need to predict the precise count.
> Maybe its good enough to predict whether the count is above
> or below average.  In that case the model is 74% correct on a
> holdout sample of the last 54 points based on a model of the
> first 200 points.
>
> > # create model on first 200 and predict on rest
> > DD <- data\$D > mean(data\$D)
> > mod <- glm(DD ~., data[-1], family = binomial, subset = 1:200)
> > tab <- table(predict(mod, data[201:254,-1], type = "resp") > .5, DD[201:254])
> > sum(tab * diag(2)) / sum(tab)
> [1] 0.7407407
>
>
> On 8/9/07, Matthew and Kim Bowser <matthewnkim at gmail.com> wrote:
> > Dear all,
> >
> > I am attempting to explain patterns of arthropod family richness
> > (count data) using a regression model.  It seems to be able to do a
> > pretty good job as an explanatory model (i.e. demonstrating
> > relationships between dependent and independent variables), but it has
> > systematic problems as a predictive model:  It is biased high at low
> > observed values of family richness and biased low at high observed
> > values of family richness (see attached pdf).  I have tried diverse
> > kinds of reasonable regression models mostly as in Zeileis, et al.
> > (2007), as well as transforming my variables, both with only small
> > improvements.
> >
> > Do you have suggestions for making a model that would perform better
> > as a predictive model?
> >
> > Thank you for your time.
> >
> > Sincerely,
> >
> > Matthew Bowser
> >
> > STEP student
> > USFWS Kenai National Wildlife Refuge
> >
> > M.Sc. student
> > University of Alaska Fairbanks
> >
> > Reference
> >
> > Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for
> > count data in R. Technical Report 53, Department of Statistics and
> > Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL
> > http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
> >
> > Code
> >
> > `data` <-
> > structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4,
> > 9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15,
> > 12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13,
> > 1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12,
> > 5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8,
> > 5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8,
> > 7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8,
> > 10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16,
> > 3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12,
> > 16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6,
> > 4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3,
> > 6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3,
> > 0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15,
> > 2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159,
> > 159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161,
> > 175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161,
> > 161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176,
> > 165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165,
> > 165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167,
> > 175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164,
> > 167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173,
> > 178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162,
> > 173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166,
> > 170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164,
> > 170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172,
> > 162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174,
> > 166, 172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174,
> > 172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170,
> > 171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172,
> > 171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160,
> > 160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176,
> > 176, 171, 176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166,
> > 168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166,
> > 166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253,
> > 0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26,
> > 0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194,
> > 0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354,
> > 0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312,
> > 0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122,
> > 0.289, 0.475, 0.048, 0.416, 0.481, 0.159, 0.238, 0.183, 0.28,
> > 0.32, 0.288, 0.24, 0.287, 0.363, 0.367, 0.24, 0.55, 0.441, 0.34,
> > 0.295, 0.23, 0.32, 0.184, 0.306, 0.232, 0.289, 0.341, 0.221,
> > 0.333, 0.17, 0.139, 0.2, 0.204, 0.301, 0.253, -0.08, 0.309, 0.232,
> > 0.23, 0.239, -0.12, 0.26, 0.285, 0.45, 0.348, 0.396, 0.311, 0.318,
> > 0.31, 0.261, 0.441, 0.147, 0.283, 0.339, 0.224, 0.5, 0.265, 0.2,
> > 0.287, 0.398, 0.116, 0.292, 0.045, 0.137, 0.542, 0.171, 0.38,
> > 0.469, 0.325, 0.139, 0.166, 0.247, 0.253, 0.466, 0.26, 0.288,
> > 0.34, 0.288, 0.26, 0.178, 0.274, 0.358, 0.285, 0.225, 0.162,
> > 0.223, 0.301, -0.398, -0.2, 0.239, 0.228, 0.255, 0.166, 0.306,
> > 0.28, 0.279, 0.208, 0.377, 0.413, 0.489, 0.417, 0.333, 0.208,
> > 0.232, 0.431, 0.283, 0.241, 0.105, 0.18, -0.172, -0.374, 0.25,
> > 0.043, 0.215, 0.204, 0.19, 0.177, -0.106, -0.143, 0.062, 0.462,
> > 0.256, 0.229, 0.314, 0.415, 0.307, 0.238, -0.35, 0.34, 0.275,
> > 0.097, 0.353, 0.214, 0.435, 0.055, -0.289, 0.239, 0.186, 0.135,
> > 0.259, 0.268, 0.258, 0.032, 0.489, 0.389, 0.298, 0.164, 0.325,
> > 0.254, -0.059, 0.524, 0.539, 0.25, 0.175, 0.326, 0.302, -0.047,
> > -0.301, -0.149, 0.358, 0.495, 0.311, 0.235, 0.558, -0.156, 0,
> > 0.146, 0.329, -0.069, -0.352, -0.356, -0.206, -0.179, 0.467,
> > -0.325, 0.39, -0.399, -0.165, 0.267, -0.334, -0.17, 0.58, 0.228,
> > 0.234, 0.351, 0.3, -0.018, 0.125, 0.176, 0.322, 0.246, 0.376,
> > -0.185, 0.342, 0.142, -0.075, 0.186, 0.333, 0.112, 0.272, 0.277,
> > 0.203, 0.37, 0.465, 0.425), VegS = c(14, 11, 18, 21, 31, 20,
> > 11, 17, 10, 15, 27, 8, 17, 13, 16, 9, 10, 16, 10, 15, 11, 9,
> > 14, 11, 11, 10, 24, 18, 12, 6, 25, 21, 25, 8, 14, 18, 11, 16,
> > 20, 16, 10, 16, 18, 14, 13, 11, 15, 23, 11, 28, 12, 17, 12, 18,
> > 10, 15, 7, 15, 9, 16, 18, 16, 20, 18, 12, 19, 16, 18, 20, 15,
> > 24, 9, 15, 9, 16, 14, 17, 14, 7, 9, 9, 12, 13, 15, 14, 11, 17,
> > 8, 14, 15, 12, 8, 10, 12, 8, 16, 15, 22, 16, 21, 10, 15, 20,
> > 14, 27, 21, 19, 22, 21, 11, 13, 10, 13, 14, 9, 22, 22, 20, 12,
> > 16, 20, 19, 26, 14, 13, 23, 14, 22, 19, 15, 28, 16, 20, 25, 10,
> > 19, 0, 10, 8, 11, 17, 13, 17, 23, 37, 32, 19, 26, 12, 11, 24,
> > 11, 21, 25, 8, 15, 21, 31, 17, 0, 12, 6, 23, 19, 29, 14, 9, 0,
> > 18, 23, 20, 15, 15, 17, 27, 17, 2, 24, 17, 16, 26, 11, 23, 24,
> > 10, 26, 21, 12, 20, 12, 29, 22, 20, 16, 41, 19, 27, 28, 10, 35,
> > 28, 23, 14, 5, 23, 15, 17, 12, 11, 24, 11, 14, 7, 7, 27, 17,
> > 15, 10, 16, 2, 11, 21, 18, 15, 8, 17, 10, 18, 15, 18, 31, 9,
> > 14, 11, 19, 22, 8, 19, 17, 18, 25, 11, 17, 32, 25, 18, 21, 19,
> > 35, 14, 29, 9, 28, 14), T = c(13, 15.4, 15.6, 12.3, 12.7, 13.3,
> > 6, 13, 9.2, 17.8, 9.4, 14.2, 8.7, 14, 8.6, 13.9, 9.2, 15.1, 9.4,
> > 16.5, 14, 11.5, 15.5, 13.3, 12.7, 14, 10.5, 14, 10.1, 16.7, 15.2,
> > 11.2, 11.7, 17.9, 13.3, 13.9, 8.7, 16.7, 7.8, 7.9, 10.9, 15.5,
> > 14, 14.1, 14.3, 13.3, 11.6, 16.5, 12.7, 12.6, 8.3, 9, 12.4, 15,
> > 11.8, 14.1, 10.5, 12.4, 10.5, 17.5, 9.2, 16.3, 5.3, 5.9, 11.9,
> > 8.9, 7.7, 15.2, 8.6, 13.2, 9.5, 15.8, 12.5, 13.8, 10.7, 10.5,
> > 7.7, 11, 9.3, 14.6, 12, 15.4, 12.3, 14, 8.3, 19.8, 15.5, 14.3,
> > 9, 7.6, 15.3, 12.8, 14.4, 14, 10.5, 8.9, 13.4, 12.8, 12.9, 11.2,
> > 13.1, 10, 12.4, 15.4, 7.6, 14.9, 13.1, 11.1, 8.6, 13.6, 8.4,
> > 11.5, 12.5, 15.6, 8.3, 9.6, 8.7, 9.7, 10.5, 12.8, 8.6, 12.7,
> > 6.7, 8.5, 9.9, 8.3, 12.8, 9.8, 10.5, 10.7, 9.6, 11.1, 13.7, 9.5,
> > 8.8, 3.4, 10.2, 3.5, 8.4, 11.9, 12.3, 10.2, 13.1, 8.4, 3.1, 7.2,
> > 13.2, 7.6, 11.3, 13.5, 7.8, 5.5, 10.7, 4.8, 7.9, 13.5, 13.5,
> > 2.1, 7.2, 9.7, 12, 9.2, 13.2, 12, 17.1, 7.9, 12.7, 11.8, 16,
> > 6.4, 12.9, 8.1, 12.6, 10.2, 13.3, 8.5, 9.7, 9.9, 18.2, 11, 8.4,
> > 1.5, 7.3, 10.6, 13.6, 8.4, 7.2, 13, 15.5, 9.7, 13.2, 5.9, 9.5,
> > 10.4, 12.9, 2.6, 17.2, 15.4, 10.5, 6.7, 6.6, 7.6, 10.5, 15.6,
> > 10.4, 5.1, 11, 9.7, 4.2, 3.6, 8.5, 11.5, 8.4, 6.9, 11, 10.4,
> > 3.4, 3.2, 5.5, 2.4, 11.2, 2.6, 15.1, 16, 13.7, 10.5, 3.5, 13.4,
> > 11.5, 12.3, 13.9, 14.5, 12.8, 16.8, 16.9, 13.5, 17.2, 12.5, 12.4,
> > 11.8, 12, 10.9, 6.7, 10.9, 2.3, 5.2, 13.1, 12.1, 13.9, 12.9,
> > 7.2, 12.5, 16, 11.7), Hemlock = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
> > 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0), Alpine = c(0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
> > 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
> > 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
> > 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
> > 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
> > 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
> > 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0),
> >   Snow = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
> >   1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> >   0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("D", "Day", "NDVI",
> > "VegS", "T", "Hemlock", "Alpine", "Snow"), row.names = as.integer(c(NA,
> > 254)), class = "data.frame")
> >
> >
> > #Running a regression.
> > library(MASS)
> > fit <- glm.nb(D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow, data = data)
> > summary(fit, correlation = FALSE)
> >
> > Call:
> > glm.nb(formula = D ~ Day + NDVI + VegS + T + Hemlock + Alpine +
> >   Snow, data = data, init.theta = 11.3494468596771, link = log)
> >
> > Deviance Residuals:
> >   Min       1Q   Median       3Q      Max
> > -3.7451  -0.7196  -0.1958   0.5389   2.7096
> >
> > Coefficients:
> >            Estimate Std. Error z value Pr(>|z|)
> > (Intercept) -2.882684   0.929598  -3.101 0.001929 **
> > Day          0.020325   0.005540   3.669 0.000244 ***
> > NDVI         1.353361   0.221471   6.111 9.91e-10 ***
> > VegS         0.016731   0.004931   3.393 0.000691 ***
> > T            0.074189   0.009491   7.817 5.42e-15 ***
> > Hemlock     -0.588858   0.174980  -3.365 0.000765 ***
> > Alpine      -0.452199   0.099296  -4.554 5.26e-06 ***
> > Snow        -1.902610   0.735708  -2.586 0.009707 **
> > ---
> > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >
> > (Dispersion parameter for Negative Binomial(11.3494) family taken to be 1)
> >
> >   Null deviance: 515.16  on 253  degrees of freedom
> > Residual deviance: 278.89  on 246  degrees of freedom
> > AIC: 1300.1
> >
> > Number of Fisher Scoring iterations: 1
> >
> >
> >             Theta:  11.35
> >         Std. Err.:  2.71
> >
> >  2 x log-likelihood:  -1282.075
> >
> >
> > #Plotting observed versus predicted values.
> > pdf(file="ObsVPred.pdf", width=4, height=4, family="Times", pointsize=11)
> > par(mar = c(5,5,1,1), pch=1)
> > plot(data\$D, fit\$fitted.values, main="",
> > ylab=expression(italic(D)[predicted]),
> > xlab=expression(italic(D)[observed]))
> > abline(a=0,b=1, lty=2)
> > lines(lowess(data\$D, fit\$fitted.values))
> > dev.off()
> >
> >
> > #This appears to be a decent explanatory model, but as a predictive
> > model it is systematically biased.  It is biased high at low observed
> > values of D and biased low at high values observed values of D.
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help