[R] Systematically biased count data regression model
paulandpen
paulandpen at optusnet.com.au
Thu Aug 9 18:12:00 CEST 2007
Matthew
it is possible that your results are suffering from heterogeneity, it may be
that your model performs well at the aggregate level and this would explain
good aggregate fit levels and decent predictive performance etc,
you could perhaps look at a 'latent' approach to modelling your data, in
other words, see if there is something unique in the cases/data/observations
in the lower and upper levels of the model (where prediction is poor) and
whether it is justified that you model these count areas as spearate and
unique from the generic aggregate level model (in other words there is
something unobserved/unmeasurted or latent etc in your popn of observations
that could causing some observations to behave uniquely overall
hth
thanks Paul
----- Original Message -----
From: "Matthew and Kim Bowser" <matthewnkim at gmail.com>
To: <r-help at stat.math.ethz.ch>
Sent: Friday, August 10, 2007 1:43 AM
Subject: [R] Systematically biased count data regression model
> Dear all,
>
> I am attempting to explain patterns of arthropod family richness
> (count data) using a regression model. It seems to be able to do a
> pretty good job as an explanatory model (i.e. demonstrating
> relationships between dependent and independent variables), but it has
> systematic problems as a predictive model: It is biased high at low
> observed values of family richness and biased low at high observed
> values of family richness (see attached pdf). I have tried diverse
> kinds of reasonable regression models mostly as in Zeileis, et al.
> (2007), as well as transforming my variables, both with only small
> improvements.
>
> Do you have suggestions for making a model that would perform better
> as a predictive model?
>
> Thank you for your time.
>
> Sincerely,
>
> Matthew Bowser
>
> STEP student
> USFWS Kenai National Wildlife Refuge
> Soldotna, Alaska, USA
>
> M.Sc. student
> University of Alaska Fairbanks
> Fairbankse, Alaska, USA
>
> Reference
>
> Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for
> count data in R. Technical Report 53, Department of Statistics and
> Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL
> http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
>
> Code
>
> `data` <-
> structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4,
> 9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15,
> 12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13,
> 1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12,
> 5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8,
> 5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8,
> 7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8,
> 10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16,
> 3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12,
> 16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6,
> 4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3,
> 6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3,
> 0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15,
> 2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159,
> 159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161,
> 175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161,
> 161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176,
> 165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165,
> 165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167,
> 175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164,
> 167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173,
> 178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162,
> 173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166,
> 170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164,
> 170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172,
> 162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174,
> 166, 172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174,
> 172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170,
> 171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172,
> 171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160,
> 160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176,
> 176, 171, 176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166,
> 168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166,
> 166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253,
> 0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26,
> 0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194,
> 0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354,
> 0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312,
> 0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122,
> 0.289, 0.475, 0.048, 0.416, 0.481, 0.159, 0.238, 0.183, 0.28,
> 0.32, 0.288, 0.24, 0.287, 0.363, 0.367, 0.24, 0.55, 0.441, 0.34,
> 0.295, 0.23, 0.32, 0.184, 0.306, 0.232, 0.289, 0.341, 0.221,
> 0.333, 0.17, 0.139, 0.2, 0.204, 0.301, 0.253, -0.08, 0.309, 0.232,
> 0.23, 0.239, -0.12, 0.26, 0.285, 0.45, 0.348, 0.396, 0.311, 0.318,
> 0.31, 0.261, 0.441, 0.147, 0.283, 0.339, 0.224, 0.5, 0.265, 0.2,
> 0.287, 0.398, 0.116, 0.292, 0.045, 0.137, 0.542, 0.171, 0.38,
> 0.469, 0.325, 0.139, 0.166, 0.247, 0.253, 0.466, 0.26, 0.288,
> 0.34, 0.288, 0.26, 0.178, 0.274, 0.358, 0.285, 0.225, 0.162,
> 0.223, 0.301, -0.398, -0.2, 0.239, 0.228, 0.255, 0.166, 0.306,
> 0.28, 0.279, 0.208, 0.377, 0.413, 0.489, 0.417, 0.333, 0.208,
> 0.232, 0.431, 0.283, 0.241, 0.105, 0.18, -0.172, -0.374, 0.25,
> 0.043, 0.215, 0.204, 0.19, 0.177, -0.106, -0.143, 0.062, 0.462,
> 0.256, 0.229, 0.314, 0.415, 0.307, 0.238, -0.35, 0.34, 0.275,
> 0.097, 0.353, 0.214, 0.435, 0.055, -0.289, 0.239, 0.186, 0.135,
> 0.259, 0.268, 0.258, 0.032, 0.489, 0.389, 0.298, 0.164, 0.325,
> 0.254, -0.059, 0.524, 0.539, 0.25, 0.175, 0.326, 0.302, -0.047,
> -0.301, -0.149, 0.358, 0.495, 0.311, 0.235, 0.558, -0.156, 0,
> 0.146, 0.329, -0.069, -0.352, -0.356, -0.206, -0.179, 0.467,
> -0.325, 0.39, -0.399, -0.165, 0.267, -0.334, -0.17, 0.58, 0.228,
> 0.234, 0.351, 0.3, -0.018, 0.125, 0.176, 0.322, 0.246, 0.376,
> -0.185, 0.342, 0.142, -0.075, 0.186, 0.333, 0.112, 0.272, 0.277,
> 0.203, 0.37, 0.465, 0.425), VegS = c(14, 11, 18, 21, 31, 20,
> 11, 17, 10, 15, 27, 8, 17, 13, 16, 9, 10, 16, 10, 15, 11, 9,
> 14, 11, 11, 10, 24, 18, 12, 6, 25, 21, 25, 8, 14, 18, 11, 16,
> 20, 16, 10, 16, 18, 14, 13, 11, 15, 23, 11, 28, 12, 17, 12, 18,
> 10, 15, 7, 15, 9, 16, 18, 16, 20, 18, 12, 19, 16, 18, 20, 15,
> 24, 9, 15, 9, 16, 14, 17, 14, 7, 9, 9, 12, 13, 15, 14, 11, 17,
> 8, 14, 15, 12, 8, 10, 12, 8, 16, 15, 22, 16, 21, 10, 15, 20,
> 14, 27, 21, 19, 22, 21, 11, 13, 10, 13, 14, 9, 22, 22, 20, 12,
> 16, 20, 19, 26, 14, 13, 23, 14, 22, 19, 15, 28, 16, 20, 25, 10,
> 19, 0, 10, 8, 11, 17, 13, 17, 23, 37, 32, 19, 26, 12, 11, 24,
> 11, 21, 25, 8, 15, 21, 31, 17, 0, 12, 6, 23, 19, 29, 14, 9, 0,
> 18, 23, 20, 15, 15, 17, 27, 17, 2, 24, 17, 16, 26, 11, 23, 24,
> 10, 26, 21, 12, 20, 12, 29, 22, 20, 16, 41, 19, 27, 28, 10, 35,
> 28, 23, 14, 5, 23, 15, 17, 12, 11, 24, 11, 14, 7, 7, 27, 17,
> 15, 10, 16, 2, 11, 21, 18, 15, 8, 17, 10, 18, 15, 18, 31, 9,
> 14, 11, 19, 22, 8, 19, 17, 18, 25, 11, 17, 32, 25, 18, 21, 19,
> 35, 14, 29, 9, 28, 14), T = c(13, 15.4, 15.6, 12.3, 12.7, 13.3,
> 6, 13, 9.2, 17.8, 9.4, 14.2, 8.7, 14, 8.6, 13.9, 9.2, 15.1, 9.4,
> 16.5, 14, 11.5, 15.5, 13.3, 12.7, 14, 10.5, 14, 10.1, 16.7, 15.2,
> 11.2, 11.7, 17.9, 13.3, 13.9, 8.7, 16.7, 7.8, 7.9, 10.9, 15.5,
> 14, 14.1, 14.3, 13.3, 11.6, 16.5, 12.7, 12.6, 8.3, 9, 12.4, 15,
> 11.8, 14.1, 10.5, 12.4, 10.5, 17.5, 9.2, 16.3, 5.3, 5.9, 11.9,
> 8.9, 7.7, 15.2, 8.6, 13.2, 9.5, 15.8, 12.5, 13.8, 10.7, 10.5,
> 7.7, 11, 9.3, 14.6, 12, 15.4, 12.3, 14, 8.3, 19.8, 15.5, 14.3,
> 9, 7.6, 15.3, 12.8, 14.4, 14, 10.5, 8.9, 13.4, 12.8, 12.9, 11.2,
> 13.1, 10, 12.4, 15.4, 7.6, 14.9, 13.1, 11.1, 8.6, 13.6, 8.4,
> 11.5, 12.5, 15.6, 8.3, 9.6, 8.7, 9.7, 10.5, 12.8, 8.6, 12.7,
> 6.7, 8.5, 9.9, 8.3, 12.8, 9.8, 10.5, 10.7, 9.6, 11.1, 13.7, 9.5,
> 8.8, 3.4, 10.2, 3.5, 8.4, 11.9, 12.3, 10.2, 13.1, 8.4, 3.1, 7.2,
> 13.2, 7.6, 11.3, 13.5, 7.8, 5.5, 10.7, 4.8, 7.9, 13.5, 13.5,
> 2.1, 7.2, 9.7, 12, 9.2, 13.2, 12, 17.1, 7.9, 12.7, 11.8, 16,
> 6.4, 12.9, 8.1, 12.6, 10.2, 13.3, 8.5, 9.7, 9.9, 18.2, 11, 8.4,
> 1.5, 7.3, 10.6, 13.6, 8.4, 7.2, 13, 15.5, 9.7, 13.2, 5.9, 9.5,
> 10.4, 12.9, 2.6, 17.2, 15.4, 10.5, 6.7, 6.6, 7.6, 10.5, 15.6,
> 10.4, 5.1, 11, 9.7, 4.2, 3.6, 8.5, 11.5, 8.4, 6.9, 11, 10.4,
> 3.4, 3.2, 5.5, 2.4, 11.2, 2.6, 15.1, 16, 13.7, 10.5, 3.5, 13.4,
> 11.5, 12.3, 13.9, 14.5, 12.8, 16.8, 16.9, 13.5, 17.2, 12.5, 12.4,
> 11.8, 12, 10.9, 6.7, 10.9, 2.3, 5.2, 13.1, 12.1, 13.9, 12.9,
> 7.2, 12.5, 16, 11.7), Hemlock = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0), Alpine = c(0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0),
> Snow = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("D", "Day", "NDVI",
> "VegS", "T", "Hemlock", "Alpine", "Snow"), row.names = as.integer(c(NA,
> 254)), class = "data.frame")
>
>
> #Running a regression.
> library(MASS)
> fit <- glm.nb(D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow, data =
> data)
> summary(fit, correlation = FALSE)
>
> Call:
> glm.nb(formula = D ~ Day + NDVI + VegS + T + Hemlock + Alpine +
> Snow, data = data, init.theta = 11.3494468596771, link = log)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -3.7451 -0.7196 -0.1958 0.5389 2.7096
>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -2.882684 0.929598 -3.101 0.001929 **
> Day 0.020325 0.005540 3.669 0.000244 ***
> NDVI 1.353361 0.221471 6.111 9.91e-10 ***
> VegS 0.016731 0.004931 3.393 0.000691 ***
> T 0.074189 0.009491 7.817 5.42e-15 ***
> Hemlock -0.588858 0.174980 -3.365 0.000765 ***
> Alpine -0.452199 0.099296 -4.554 5.26e-06 ***
> Snow -1.902610 0.735708 -2.586 0.009707 **
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for Negative Binomial(11.3494) family taken to be 1)
>
> Null deviance: 515.16 on 253 degrees of freedom
> Residual deviance: 278.89 on 246 degrees of freedom
> AIC: 1300.1
>
> Number of Fisher Scoring iterations: 1
>
>
> Theta: 11.35
> Std. Err.: 2.71
>
> 2 x log-likelihood: -1282.075
>
>
> #Plotting observed versus predicted values.
> pdf(file="ObsVPred.pdf", width=4, height=4, family="Times", pointsize=11)
> par(mar = c(5,5,1,1), pch=1)
> plot(data$D, fit$fitted.values, main="",
> ylab=expression(italic(D)[predicted]),
> xlab=expression(italic(D)[observed]))
> abline(a=0,b=1, lty=2)
> lines(lowess(data$D, fit$fitted.values))
> dev.off()
>
>
> #This appears to be a decent explanatory model, but as a predictive
> model it is systematically biased. It is biased high at low observed
> values of D and biased low at high values observed values of D.
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
More information about the R-help
mailing list