[R] Systematically biased count data regression model
Matthew and Kim Bowser
matthewnkim at gmail.com
Mon Aug 13 18:51:44 CEST 2007
Dear all,
Thank you for your helpful comments. I have acquired and read
Steyerberg (2001). I may try some of the methods from that paper.
This did prompt a thorough examination of the data. Sure enough, at my
worst outlier, where little diversity was observed although the NDVI
measurement was very high, the site had been stripped of vegetation by
an avalanche between the time of the NDVI data and the arthropod
sampling.
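
One of the methods discussed there, a heuristic uniform shrinkage of
the regression coefficients, might look roughly like this applied to
the model and data quoted below (only a sketch, not validated on these
data; the shrinkage factor is the usual (chi-square - df) / chi-square
heuristic):

library(MASS)
fit <- glm.nb(D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow,
              data = data)

#Heuristic shrinkage factor from the model chi-square and its df.
chi2 <- fit$null.deviance - fit$deviance
p    <- length(coef(fit)) - 1
s    <- (chi2 - p) / chi2

#Shrink the slopes, then re-estimate the intercept with the shrunken
#linear predictor supplied as an offset.
lp.shrunk   <- as.vector(model.matrix(fit)[, -1] %*% (coef(fit)[-1] * s))
refit       <- glm.nb(D ~ offset(lp.shrunk), data = data)
pred.shrunk <- fitted(refit)
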
Sincerely,
Matthew Bowser
On 8/9/07, Steven McKinney <smckinney at bccrc.ca> wrote:
> Hi Matthew,
>
> You may be experiencing the classic
> 'regression towards the mean' phenomenon,
> in which case shrinkage estimation may help
> with prediction (extremely low and high values
> need to be shrunk back towards the mean).
>
> Here's a reference that discusses the issue
> in a manner somewhat related to your situation,
> and it has plenty of good references.
>
>
> Application of Shrinkage Techniques in Logistic Regression Analysis: A Case Study
>
> E. W. Steyerberg
>
> Statistica Neerlandica, 2001, vol. 55, issue 1, pages 76-88
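>
> To make the idea concrete with made-up numbers: with an overall mean
> of 8 and a shrinkage factor of 0.8 applied on the log (linear
> predictor) scale, fitted values of 2 and 18 get pulled back towards
> the mean before being used as predictions:
>
> mu <- 8; s <- 0.8                          # hypothetical values
> pred <- c(2, 18)
> exp(log(mu) + s * (log(pred) - log(mu)))   # roughly 2.6 and 15.3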
>
>
>
>
> Steven McKinney
>
> Statistician
> Molecular Oncology and Breast Cancer Program
> British Columbia Cancer Research Centre
>
> email: smckinney +at+ bccrc +dot+ ca
>
> tel: 604-675-8000 x7561
>
> BCCRC
> Molecular Oncology
> 675 West 10th Ave, Floor 4
> Vancouver B.C.
> V5Z 1L3
> Canada
>
>
>
>
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch on behalf of Matthew and Kim Bowser
> Sent: Thu 8/9/2007 8:43 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Systematically biased count data regression model
>
> Dear all,
>
> I am attempting to explain patterns of arthropod family richness
> (count data) using a regression model. It seems to do a pretty good
> job as an explanatory model (i.e., demonstrating relationships between
> the dependent and independent variables), but it has systematic
> problems as a predictive model: it is biased high at low observed
> values of family richness and biased low at high observed values of
> family richness (see attached pdf). I have tried diverse kinds of
> reasonable regression models, mostly as in Zeileis et al. (2007), as
> well as transforming my variables, both with only small improvements.
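>
> As a point of reference, the main count-data alternatives compared in
> Zeileis et al. (2007) can be fit along roughly these lines, using the
> data object given in the Code section below (a sketch; the pscl
> package is assumed to be installed, and the hurdle specification is
> only illustrative):
>
> library(pscl)   # hurdle() and zeroinfl(), as used in Zeileis et al. (2007)
> f <- D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow
> fit.pois   <- glm(f, data = data, family = poisson)
> fit.quasi  <- glm(f, data = data, family = quasipoisson)
> fit.hurdle <- hurdle(f, data = data, dist = "negbin")
> AIC(fit.pois, fit.hurdle)   # quasi-Poisson does not provide an AIC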
>
> Do you have suggestions for making a model that would perform better
> as a predictive model?
>
> Thank you for your time.
>
> Sincerely,
>
> Matthew Bowser
>
> STEP student
> USFWS Kenai National Wildlife Refuge
> Soldotna, Alaska, USA
>
> M.Sc. student
> University of Alaska Fairbanks
> Fairbanks, Alaska, USA
>
> Reference
>
> Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for
> count data in R. Technical Report 53, Department of Statistics and
> Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL
> http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
>
> Code
>
> `data` <-
> structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4,
> 9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15,
> 12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13,
> 1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12,
> 5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8,
> 5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8,
> 7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8,
> 10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16,
> 3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12,
> 16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6,
> 4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3,
> 6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3,
> 0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15,
> 2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159,
> 159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161,
> 175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161,
> 161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176,
> 165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165,
> 165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167,
> 175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164,
> 167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173,
> 178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162,
> 173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166,
> 170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164,
> 170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172,
> 162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174,
> 166, 172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174,
> 172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170,
> 171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172,
> 171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160,
> 160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176,
> 176, 171, 176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166,
> 168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166,
> 166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253,
> 0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26,
> 0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194,
> 0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354,
> 0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312,
> 0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122,
> 0.289, 0.475, 0.048, 0.416, 0.481, 0.159, 0.238, 0.183, 0.28,
> 0.32, 0.288, 0.24, 0.287, 0.363, 0.367, 0.24, 0.55, 0.441, 0.34,
> 0.295, 0.23, 0.32, 0.184, 0.306, 0.232, 0.289, 0.341, 0.221,
> 0.333, 0.17, 0.139, 0.2, 0.204, 0.301, 0.253, -0.08, 0.309, 0.232,
> 0.23, 0.239, -0.12, 0.26, 0.285, 0.45, 0.348, 0.396, 0.311, 0.318,
> 0.31, 0.261, 0.441, 0.147, 0.283, 0.339, 0.224, 0.5, 0.265, 0.2,
> 0.287, 0.398, 0.116, 0.292, 0.045, 0.137, 0.542, 0.171, 0.38,
> 0.469, 0.325, 0.139, 0.166, 0.247, 0.253, 0.466, 0.26, 0.288,
> 0.34, 0.288, 0.26, 0.178, 0.274, 0.358, 0.285, 0.225, 0.162,
> 0.223, 0.301, -0.398, -0.2, 0.239, 0.228, 0.255, 0.166, 0.306,
> 0.28, 0.279, 0.208, 0.377, 0.413, 0.489, 0.417, 0.333, 0.208,
> 0.232, 0.431, 0.283, 0.241, 0.105, 0.18, -0.172, -0.374, 0.25,
> 0.043, 0.215, 0.204, 0.19, 0.177, -0.106, -0.143, 0.062, 0.462,
> 0.256, 0.229, 0.314, 0.415, 0.307, 0.238, -0.35, 0.34, 0.275,
> 0.097, 0.353, 0.214, 0.435, 0.055, -0.289, 0.239, 0.186, 0.135,
> 0.259, 0.268, 0.258, 0.032, 0.489, 0.389, 0.298, 0.164, 0.325,
> 0.254, -0.059, 0.524, 0.539, 0.25, 0.175, 0.326, 0.302, -0.047,
> -0.301, -0.149, 0.358, 0.495, 0.311, 0.235, 0.558, -0.156, 0,
> 0.146, 0.329, -0.069, -0.352, -0.356, -0.206, -0.179, 0.467,
> -0.325, 0.39, -0.399, -0.165, 0.267, -0.334, -0.17, 0.58, 0.228,
> 0.234, 0.351, 0.3, -0.018, 0.125, 0.176, 0.322, 0.246, 0.376,
> -0.185, 0.342, 0.142, -0.075, 0.186, 0.333, 0.112, 0.272, 0.277,
> 0.203, 0.37, 0.465, 0.425), VegS = c(14, 11, 18, 21, 31, 20,
> 11, 17, 10, 15, 27, 8, 17, 13, 16, 9, 10, 16, 10, 15, 11, 9,
> 14, 11, 11, 10, 24, 18, 12, 6, 25, 21, 25, 8, 14, 18, 11, 16,
> 20, 16, 10, 16, 18, 14, 13, 11, 15, 23, 11, 28, 12, 17, 12, 18,
> 10, 15, 7, 15, 9, 16, 18, 16, 20, 18, 12, 19, 16, 18, 20, 15,
> 24, 9, 15, 9, 16, 14, 17, 14, 7, 9, 9, 12, 13, 15, 14, 11, 17,
> 8, 14, 15, 12, 8, 10, 12, 8, 16, 15, 22, 16, 21, 10, 15, 20,
> 14, 27, 21, 19, 22, 21, 11, 13, 10, 13, 14, 9, 22, 22, 20, 12,
> 16, 20, 19, 26, 14, 13, 23, 14, 22, 19, 15, 28, 16, 20, 25, 10,
> 19, 0, 10, 8, 11, 17, 13, 17, 23, 37, 32, 19, 26, 12, 11, 24,
> 11, 21, 25, 8, 15, 21, 31, 17, 0, 12, 6, 23, 19, 29, 14, 9, 0,
> 18, 23, 20, 15, 15, 17, 27, 17, 2, 24, 17, 16, 26, 11, 23, 24,
> 10, 26, 21, 12, 20, 12, 29, 22, 20, 16, 41, 19, 27, 28, 10, 35,
> 28, 23, 14, 5, 23, 15, 17, 12, 11, 24, 11, 14, 7, 7, 27, 17,
> 15, 10, 16, 2, 11, 21, 18, 15, 8, 17, 10, 18, 15, 18, 31, 9,
> 14, 11, 19, 22, 8, 19, 17, 18, 25, 11, 17, 32, 25, 18, 21, 19,
> 35, 14, 29, 9, 28, 14), T = c(13, 15.4, 15.6, 12.3, 12.7, 13.3,
> 6, 13, 9.2, 17.8, 9.4, 14.2, 8.7, 14, 8.6, 13.9, 9.2, 15.1, 9.4,
> 16.5, 14, 11.5, 15.5, 13.3, 12.7, 14, 10.5, 14, 10.1, 16.7, 15.2,
> 11.2, 11.7, 17.9, 13.3, 13.9, 8.7, 16.7, 7.8, 7.9, 10.9, 15.5,
> 14, 14.1, 14.3, 13.3, 11.6, 16.5, 12.7, 12.6, 8.3, 9, 12.4, 15,
> 11.8, 14.1, 10.5, 12.4, 10.5, 17.5, 9.2, 16.3, 5.3, 5.9, 11.9,
> 8.9, 7.7, 15.2, 8.6, 13.2, 9.5, 15.8, 12.5, 13.8, 10.7, 10.5,
> 7.7, 11, 9.3, 14.6, 12, 15.4, 12.3, 14, 8.3, 19.8, 15.5, 14.3,
> 9, 7.6, 15.3, 12.8, 14.4, 14, 10.5, 8.9, 13.4, 12.8, 12.9, 11.2,
> 13.1, 10, 12.4, 15.4, 7.6, 14.9, 13.1, 11.1, 8.6, 13.6, 8.4,
> 11.5, 12.5, 15.6, 8.3, 9.6, 8.7, 9.7, 10.5, 12.8, 8.6, 12.7,
> 6.7, 8.5, 9.9, 8.3, 12.8, 9.8, 10.5, 10.7, 9.6, 11.1, 13.7, 9.5,
> 8.8, 3.4, 10.2, 3.5, 8.4, 11.9, 12.3, 10.2, 13.1, 8.4, 3.1, 7.2,
> 13.2, 7.6, 11.3, 13.5, 7.8, 5.5, 10.7, 4.8, 7.9, 13.5, 13.5,
> 2.1, 7.2, 9.7, 12, 9.2, 13.2, 12, 17.1, 7.9, 12.7, 11.8, 16,
> 6.4, 12.9, 8.1, 12.6, 10.2, 13.3, 8.5, 9.7, 9.9, 18.2, 11, 8.4,
> 1.5, 7.3, 10.6, 13.6, 8.4, 7.2, 13, 15.5, 9.7, 13.2, 5.9, 9.5,
> 10.4, 12.9, 2.6, 17.2, 15.4, 10.5, 6.7, 6.6, 7.6, 10.5, 15.6,
> 10.4, 5.1, 11, 9.7, 4.2, 3.6, 8.5, 11.5, 8.4, 6.9, 11, 10.4,
> 3.4, 3.2, 5.5, 2.4, 11.2, 2.6, 15.1, 16, 13.7, 10.5, 3.5, 13.4,
> 11.5, 12.3, 13.9, 14.5, 12.8, 16.8, 16.9, 13.5, 17.2, 12.5, 12.4,
> 11.8, 12, 10.9, 6.7, 10.9, 2.3, 5.2, 13.1, 12.1, 13.9, 12.9,
> 7.2, 12.5, 16, 11.7), Hemlock = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0), Alpine = c(0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0),
> Snow = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("D", "Day", "NDVI",
> "VegS", "T", "Hemlock", "Alpine", "Snow"), row.names = as.integer(c(NA,
> 254)), class = "data.frame")
>
>
> #Running a regression.
> library(MASS)
> fit <- glm.nb(D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow, data = data)
> summary(fit, correlation = FALSE)
>
> Call:
> glm.nb(formula = D ~ Day + NDVI + VegS + T + Hemlock + Alpine +
> Snow, data = data, init.theta = 11.3494468596771, link = log)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -3.7451 -0.7196 -0.1958 0.5389 2.7096
>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -2.882684 0.929598 -3.101 0.001929 **
> Day 0.020325 0.005540 3.669 0.000244 ***
> NDVI 1.353361 0.221471 6.111 9.91e-10 ***
> VegS 0.016731 0.004931 3.393 0.000691 ***
> T 0.074189 0.009491 7.817 5.42e-15 ***
> Hemlock -0.588858 0.174980 -3.365 0.000765 ***
> Alpine -0.452199 0.099296 -4.554 5.26e-06 ***
> Snow -1.902610 0.735708 -2.586 0.009707 **
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for Negative Binomial(11.3494) family taken to be 1)
>
> Null deviance: 515.16 on 253 degrees of freedom
> Residual deviance: 278.89 on 246 degrees of freedom
> AIC: 1300.1
>
> Number of Fisher Scoring iterations: 1
>
>
> Theta: 11.35
> Std. Err.: 2.71
>
> 2 x log-likelihood: -1282.075
>
>
> #Plotting observed versus predicted values.
> pdf(file="ObsVPred.pdf", width=4, height=4, family="Times", pointsize=11)
> par(mar = c(5,5,1,1), pch=1)
> plot(data$D, fit$fitted.values, main="",
> ylab=expression(italic(D)[predicted]),
> xlab=expression(italic(D)[observed]))
> abline(a=0,b=1, lty=2)
> lines(lowess(data$D, fit$fitted.values))
> dev.off()
>
>
> #This appears to be a decent explanatory model, but as a predictive
> #model it is systematically biased: it is biased high at low observed
> #values of D and biased low at high observed values of D.
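>
> #One way to judge this bias out of sample rather than on the training
> #data (a sketch, assuming MASS is loaded as above): 10-fold
> #cross-validated predictions plotted against the observations.
> set.seed(1)
> folds   <- sample(rep(1:10, length.out = nrow(data)))
> cv.pred <- numeric(nrow(data))
> for (k in 1:10) {
>     fit.k <- glm.nb(D ~ Day + NDVI + VegS + T + Hemlock + Alpine + Snow,
>                     data = data[folds != k, ])
>     cv.pred[folds == k] <- predict(fit.k, newdata = data[folds == k, ],
>                                    type = "response")
> }
> plot(data$D, cv.pred, xlab = "observed D", ylab = "CV predicted D")
> abline(a = 0, b = 1, lty = 2)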
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>