[R] choosing between Poisson regression models: no interactions vs. interactions
Kyle.
ambertk at ohsu.edu
Tue Jul 31 21:23:40 CEST 2007
If you have access to it, chapter 7, section three of Venables and
Ripley's "Modern Applied Statistics with S" is probably going to tell
you everything you need to know. In general, let's say you're
wanting to look at the following:
model 1: Total.vines ~ Site + Species + DBH
model 2: Total.vines ~ Site * Species + DBH
Download the MASS package and load it into the library
> library(MASS)
> model1<-glm(Total.vines~Site+Specites+DBH,family=poisson)
> model2<-addterm(model1, . ~ .*Species+DBH,test="Chisq")
This should return a table including the respective AICs, likelihood
ratio tests, and the p-value associated with whether or not model2 is
significantly different than model 1. In general though, it's best
to choose the model which minimizes the AIC. Based on the output you
included, it appears that the model including the interaction term is
better, but the above might give you the information you seek.
Kyle H. Ambert
Graduate Student, Dept. Behavioral Neuroscience
Oregon Health & Science University
ambertk at ohsu.edu
On Jul 31, 2007, at 11:13 AM, James Milks wrote:
> R gurus,
>
> I'm working on data analysis for a small project. My response
> variable is total vines per tree (median = 0, mean = 1.65, min = 0,
> max = 24). My predictors are two categorical variables (four sites
> and four species) and one continuous (tree diameter at breast height
> (DBH)). The main question I'm attempting to answer is whether or not
> the species identity of a tree has any effects on the number of vines
> clinging to the trunk. Given that the response variable is count
> data, I decided to use Poisson regression, even though I'm not as
> familiar with it as linear or logit regression.
>
> My problem is deciding which model to use. I have created several,
> one without interaction terms (Total.vines~Site+Species+DBH), one
> with an interaction term between Site and Species
> (Total.vines~Site*Species+DBH), and one with interactions between all
> variables (Total.vines~Site*Species*DBH). Here is my output from R
> for the first two models (the last model has the same number (and
> identity) of significant variables as the second model, even though
> the last model had more interaction terms overall):
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Call:
> glm(formula = Total.vines ~ Site + Species + DBH, family = poisson)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -5.2067 -1.2915 -0.7095 -0.3525 6.3756
>
> Coefficients:
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -2.987695 0.231428 -12.910 < 2e-16 ***
> SiteHuffman Dam 2.725193 0.249423 10.926 < 2e-16 ***
> SiteNarrows 1.902987 0.227599 8.361 < 2e-16 ***
> SiteSugar Creek 1.752754 0.242186 7.237 4.58e-13 ***
> SpeciesFRAM 0.955468 0.157423 6.069 1.28e-09 ***
> SpeciesPLOC 1.187903 0.141707 8.383 < 2e-16 ***
> SpeciesULAM 0.340792 0.184615 1.846 0.0649 .
> DBH 0.020708 0.001292 16.026 < 2e-16 ***
> ---
> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
> (Dispersion parameter for poisson family taken to be 1)
>
> Null deviance: 1972.3 on 544 degrees of freedom
> Residual deviance: 1290.0 on 537 degrees of freedom
> AIC: 1796.0
>
> Number of Fisher Scoring iterations: 6
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Call:
> glm(formula = Total.vines ~ Site * Species + DBH, family = poisson,
> data = sycamores.1)
>
> Deviance Residuals:
> Min 1Q Median 3Q Max
> -4.9815 -1.2370 -0.6339 -0.3403 6.5664
>
> Coefficients: (3 not defined because of singularities)
> Estimate Std. Error z value Pr(>|z|)
> (Intercept) -2.788243 0.303064 -9.200 < 2e-16 ***
> SiteHuffman Dam 1.838952 0.354127 5.193 2.07e-07 ***
> SiteNarrows 2.252716 0.323184 6.970 3.16e-12 ***
> SiteSugar Creek -12.961519 519.152077 -0.025 0.980082
> SpeciesFRAM 13.938716 519.152230 0.027 0.978580
> SpeciesPLOC 0.240223 0.540676 0.444 0.656824
> SpeciesULAM 1.919586 0.540246 3.553 0.000381 ***
> DBH 0.019984 0.001337 14.946 < 2e-16 ***
> SiteHuffman Dam:SpeciesFRAM -11.513823 519.152294 -0.022 0.982306
> SiteNarrows:SpeciesFRAM -13.593127 519.152268 -0.026 0.979111
> SiteSugar Creek:SpeciesFRAM NA NA NA NA
> SiteHuffman Dam:SpeciesPLOC NA NA NA NA
> SiteNarrows:SpeciesPLOC 0.397503 0.555218 0.716 0.474028
> SiteSugar Creek:SpeciesPLOC 15.640450 519.152277 0.030 0.975966
> SiteHuffman Dam:SpeciesULAM -0.102841 0.610027 -0.169 0.866124
> SiteNarrows:SpeciesULAM -2.809092 0.606804 -4.629 3.67e-06 ***
> SiteSugar Creek:SpeciesULAM NA NA NA NA
> ---
> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
>
> (Dispersion parameter for poisson family taken to be 1)
>
> Null deviance: 1972.3 on 544 degrees of freedom
> Residual deviance: 1178.7 on 531 degrees of freedom
> AIC: 1696.6
>
> Number of Fisher Scoring iterations: 13
> %%%%%%%%%%%%%%%%%%%%
>
> As you can see, the two models give very different output, especially
> in regards to whether or not the individual species are significant.
> In the no-interaction model, the only species that was not
> significant was ULAM. In the one-way interaction model, ULAM was the
> only significant species. My question is this: which model should I
> use when I present this analysis? I know that the one-way
> interaction model has the lower AIC. Should I base my choice solely
> on AIC? The reasons I'm asking is that the second model has only one
> significant interaction term, fewer significant terms overall, and
> three undefined terms.
>
> Thanks for any guidance you can give to someone running his first
> Poisson regression.
>
> Jim Milks
>
> Graduate Student
> Environmental Sciences Ph.D. Program
> 136 Biological Sciences
> Wright State University
> 3640 Colonel Glenn Hwy
> Dayton, OH 45435
>
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list