[R-sig-eco] GAM (mgcv) model & variable importance
Johannes Björk
bjork@joh@nne@ @end|ng |rom gm@||@com
Tue Apr 23 17:54:44 CEST 2019
Dear R list,
I’m trying to do a GAM model / variable importance using the mgcv R package. I’ve been googling around, and found posts <https://people.maths.bris.ac.uk/~sw15190/mgcv/tampere/mgcv-advanced.pdf> stating that select=T is a good way to "penalized out” less important variables.
m1 <- bam(Y ~ s(time, by=grp, k=40, bs="cc") + s(host_id, bs="re") + s(grp, bs="re") + s(tempmax_mean) + s(rain_mean) + s(area_tot) + s(area_unique) + s(diet_PCA1) + s(diet_PCA2) + s(diet_PCA3), data=df_subset, select=T)
>From the summary of m1 we can see that s(grp) and s(area_tot) are not significant (they weren’t significant for select=F either, but their F and p-value is smaller when select=T). Ok, so far so good (continues after summary).
summary(m1)
Family: gaussian
Link function: identity
Formula:
Dim1 ~ s(time2, by = grp, k = 40, bs = "cc") + s(host_id, bs = "re") +
s(grp, bs = "re") + s(tempmax_mean) + s(rain_mean) + s(area_tot) +
s(frac_uniq) + s(diet_PC1) + s(diet_PC2) + s(diet_PC3)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03382 0.03636 0.93 0.352
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time):grp1.1 31.398 38 14.326 < 2e-16 ***
s(time):grp1.21 31.370 38 22.103 < 2e-16 ***
s(time):grp1.22 33.178 38 38.372 < 2e-16 ***
s(time):grp2.1 31.892 38 17.257 < 2e-16 ***
s(time):grp2.2 27.425 38 38.245 < 2e-16 ***
s(host_id) 379.858 453 14.665 < 2e-16 ***
s(grp) 1.874 4 8.478 0.29624
s(tempmax_mean) 8.064 9 17.769 < 2e-16 ***
s(rain_mean) 8.056 9 15.347 < 2e-16 ***
s(area_tot) 1.597 9 0.649 0.08644 .
s(area_unique) 4.302 9 8.697 0.00113 **
s(diet_PC1) 5.243 9 32.816 < 2e-16 ***
s(diet_PC2) 4.020 9 3.679 1.07e-05 ***
s(diet_PC3) 5.116 9 11.008 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.468 Deviance explained = 49.2%
fREML = 15059 Scale est. = 0.52766 n = 12823
I came across another thread <http://r.789695.n4.nabble.com/Re-variance-explained-by-each-predictor-in-GAM-td896222.html> which stated that you could calculate variance explained by each predictor doing as below. So I did this for my model m1 (please see Figure1) and plotted deviance explained (please see Figure2) and AIC (please see Figure3) for each model. While the deviance explained for Model b-j only somewhat increases, the AIC between Model e-f decreases a quite lot (from 34422.59 to 29508.39). So it would seem that the inclusion of s(area_tot) increases model fit. However, as seen in the summary of m1 above (as well as the summary for model f; not shown), the term s(area_tot) is not significant in the models.
So what may cause this discrepancy? In this case, can I not use AIC to compare between models? While I would use model m1 with select=T, I figured this could be a good way to visualize model fit / variable importance.
Many thanks for any pointers,
Cheers,
Johannes
------------------------------------------
Johannes R Björk, PhD
Postdoctoral research fellow
Archie Lab
Department of Biological Sciences
University of Notre Dame, IN, USA
Alt. email: rbjork using nd.edu
Skype: johannesbjork
## Example on how to calculate variance explained by each predictor
dat <- gamSim(1,n=400,dist="normal",scale=2)
b<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
b2<-gam(y~s(x0)+s(x1)+s(x3),sp=b$sp[-3],data=dat)
summary(b2)$dev.expl
summary(b)$dev.expl
Figure1
Figure2
Figure3
More information about the R-sig-ecology
mailing list