[R-sig-eco] GAM (mgcv) model & variable importance

Tue Apr 23 17:54:44 CEST 2019

Dear R list,

I’m trying to do a GAM model / variable importance using the mgcv R package. I’ve been googling around, and found posts <https://people.maths.bris.ac.uk/~sw15190/mgcv/tampere/mgcv-advanced.pdf> stating that select=T is a good way to "penalized out” less important variables.  

m1 <- bam(Y ~ s(time, by=grp, k=40, bs="cc") + s(host_id, bs="re") + s(grp, bs="re") + s(tempmax_mean) + s(rain_mean) + s(area_tot) + s(area_unique) + s(diet_PCA1) + s(diet_PCA2) + s(diet_PCA3), data=df_subset, select=T)

>From the summary of m1 we can see that s(grp) and s(area_tot) are not significant (they weren’t significant for select=F either, but their F and p-value is smaller when select=T). Ok, so far so good (continues after summary).

summary(m1)
Family: gaussian 
Link function: identity 

Formula:
Dim1 ~ s(time2, by = grp, k = 40, bs = "cc") + s(host_id, bs = "re") + 
    s(grp, bs = "re") + s(tempmax_mean) + s(rain_mean) + s(area_tot) + 
    s(frac_uniq) + s(diet_PC1) + s(diet_PC2) + s(diet_PC3)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03382    0.03636    0.93    0.352

Approximate significance of smooth terms:
                     edf Ref.df      F  p-value    
s(time):grp1.1   31.398     38 14.326  < 2e-16 ***
s(time):grp1.21  31.370     38 22.103  < 2e-16 ***
s(time):grp1.22  33.178     38 38.372  < 2e-16 ***
s(time):grp2.1   31.892     38 17.257  < 2e-16 ***
s(time):grp2.2   27.425     38 38.245  < 2e-16 ***
s(host_id)         379.858    453 14.665  < 2e-16 ***
s(grp)             1.874      4  8.478  0.29624    
s(tempmax_mean)    8.064      9 17.769  < 2e-16 ***
s(rain_mean)       8.056      9 15.347  < 2e-16 ***
s(area_tot)        1.597      9  0.649  0.08644 .  
s(area_unique)       4.302      9  8.697  0.00113 ** 
s(diet_PC1)        5.243      9 32.816  < 2e-16 ***
s(diet_PC2)        4.020      9  3.679 1.07e-05 ***
s(diet_PC3)        5.116      9 11.008  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.468   Deviance explained = 49.2%
fREML =  15059  Scale est. = 0.52766   n = 12823

I came across another thread <http://r.789695.n4.nabble.com/Re-variance-explained-by-each-predictor-in-GAM-td896222.html> which stated that you could calculate variance explained by each predictor doing as below. So I did this for my model m1 (please see Figure1) and plotted deviance explained (please see Figure2) and AIC (please see Figure3) for each model. While the deviance explained for Model b-j only somewhat increases, the AIC between Model e-f decreases a quite lot (from 34422.59 to 29508.39). So it would seem that the inclusion of s(area_tot) increases model fit. However, as seen in the summary of m1 above (as well as the summary for model f; not shown), the term s(area_tot) is not significant in the models. 

So what may cause this discrepancy? In this case, can I not use AIC to compare between models? While I would use model m1 with select=T, I figured this could be a good way to visualize model fit / variable importance. 

Many thanks for any pointers,

Cheers,
Johannes

------------------------------------------
Johannes R Björk, PhD
Postdoctoral research fellow
Archie Lab
Department of Biological Sciences
University of Notre Dame, IN, USA
Alt. email: rbjork using nd.edu
Skype: johannesbjork

## Example on how to calculate variance explained by each predictor
dat <- gamSim(1,n=400,dist="normal",scale=2) 
b<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat) 
b2<-gam(y~s(x0)+s(x1)+s(x3),sp=b$sp[-3],data=dat) 
summary(b2)$dev.expl 
summary(b)$dev.expl 

Figure1

Figure2

Figure3