[R-sig-ME] bam model selection with 3 million data

Sat Feb 1 20:39:08 CET 2020

Hi David,

1) You cannot perform likelihood-based model comparisons with bam models, or -- for completeness' sake -- with gam models that were fitted using performance iteration or the EFS optimizer. All of these are based on PQL (penalized quasi-likelihood), which makes the log-likelihood (and hence LRT, AIC, BIC, etc) invalid for comparison purposes. See Wood (2017:149-151). gam() with the default outer iteration should be fine, though. Have you tried fitting your full model using bam with the select=TRUE argument to turn on mgcv's automatic smooth-term selection?

2) I am unsure if the deviance explained is or is not suitable for indicating effect size, so I can't comment on this question. I might, however, have an alternative suggestion: have you considered partial eta squared or partial omega squared? You should be able to calculate those based on the ANOVA table.

3) I agree with you that the warning suggests complete separation, but in my experience this doesn't automatically have to be a problem. Have you checked the summary for extremely large beta values, and also have you run gam.check() to see if your fit looks reasonable? If neither indicates a problem I wouldn't be too concerned about it.

Hope this helps,

Cesko

P.S.: please send messages in plain text only, as you can see the formatting of your message was slightly screwed up because the mailing list automatically strips HTML markup

-----Oorspronkelijk bericht-----
Van: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org> Namens David Villegas Ríos
Verzonden: zaterdag 1 februari 2020 19:57
Aan: r-sig-mixed-models <r-sig-mixed-models using r-project.org>
Onderwerp: [R-sig-ME] bam model selection with 3 million data

Dear list,

I´m investigating the effect of three variables (X, Y, Z) on the probability that an animal uses a particular habitat A. I have a time series of relocations for each animal (>300 individuals), with one relocation every 30 minutes. There are only two options for the response
variable: 1=present in habitat A, 0=not present in habitat A. The effects of the three variables are expected to be non-linear so I´m using gam models. My dataset is very large, with >3 million data points so I´m using the bam function from the mgcv library in R. In my models I include a random effect “individual ID”, and a temporal autocorrelation term that corrects much but not all of the autocorrelation in the models.

*Question 1.*

When I run a model with the three main effects (X, Y, Z) and the three double interactions (X:Y, X:Z, Y:Z), I get that all terms are highly significant, except for one interaction. If I remove it, then everything is highly significant. However, I also wanted to run simpler models with only one interaction, no interactions, only two main effects and only one main effect. Then, if I compare all these models with AIC or BIC, I get that the best model (by far) is the one with only main effects.

>
    AIC(codcoaAR2,codcoaAR2.1,codcoaAR2.2,codcoaAR2.3,codcoaAR2.4,codcoaAR2.5,codcoaAR2.6,codcoaAR2.7,codcoaAR2.8,codcoaAR2.9,codcoaAR)

                  df      AIC

codcoaAR2   306.1310 -1442543

codcoaAR2.1 293.1608 -1440642

codcoaAR2.2 292.9615 -1438219

codcoaAR2.3 294.3657 -1435346

codcoaAR2.4 284.0026 -1434286

codcoaAR2.5 280.3472 -1396765

codcoaAR2.6 279.6380 -1435862

codcoaAR2.7 269.4968 -1377806

codcoaAR2.8 269.0480 -1393897

codcoaAR2.9 281.8584 -1214270

codcoaAR    271.7066 -2353481  # model with only main effects

I wonder how this is possible if two of the interactions are highly significant.

So my underlying question is: *for a model like this in which sample size is huge, should I make model selection looking at the significance of the different terms in the model, or should I rather look at AIC/BIC?*

*Question 2.*

Let´s assume the model with only main effects is indeed the optimal one.
Then I´d like to get the effect size of each explanatory variable. It´s not clear to me how to do it even after reading some post on this and other forums, but I tried to figure it out by sequentially running the model without one explanatory variable at a time, and then comparing the deviance explained in the optimal model with X, Y, Z with the deviance explained with the reduced model with only Y and Z, for instance. Assuming that the difference would the variance explained by X. *Is this correct? *Looking at the results, the deviance explained by each variable X, Y, Z is quite low, but if the three main effects explain so little variance, who is explaining the rest?

Model

Deviance explained

X, Z, Y

69.3%

Y, Z

68.5%

X, Z

69.3%

X, Y

60.5%

*Question 3.*

In my models I usually get this error message:

Warning message:

In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,  :

  fitted probabilities numerically 0 or 1 occurred

which seems to indicate that there is perfect separation in my logistic regression. I´m not sure this is the case in my data, how could I check it and correct for it if needed? Should it be always corrected?

Thanks for your help,

David

	[[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models using r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models