[R-sig-ME] bam model selection with 3 million data
Voeten, C.C.
c@c@voeten @end|ng |rom hum@|e|denun|v@n|
Sat Feb 1 20:39:08 CET 2020
Hi David,
1) You cannot perform likelihood-based model comparisons with bam models, or -- for completeness' sake -- with gam models that were fitted using performance iteration or the EFS optimizer. All of these are based on PQL (penalized quasi-likelihood), which makes the log-likelihood (and hence LRT, AIC, BIC, etc) invalid for comparison purposes. See Wood (2017:149-151). gam() with the default outer iteration should be fine, though. Have you tried fitting your full model using bam with the select=TRUE argument to turn on mgcv's automatic smooth-term selection?
2) I am unsure if the deviance explained is or is not suitable for indicating effect size, so I can't comment on this question. I might, however, have an alternative suggestion: have you considered partial eta squared or partial omega squared? You should be able to calculate those based on the ANOVA table.
3) I agree with you that the warning suggests complete separation, but in my experience this doesn't automatically have to be a problem. Have you checked the summary for extremely large beta values, and also have you run gam.check() to see if your fit looks reasonable? If neither indicates a problem I wouldn't be too concerned about it.
Hope this helps,
Cesko
P.S.: please send messages in plain text only, as you can see the formatting of your message was slightly screwed up because the mailing list automatically strips HTML markup
-----Oorspronkelijk bericht-----
Van: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org> Namens David Villegas Ríos
Verzonden: zaterdag 1 februari 2020 19:57
Aan: r-sig-mixed-models <r-sig-mixed-models using r-project.org>
Onderwerp: [R-sig-ME] bam model selection with 3 million data
Dear list,
I´m investigating the effect of three variables (X, Y, Z) on the probability that an animal uses a particular habitat A. I have a time series of relocations for each animal (>300 individuals), with one relocation every 30 minutes. There are only two options for the response
variable: 1=present in habitat A, 0=not present in habitat A. The effects of the three variables are expected to be non-linear so I´m using gam models. My dataset is very large, with >3 million data points so I´m using the bam function from the mgcv library in R. In my models I include a random effect “individual ID”, and a temporal autocorrelation term that corrects much but not all of the autocorrelation in the models.
*Question 1.*
When I run a model with the three main effects (X, Y, Z) and the three double interactions (X:Y, X:Z, Y:Z), I get that all terms are highly significant, except for one interaction. If I remove it, then everything is highly significant. However, I also wanted to run simpler models with only one interaction, no interactions, only two main effects and only one main effect. Then, if I compare all these models with AIC or BIC, I get that the best model (by far) is the one with only main effects.
>
AIC(codcoaAR2,codcoaAR2.1,codcoaAR2.2,codcoaAR2.3,codcoaAR2.4,codcoaAR2.5,codcoaAR2.6,codcoaAR2.7,codcoaAR2.8,codcoaAR2.9,codcoaAR)
df AIC
codcoaAR2 306.1310 -1442543
codcoaAR2.1 293.1608 -1440642
codcoaAR2.2 292.9615 -1438219
codcoaAR2.3 294.3657 -1435346
codcoaAR2.4 284.0026 -1434286
codcoaAR2.5 280.3472 -1396765
codcoaAR2.6 279.6380 -1435862
codcoaAR2.7 269.4968 -1377806
codcoaAR2.8 269.0480 -1393897
codcoaAR2.9 281.8584 -1214270
codcoaAR 271.7066 -2353481 # model with only main effects
I wonder how this is possible if two of the interactions are highly significant.
So my underlying question is: *for a model like this in which sample size is huge, should I make model selection looking at the significance of the different terms in the model, or should I rather look at AIC/BIC?*
*Question 2.*
Let´s assume the model with only main effects is indeed the optimal one.
Then I´d like to get the effect size of each explanatory variable. It´s not clear to me how to do it even after reading some post on this and other forums, but I tried to figure it out by sequentially running the model without one explanatory variable at a time, and then comparing the deviance explained in the optimal model with X, Y, Z with the deviance explained with the reduced model with only Y and Z, for instance. Assuming that the difference would the variance explained by X. *Is this correct? *Looking at the results, the deviance explained by each variable X, Y, Z is quite low, but if the three main effects explain so little variance, who is explaining the rest?
Model
Deviance explained
X, Z, Y
69.3%
Y, Z
68.5%
X, Z
69.3%
X, Y
60.5%
*Question 3.*
In my models I usually get this error message:
Warning message:
In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef, :
fitted probabilities numerically 0 or 1 occurred
which seems to indicate that there is perfect separation in my logistic regression. I´m not sure this is the case in my data, how could I check it and correct for it if needed? Should it be always corrected?
Thanks for your help,
David
[[alternative HTML version deleted]]
_______________________________________________
R-sig-mixed-models using r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
More information about the R-sig-mixed-models
mailing list