[R-sig-ME] bam model selection with 3 million data
David Villegas Ríos
ch|r|eu @end|ng |rom gm@||@com
Sat Feb 1 19:57:29 CET 2020
Dear list,
I´m investigating the effect of three variables (X, Y, Z) on the
probability that an animal uses a particular habitat A. I have a time
series of relocations for each animal (>300 individuals), with one
relocation every 30 minutes. There are only two options for the response
variable: 1=present in habitat A, 0=not present in habitat A. The effects
of the three variables are expected to be non-linear so I´m using gam
models. My dataset is very large, with >3 million data points so I´m using
the bam function from the mgcv library in R. In my models I include a
random effect “individual ID”, and a temporal autocorrelation term that
corrects much but not all of the autocorrelation in the models.
*Question 1.*
When I run a model with the three main effects (X, Y, Z) and the three
double interactions (X:Y, X:Z, Y:Z), I get that all terms are highly
significant, except for one interaction. If I remove it, then everything is
highly significant. However, I also wanted to run simpler models with only
one interaction, no interactions, only two main effects and only one main
effect. Then, if I compare all these models with AIC or BIC, I get that the
best model (by far) is the one with only main effects.
>
AIC(codcoaAR2,codcoaAR2.1,codcoaAR2.2,codcoaAR2.3,codcoaAR2.4,codcoaAR2.5,codcoaAR2.6,codcoaAR2.7,codcoaAR2.8,codcoaAR2.9,codcoaAR)
df AIC
codcoaAR2 306.1310 -1442543
codcoaAR2.1 293.1608 -1440642
codcoaAR2.2 292.9615 -1438219
codcoaAR2.3 294.3657 -1435346
codcoaAR2.4 284.0026 -1434286
codcoaAR2.5 280.3472 -1396765
codcoaAR2.6 279.6380 -1435862
codcoaAR2.7 269.4968 -1377806
codcoaAR2.8 269.0480 -1393897
codcoaAR2.9 281.8584 -1214270
codcoaAR 271.7066 -2353481 # model with only main effects
I wonder how this is possible if two of the interactions are highly
significant.
So my underlying question is: *for a model like this in which sample size
is huge, should I make model selection looking at the significance of the
different terms in the model, or should I rather look at AIC/BIC?*
*Question 2.*
Let´s assume the model with only main effects is indeed the optimal one.
Then I´d like to get the effect size of each explanatory variable. It´s not
clear to me how to do it even after reading some post on this and other
forums, but I tried to figure it out by sequentially running the model
without one explanatory variable at a time, and then comparing the deviance
explained in the optimal model with X, Y, Z with the deviance explained
with the reduced model with only Y and Z, for instance. Assuming that the
difference would the variance explained by X. *Is this correct? *Looking at
the results, the deviance explained by each variable X, Y, Z is quite low,
but if the three main effects explain so little variance, who is explaining
the rest?
Model
Deviance explained
X, Z, Y
69.3%
Y, Z
68.5%
X, Z
69.3%
X, Y
60.5%
*Question 3.*
In my models I usually get this error message:
Warning message:
In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef, :
fitted probabilities numerically 0 or 1 occurred
which seems to indicate that there is perfect separation in my logistic
regression. I´m not sure this is the case in my data, how could I check it
and correct for it if needed? Should it be always corrected?
Thanks for your help,
David
[[alternative HTML version deleted]]
More information about the R-sig-mixed-models
mailing list