[R-sig-ME] bam model selection with 3 million data

Sat Feb 1 19:57:29 CET 2020

Dear list,

I'm investigating the effect of three variables (X, Y, Z) on the
probability that an animal uses a particular habitat A. I have a time
series of relocations for each animal (>300 individuals), with one
relocation every 30 minutes. The response variable is binary:
1 = present in habitat A, 0 = not present in habitat A. The effects
of the three variables are expected to be non-linear, so I'm using GAMs.
My dataset is very large (>3 million data points), so I'm using
the bam function from the mgcv package in R. My models include a
random effect for “individual ID” and a temporal autocorrelation term that
corrects much, but not all, of the autocorrelation in the models.
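
For readers who want something concrete, here is a minimal sketch of this kind of model on simulated data. Everything in it is an assumption for illustration, not the actual code or data: the data frame `dat`, the response name `inA`, the `start` indicator, the basis choices and the value `rho = 0.3` are all placeholders. It uses `discrete = TRUE`, under which mgcv documents that `rho` applies an AR1 model to the working residuals of non-Gaussian models.

```r
library(mgcv)

set.seed(1)
n_id <- 20   # hypothetical number of individuals (real data: >300)
n_t  <- 200  # hypothetical relocations per individual

# Simulated stand-in data; rows are ordered by individual, then time.
dat <- data.frame(
  id = factor(rep(seq_len(n_id), each = n_t)),
  X  = runif(n_id * n_t),
  Y  = runif(n_id * n_t),
  Z  = runif(n_id * n_t)
)
# TRUE at the first relocation of each individual, so the AR1
# structure restarts for every animal.
dat$start <- rep(c(TRUE, rep(FALSE, n_t - 1)), n_id)

# Binary response: 1 = present in habitat A, 0 = not present.
id_effect <- rnorm(n_id)[as.integer(dat$id)]
eta <- sin(2 * pi * dat$X) + 2 * dat$Y - 1 + id_effect
dat$inA <- rbinom(nrow(dat), 1, plogis(eta))

# Main-effects model: smooths of X, Y, Z, a random intercept per
# individual, and an AR1 term with a placeholder rho.
m <- bam(inA ~ s(X) + s(Y) + s(Z) + s(id, bs = "re"),
         family = binomial, data = dat,
         discrete = TRUE, rho = 0.3, AR.start = dat$start)
summary(m)
```

In practice `rho` would be chosen from the residual autocorrelation of a model fitted without it, and the data must be sorted in time within each individual for `AR.start` to make sense.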

*Question 1.*

When I run a model with the three main effects (X, Y, Z) and the three
two-way interactions (X:Y, X:Z, Y:Z), all terms are highly
significant except for one interaction. If I remove it, then everything is
highly significant. However, I also wanted to run simpler models with only
one interaction, no interactions, only two main effects and only one main
effect. When I compare all these models with AIC or BIC, the
best model (by far) is the one with only main effects.

> AIC(codcoaAR2, codcoaAR2.1, codcoaAR2.2, codcoaAR2.3, codcoaAR2.4,
+     codcoaAR2.5, codcoaAR2.6, codcoaAR2.7, codcoaAR2.8, codcoaAR2.9,
+     codcoaAR)
                  df      AIC
codcoaAR2   306.1310 -1442543
codcoaAR2.1 293.1608 -1440642
codcoaAR2.2 292.9615 -1438219
codcoaAR2.3 294.3657 -1435346
codcoaAR2.4 284.0026 -1434286
codcoaAR2.5 280.3472 -1396765
codcoaAR2.6 279.6380 -1435862
codcoaAR2.7 269.4968 -1377806
codcoaAR2.8 269.0480 -1393897
codcoaAR2.9 281.8584 -1214270
codcoaAR    271.7066 -2353481  # model with only main effects

I wonder how this is possible if two of the interactions are highly
significant.

So my underlying question is: *for a model like this, in which the sample
size is huge, should I base model selection on the significance of the
different terms in the model, or should I rather look at AIC/BIC?*

*Question 2.*

Let's assume the model with only main effects is indeed the optimal one.
I'd then like to get the effect size of each explanatory variable. It's not
clear to me how to do this, even after reading some posts on this and other
forums, but I tried sequentially refitting the model
without one explanatory variable at a time, and then comparing the deviance
explained by the optimal model with X, Y, Z with the deviance explained
by the reduced model (with only Y and Z, for instance), on the assumption
that the difference is the deviance explained by X. *Is this correct?*
Looking at the results, the deviance explained by each variable X, Y, Z is
quite low, but if the three main effects explain so little, what is
explaining the rest?

Model      Deviance explained
X, Z, Y    69.3%
Y, Z       68.5%
X, Z       69.3%
X, Y       60.5%
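
To make the drop-one comparison explicit, the arithmetic implied by the table can be sketched as below. The numbers are the ones reported above; the vector and element names are placeholders introduced only for this illustration.

```r
# Deviance explained (%) by the full model and by each reduced model,
# copied from the table above. Names are illustrative placeholders.
dev_expl <- c(full = 69.3, drop_X = 68.5, drop_Y = 69.3, drop_Z = 60.5)

# Drop in deviance explained when each covariate is removed in turn.
drop_one <- round(dev_expl[["full"]] -
                  dev_expl[c("drop_X", "drop_Y", "drop_Z")], 1)
names(drop_one) <- c("X", "Y", "Z")
print(drop_one)

# Part of the full model's deviance explained that this procedure
# does not attribute to any single covariate.
unattributed <- round(dev_expl[["full"]] - sum(drop_one), 1)
print(unattributed)
```

By this accounting X accounts for 0.8 percentage points, Y for 0.0 and Z for 8.8, which leaves 59.7 points not attributed to any single covariate. Since every model in the table also contains the individual random effect, that term is presumably responsible for a large share of the remainder.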

*Question 3.*

When fitting my models I usually get this warning:

Warning message:
In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,  :
  fitted probabilities numerically 0 or 1 occurred

which seems to indicate that there is perfect separation in my logistic
regression. I'm not sure this is the case in my data. How could I check for
it, and correct for it if needed? Should it always be corrected?

Thanks for your help,

David
