[R-sig-ME] bam model selection with 3 million data

David Villegas Ríos chirleu at gmail.com
Tue Feb 4 10:51:57 CET 2020


Thanks Cesko.
I tried your suggestion (using the select argument) and arrived at a
reasonable optimal model in which I keep all the main effects and
interactions except for one main effect.
I also tried, however, to do model selection with the buildbam function
(buildmer package) and got a different optimal model, in which all effects
are highly significant. I'm not sure why. According to the buildmer package
documentation:

 "As bam uses PQL, only crit='deviance' is supported."
where
crit=Character string or vector determining the criterion used to test
terms for elimination. Possible options are 'LRT' (likelihood-ratio test;
this is the default), 'LL' (use the raw -2 log likelihood), 'AIC' (Akaike
Information Criterion), 'BIC' (Bayesian Information Criterion), and
'deviance' (explained deviance – note that this is not a formal test)
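
For concreteness, a minimal sketch of how crit='deviance' is passed to
buildbam (the response, predictors, random effect and data frame names are
placeholders, not the real ones; in later buildmer versions crit is supplied
through buildmerControl() rather than directly):

library(mgcv)
library(buildmer)

m.sel <- buildbam(habitatA ~ s(X) + s(Y) + s(Z) +
                    ti(X, Y) + ti(X, Z) + ti(Y, Z) +
                    s(ID, bs = "re"),        # placeholder terms
                  data = dat,                # placeholder data frame
                  family = binomial,
                  crit = "deviance")         # only criterion supported for bam
summary(m.sel)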

I'm not sure if you are familiar with the buildmer package, though.

Thanks,

David

On Mon, 3 Feb 2020 at 13:09, Voeten, C.C. (<c.c.voeten using hum.leidenuniv.nl>) wrote:

> Hi David,
>
> Please keep the mailing list in cc.
>
> Yes, gam() can indeed be prohibitively slow, which is why I suggested the
> select=TRUE approach. I hope the explanation below helps.
>
> Smooth terms are composed of a null space and a range space, which
> respectively represent the completely smooth part of the effect and the
> wiggly part of the effect. In the mathematical representation, these are
> equivalent to fixed effects and random effects, respectively. As such, the
> range space is subject to penalization and can shrink to zero. In GAM
> terms, this would correspond to a smoothing parameter tending to infinity,
> resulting in a completely smooth effect: you would be left only with the
> null space. What select=TRUE does is add an additional penalty to all null
> spaces, so that these too can shrink towards zero. Effects which are shrunk
> to (near) zero in this way can then be removed from the model by the user.
>
> The degree of shrinkage can be seen by looking at the edf. Smooths that
> are (near) zero will also use (near) zero edf. In practice, when I do model
> selection using select=TRUE, I always remove terms whose edf < 1. Then I
> fit the model again with select=TRUE to ensure that no further reduction is
> necessary, and then I fit a final model with select=FALSE.
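>
> For concreteness, a minimal sketch of that workflow (the response, smooths,
> random effect and data frame names below are placeholders, not taken from
> this thread):
>
> library(mgcv)
>
> m1 <- bam(habitatA ~ s(X) + s(Y) + s(Z) + s(ID, bs = "re"),
>           data = dat, family = binomial, select = TRUE)
> summary(m1)   # smooths shrunk away by the extra penalty show edf near 0
>
> # suppose s(Z) came out with edf < 1: drop it and refit with select = TRUE
> m2 <- bam(habitatA ~ s(X) + s(Y) + s(ID, bs = "re"),
>           data = dat, family = binomial, select = TRUE)
> summary(m2)   # check that no further terms shrink towards zero
>
> # final model without the extra null-space penalty
> m3 <- bam(habitatA ~ s(X) + s(Y) + s(ID, bs = "re"),
>           data = dat, family = binomial)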
>
> Best,
> Cesko
>
> -----Original message-----
> From: David Villegas Ríos <chirleu using gmail.com>
> Sent: Monday, 3 February 2020 11:28
> To: Voeten, C.C. <c.c.voeten using hum.leidenuniv.nl>
> Subject: Re: [R-sig-ME] bam model selection with 3 million data
>
> Thanks Cesko, I really appreciate your answer.
> In my particular case, however, I cannot run gam models (they take forever
> to run), so I'm still struggling with how to perform model selection in
> bam. I tried select=TRUE and the summary changed a bit, but I don't really
> know what this argument is doing behind the scenes, or whether I should
> trust the summary from the model using select=TRUE rather than the default
> select=FALSE.
> Best wishes,
> David
>
> On Sat, 1 Feb 2020 at 20:39, Voeten, C.C. (<c.c.voeten using hum.leidenuniv.nl>) wrote:
> Hi David,
>
> 1) You cannot perform likelihood-based model comparisons with bam models,
> or -- for completeness' sake -- with gam models that were fitted using
> performance iteration or the EFS optimizer. All of these are based on PQL
> (penalized quasi-likelihood), which makes the log-likelihood (and hence
> LRT, AIC, BIC, etc) invalid for comparison purposes. See Wood
> (2017:149-151). gam() with the default outer iteration should be fine,
> though. Have you tried fitting your full model using bam with the
> select=TRUE argument to turn on mgcv's automatic smooth-term selection?
>
> 2) I am unsure if the deviance explained is or is not suitable for
> indicating effect size, so I can't comment on this question. I might,
> however, have an alternative suggestion: have you considered partial eta
> squared or partial omega squared? You should be able to calculate those
> based on the ANOVA table.
>
> 3) I agree with you that the warning suggests complete separation, but in
> my experience this doesn't automatically have to be a problem. Have you
> checked the summary for extremely large beta values, and also have you run
> gam.check() to see if your fit looks reasonable? If neither indicates a
> problem I wouldn't be too concerned about it.
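>
> A minimal sketch of those checks (m stands for the fitted bam object; the
> name is a placeholder):
>
> summary(m)        # watch for parametric coefficients with huge estimates
>                   # and huge standard errors on the link (logit) scale
> range(fitted(m))  # fitted probabilities pinned at exactly 0 or 1
> gam.check(m)      # basis-dimension checks and residual diagnostics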
>
> Hope this helps,
>
> Cesko
>
> P.S.: please send messages in plain text only; as you can see, the
> formatting of your message was slightly screwed up because the mailing list
> automatically strips HTML markup.
>
> -----Original message-----
> From: R-sig-mixed-models <r-sig-mixed-models-bounces using r-project.org>
> On behalf of David Villegas Ríos
> Sent: Saturday, 1 February 2020 19:57
> To: r-sig-mixed-models <r-sig-mixed-models using r-project.org>
> Subject: [R-sig-ME] bam model selection with 3 million data
>
> Dear list,
>
> I'm investigating the effect of three variables (X, Y, Z) on the
> probability that an animal uses a particular habitat A. I have a time
> series of relocations for each animal (>300 individuals), with one
> relocation every 30 minutes. There are only two options for the response
> variable: 1 = present in habitat A, 0 = not present in habitat A. The
> effects of the three variables are expected to be non-linear, so I'm using
> gam models. My dataset is very large, with >3 million data points, so I'm
> using the bam function from the mgcv package in R. In my models I include a
> random effect "individual ID" and a temporal autocorrelation term that
> corrects much, but not all, of the autocorrelation in the models.
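>
> A minimal sketch of this kind of model call (all names are placeholders;
> note that mgcv's help documents the rho AR1 residual model mainly for
> Gaussian-identity models, so its use with a binomial response should be
> treated with care):
>
> library(mgcv)
>
> # start.event is TRUE at the first relocation of each individual's series
> m <- bam(habitatA ~ s(X) + s(Y) + s(Z) + s(ID, bs = "re"),
>          data = dat, family = binomial,
>          rho = 0.9, AR.start = dat$start.event,
>          discrete = TRUE)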
>
> *Question 1.*
>
> When I run a model with the three main effects (X, Y, Z) and the three
> two-way interactions (X:Y, X:Z, Y:Z), I get that all terms are highly
> significant except for one interaction. If I remove it, then everything is
> highly significant. However, I also wanted to run simpler models with only
> one interaction, no interactions, only two main effects, and only one main
> effect. Then, if I compare all these models with AIC or BIC, I get that the
> best model (by far) is the one with only main effects.
>
>
> > AIC(codcoaAR2, codcoaAR2.1, codcoaAR2.2, codcoaAR2.3, codcoaAR2.4,
> +     codcoaAR2.5, codcoaAR2.6, codcoaAR2.7, codcoaAR2.8, codcoaAR2.9,
> +     codcoaAR)
>                   df      AIC
> codcoaAR2   306.1310 -1442543
> codcoaAR2.1 293.1608 -1440642
> codcoaAR2.2 292.9615 -1438219
> codcoaAR2.3 294.3657 -1435346
> codcoaAR2.4 284.0026 -1434286
> codcoaAR2.5 280.3472 -1396765
> codcoaAR2.6 279.6380 -1435862
> codcoaAR2.7 269.4968 -1377806
> codcoaAR2.8 269.0480 -1393897
> codcoaAR2.9 281.8584 -1214270
> codcoaAR    271.7066 -2353481  # model with only main effects
>
>
>
> I wonder how this is possible if two of the interactions are highly
> significant.
>
> So my underlying question is: *for a model like this, in which the sample
> size is huge, should I do model selection based on the significance of the
> different terms in the model, or should I rather look at AIC/BIC?*
>
> *Question 2.*
>
> Let's assume the model with only main effects is indeed the optimal one.
> Then I'd like to get the effect size of each explanatory variable. It's
> not clear to me how to do this even after reading some posts on this and
> other forums, but I tried to figure it out by sequentially running the
> model without one explanatory variable at a time, and then comparing the
> deviance explained by the optimal model with X, Y, Z with the deviance
> explained by the reduced model with only Y and Z, for instance, assuming
> that the difference would be the deviance explained by X. *Is this
> correct?* Looking at the results, the deviance explained by each variable
> X, Y, Z is quite low, but if the three main effects explain so little
> deviance, what is explaining the rest?
>
> Model      Deviance explained
> X, Z, Y    69.3%
> Y, Z       68.5%
> X, Z       69.3%
> X, Y       60.5%
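>
> A minimal sketch of that comparison (model and variable names are
> placeholders; summary.gam reports the explained deviance as dev.expl):
>
> m.xyz <- bam(habitatA ~ s(X) + s(Y) + s(Z) + s(ID, bs = "re"),
>              data = dat, family = binomial)
> m.yz  <- update(m.xyz, . ~ . - s(X))              # refit without X
> summary(m.xyz)$dev.expl - summary(m.yz)$dev.expl  # share attributed to X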
>
>
>
> *Question 3.*
>
> In my models I usually get this warning message:
>
> Warning message:
> In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,  :
>   fitted probabilities numerically 0 or 1 occurred
>
>
>
> which seems to indicate that there is perfect separation in my logistic
> regression. I'm not sure this is the case in my data; how could I check for
> it, and correct for it if needed? Should it always be corrected?
>
>
>
> Thanks for your help,
>
> David
>
>
>



