[R-sig-ME] bam model selection with 3 million data

João Veríssimo
Tue Feb 4 13:46:38 CET 2020


If it helps: the compareML() help says that "model comparison is only
implemented for the methods GCV, fREML, REML, and ML."
(so GCV seems possible.)

In the following vignette, it is additionally stated that:
"For testing the difference in fixed effects predictors the method
fREML does not provide the most reliable test. Rather use ML. However,
ML takes longer to run and penalizes wigglyness more."

https://cran.r-project.org/web/packages/itsadug/vignettes/test.html
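
For illustration, a minimal sketch of such a comparison (the data,
formula, and model names are hypothetical; assumes the mgcv and
itsadug packages):

    library(mgcv)
    library(itsadug)
    # fit nested models with method = "ML", as the vignette recommends
    # for testing differences in fixed-effects predictors
    m_full    <- bam(y ~ s(x) + s(z), data = dat, method = "ML")
    m_reduced <- bam(y ~ s(x),        data = dat, method = "ML")
    compareML(m_reduced, m_full)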

João

On Tue, 2020-02-04 at 13:30 +0100, Cesko Voeten wrote:
> Hi David,
> 
> I've never used compareML myself, but after taking a quick look at
> the code it seems that it's just doing a likelihood-ratio test, which
> means that the same caveats apply. So if my understanding of this
> function is correct, it can be used with gam models fitted using
> outer iteration with method="ML" or method="REML" (which are based on
> penalized (restricted) maximum likelihood), but not with gam models
> fitted using GCV/UBRE/Cp (which are not likelihood-based), gam models
> fitted using non-default optimizers (these use PQL), or bam models
> (PQL). What I mean by this is that, while the numerical comparison of
> the optimized log-likelihood values may very well work for selecting
> the best model, the penalized quasi-likelihood is not a true
> likelihood, and hence one cannot fall back on Wilks's theorem that
> the difference in log-likelihoods is chi-square-distributed.
> Analogous reasoning applies to AIC and BIC. But that doesn't mean that
> such comparisons are useless -- remember that all models are wrong,
> but some are useful.
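> 
> For concreteness, a minimal sketch of a generic likelihood-ratio test
> (the model names are hypothetical; per the caveats above, this is
> only valid for nested gam fits estimated with method="ML"):
> 
>     stat    <- 2 * as.numeric(logLik(m_full) - logLik(m_reduced))
>     df_diff <- attr(logLik(m_full), "df") - attr(logLik(m_reduced), "df")
>     pchisq(stat, df_diff, lower.tail = FALSE)  # Wilks: approx. chi-square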
> 
> Note that the only thing select=TRUE does is enable additional
> penalization of the smooths; it is up to you to determine how much
> shrinkage you deem enough to warrant removing a term from the model.
> So if you feel that your one main effect is important to you, you can
> always choose to leave it in. However, if you go this route, might I
> suggest taking a look at the parsimonious mixed models paper (
> https://arxiv.org/abs/1506.04967), where the bottom line is: if you
> have reasons to expect a term to be important or unimportant, why
> even bother with selection procedures and why not just fit the model
> that you believe represents your data the best? (In fact, I
> personally would only use stepwise-selection methods if I wasn't sure
> whether a term is or is not important, particularly with respect to
> achieving convergence...)
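> 
> As a minimal sketch of what I mean (the formula and data are
> hypothetical):
> 
>     library(mgcv)
>     m <- bam(used ~ s(x) + s(y) + s(z) + s(id, bs = "re"),
>              data = dat, family = binomial, select = TRUE)
>     summary(m)  # smooths shrunk to (near) zero edf are removal candidates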
> 
> Best,
> Cesko
> 
> > On 4-2-2020 at 12:30, David Villegas Ríos wrote:
> > Thanks Cesko,
> > I also tried the compareML function (itsadug library) to compare
> > 1) the model with all terms included (the best model according to
> > buildbam) and 2) the model with one main effect removed (the best
> > according to the select=TRUE procedure), and compareML prefers the
> > full model (same as buildbam) when both models 1) and 2) are fit
> > using select=FALSE.
> > Just in case you have experience with compareML too :)
> > David
> > 
> > 
> > 
> > On Tue, 4 Feb 2020 at 11:27, Cesko Voeten
> > (c.c.voeten at hum.leidenuniv.nl) wrote:
> > 
> >     Hi David,
> > 
> >     I wrote that package, so yes, I am very familiar with it :) :)
> > 
> >     buildbam is based on the explained deviance. A term is included
> > if the explained deviance with the term is higher than the
> > explained deviance without it. This means that buildbam will favor
> > models that contain (too?) many effects, as effects that model only
> > noise but still explain perhaps 0.0001% more deviance due to chance
> > will still be included. To be honest with you, I have come to
> > believe that the buildbam function was a mistake in the first
> > place. I had coded buildgam already and buildbam was an easy
> > extension, and it was only later that I realized that bam using PQL
> > meant that only the explained deviance would be a valid model-
> > comparison criterion. But note that the explained deviance is not
> > actually a _formal_ criterion, like the likelihood-ratio test is
> > (which uses a well-known result that differences in log-likelihood
> > are asymptotically chi-square-distributed) or like AIC or BIC
> > (which are respectively based on information theory and on Bayesian
> > inference). I had
> >     wanted to remove buildbam for a long time, but that would be
> > wrong of me to do, because now that it's out there people may be
> > using it, and given the documented caveat that it can only use the
> > explained deviance, they may actually have a use case for it. Your
> > message makes me think very seriously about officially deprecating
> > this function in buildmer's next release, or at least issuing a
> > warning that explained deviance is not a formal criterion and will
> > favor overfitted models. In fact, I should deprecate the 'deviance'
> > criterion as well -- I had only put it in for buildbam's use.
> > 
> >     If buildbam and bam(select=TRUE) conflict, I would definitely
> > trust bam(select=TRUE) over buildbam, as mgcv's automatic
> > smoothness selection is derived from first principles, extending in
> > a very straightforward way the approach already used to fit the
> > smooth terms. On the other hand, the explained deviance is
> > completely ad hoc and has no formal justification. I'll just bite
> > the bullet and deprecate buildbam, advising people to use buildgam
> > or select=TRUE instead.
> > 
> >     Thanks for helping me bite that bullet,
> > 
> >     Cesko
> > 
> >     On 4-2-2020 at 10:51, David Villegas Ríos wrote:
> >      > Thanks Cesko.
> >      > I tried your suggestion (using the select argument) and got
> > to a reasonable optimal model in which I keep all the main effects
> > and interactions except for one main effect.
> >      > I tried, however, to do model selection using the buildbam
> > function (buildmer library) and I get a different optimal model,
> > where all effects are highly significant. I'm not sure why.
> > According to the buildmer package documentation:
> >      >
> >      >   "As bam uses PQL, only crit='deviance' is supported."
> >      > where
> >      > crit=Character string or vector determining the criterion
> > used to test terms for elimination. Possible options are 'LRT'
> > (likelihood-ratio test; this is the default), 'LL' (use the raw -2
> > log likelihood), 'AIC' (Akaike Information Criterion), 'BIC'
> > (Bayesian Information Criterion), and 'deviance' (explained
> > deviance – note that this is not a formal test)
> >      >
> >      > I'm not sure if you are familiar with the buildmer package,
> > though.
> >      >
> >      > Thanks,
> >      >
> >      > David
> >      >
> >      > On Mon, 3 Feb 2020 at 13:09, Voeten, C.C.
> > (c.c.voeten at hum.leidenuniv.nl) wrote:
> >      >
> >      >     Hi David,
> >      >
> >      >     Please keep the mailing list in cc.
> >      >
> >      >     Yes, gam() can indeed be prohibitively slow, which is why
> > I suggested the select=TRUE approach. I hope the explanation below
> > helps.
> >      >
> >      >     Smooth terms are composed of a null space and a range
> > space, which respectively represent the completely smooth part of
> > the effect and the wiggly part of the effect. In the mathematical
> > representation, these are equivalent to fixed effects and random
> > effects, respectively. As such, the range space is subject to
> > penalization and can shrink to zero. In GAM terms, this would
> > correspond to a smoothing parameter tending to infinity, resulting
> > in a completely smooth effect: you would be left only with the null
> > space. What select=TRUE does is add an additional penalty to all
> > null spaces, so that these too can shrink towards zero. Effects
> > which are shrunk to (near) zero in this way can then be removed
> > from the model by the user.
> >      >
> >      >     The degree of shrinkage can be seen by looking at the
> > edf. Smooths that are (near) zero will also use (near) zero edf. In
> > practice, when I do model selection using select=TRUE, I always
> > remove terms whose edf < 1. Then I fit the model again with
> > select=TRUE to ensure that no further reduction is necessary, and
> > then I fit a final model with select=FALSE.
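> >      >
> >      >     A minimal sketch of that workflow (the formula and data
> > are hypothetical):
> >      >
> >      >         library(mgcv)
> >      >         m1 <- bam(y ~ s(x) + s(z) + s(w), data = dat, select = TRUE)
> >      >         summary(m1)  # suppose s(w) is shrunk to edf < 1
> >      >         m2 <- bam(y ~ s(x) + s(z), data = dat, select = TRUE)
> >      >         summary(m2)  # confirm no further reduction is needed
> >      >         m3 <- bam(y ~ s(x) + s(z), data = dat, select = FALSE)  # final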
> >      >
> >      >     Best,
> >      >     Cesko
> >      >
> >      >     -----Original message-----
> >      >     From: David Villegas Ríos <chirleu at gmail.com>
> >      >     Sent: Monday, 3 February 2020 11:28
> >      >     To: Voeten, C.C. <c.c.voeten at hum.leidenuniv.nl>
> >      >     Subject: Re: [R-sig-ME] bam model selection with 3
> > million data
> >      >
> >      >     Thanks Cesko, really appreciate your answer.
> >      >     In my particular case, however, I cannot run gam models
> > (they take forever to run), so I'm still struggling with how to
> > perform model selection in bam. I tried select=TRUE and the summary
> > changed a bit, but I don't really know what this argument is doing
> > behind the scenes, or whether I should trust the summary from the
> > model using select=TRUE rather than the default select=FALSE.
> >      >     Best wishes,
> >      >     David
> >      >
> >      >     On Sat, 1 Feb 2020 at 20:39, Voeten, C.C.
> > (c.c.voeten at hum.leidenuniv.nl) wrote:
> >      >     Hi David,
> >      >
> >      >     1) You cannot perform likelihood-based model comparisons
> > with bam models, or -- for completeness' sake -- with gam models
> > that were fitted using performance iteration or the EFS optimizer.
> > All of these are based on PQL (penalized quasi-likelihood), which
> > makes the log-likelihood (and hence LRT, AIC, BIC, etc) invalid for
> > comparison purposes. See Wood (2017:149-151). gam() with the
> > default outer iteration should be fine, though. Have you tried
> > fitting your full model using bam with the select=TRUE argument to
> > turn on mgcv's automatic smooth-term selection?
> >      >
> >      >     2) I am unsure whether the deviance explained is
> > suitable for indicating effect size, so I can't comment on this
> > question. I might, however, have an alternative suggestion: have
> > you considered partial eta squared or partial omega squared? You
> > should be able to calculate those based on the ANOVA table.
> >      >
> >      >     3) I agree with you that the warning suggests complete
> > separation, but in my experience this doesn't automatically have to
> > be a problem. Have you checked the summary for extremely large beta
> > values, and also have you run gam.check() to see if your fit looks
> > reasonable? If neither indicates a problem I wouldn't be too
> > concerned about it.
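> >      >
> >      >     For instance, a minimal sketch of those checks (the model
> > name m is hypothetical):
> >      >
> >      >         # proportion of fitted probabilities numerically 0 or 1
> >      >         mean(fitted(m) < 1e-8 | fitted(m) > 1 - 1e-8)
> >      >         summary(m)    # any extremely large parametric estimates?
> >      >         gam.check(m)  # residual and basis-dimension diagnostics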
> >      >
> >      >     Hope this helps,
> >      >
> >      >     Cesko
> >      >
> >      >     P.S.: please send messages in plain text only; as you
> > can see, the formatting of your message was slightly screwed up
> > because the mailing list automatically strips HTML markup.
> >      >
> >      >     -----Original message-----
> >      >     From: R-sig-mixed-models
> > <r-sig-mixed-models-bounces at r-project.org> On behalf of David
> > Villegas Ríos
> >      >     Sent: Saturday, 1 February 2020 19:57
> >      >     To: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
> >      >     Subject: [R-sig-ME] bam model selection with 3 million
> > data
> >      >
> >      >     Dear list,
> >      >
> >      >     I'm investigating the effect of three variables (X, Y,
> > Z) on the probability that an animal uses a particular habitat A. I
> > have a time series of relocations for each animal (>300
> > individuals), with one relocation every 30 minutes. There are only
> > two options for the response variable: 1=present in habitat A,
> > 0=not present in habitat A. The effects of the three variables are
> > expected to be non-linear, so I'm using gam models. My dataset is
> > very large, with >3 million data points, so I'm using the bam
> > function from the mgcv library in R. In my models I include a
> > random effect “individual ID” and a temporal autocorrelation term
> > that corrects much but not all of the autocorrelation in the
> > models.
> >      >
> >      >     *Question 1.*
> >      >
> >      >     When I run a model with the three main effects (X, Y, Z)
> > and the three two-way interactions (X:Y, X:Z, Y:Z), I get that all
> > terms are highly significant, except for one interaction. If I
> > remove it, then everything is highly significant. However, I also
> > wanted to run simpler models with only one interaction, no
> > interactions, only two main effects and only one main effect. Then,
> > if I compare all these models with AIC or BIC, I get that the best
> > model (by far) is the one with only main effects.
> >      >
> >      >         AIC(codcoaAR2, codcoaAR2.1, codcoaAR2.2, codcoaAR2.3,
> >      >             codcoaAR2.4, codcoaAR2.5, codcoaAR2.6, codcoaAR2.7,
> >      >             codcoaAR2.8, codcoaAR2.9, codcoaAR)
> >      >
> >      >                           df      AIC
> >      >         codcoaAR2   306.1310 -1442543
> >      >         codcoaAR2.1 293.1608 -1440642
> >      >         codcoaAR2.2 292.9615 -1438219
> >      >         codcoaAR2.3 294.3657 -1435346
> >      >         codcoaAR2.4 284.0026 -1434286
> >      >         codcoaAR2.5 280.3472 -1396765
> >      >         codcoaAR2.6 279.6380 -1435862
> >      >         codcoaAR2.7 269.4968 -1377806
> >      >         codcoaAR2.8 269.0480 -1393897
> >      >         codcoaAR2.9 281.8584 -1214270
> >      >         codcoaAR    271.7066 -2353481  # model with only main effects
> >      >
> >      >
> >      >
> >      >     I wonder how this is possible if two of the interactions
> > are highly significant.
> >      >
> >      >     So my underlying question is: *for a model like this, in
> > which the sample size is huge, should I perform model selection by
> > looking at the significance of the different terms in the model, or
> > should I rather look at AIC/BIC?*
> >      >
> >      >     *Question 2.*
> >      >
> >      >     Let's assume the model with only main effects is indeed
> > the optimal one.
> >      >     Then I'd like to get the effect size of each explanatory
> > variable. It's not clear to me how to do this even after reading
> > some posts on this and other forums, but I tried to figure it out
> > by sequentially running the model without one explanatory variable
> > at a time, and then comparing the deviance explained by the optimal
> > model with X, Y, Z against the deviance explained by the reduced
> > model with only Y and Z, for instance, assuming that the difference
> > would be the deviance explained by X. *Is this correct?* Looking at
> > the results, the deviance explained by each variable X, Y, Z is
> > quite low, but if the three main effects explain so little
> > variance, what is explaining the rest?
> >      >
> >      >         Model    Deviance explained
> >      >         X, Z, Y  69.3%
> >      >         Y, Z     68.5%
> >      >         X, Z     69.3%
> >      >         X, Y     60.5%
> >      >
> >      >
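> >      >     (A minimal sketch of how I computed these, with
> > hypothetical model names:
> >      >
> >      >         summary(m_xzy)$dev.expl - summary(m_yz)$dev.expl  # X's share
> >      >
> >      >     where m_xzy and m_yz are the fits with and without X.)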
> >      >
> >      >     *Question 3.*
> >      >
> >      >     In my models I usually get this warning message:
> >      >
> >      >         Warning message:
> >      >         In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho,
> >      >           coef = coef, :
> >      >           fitted probabilities numerically 0 or 1 occurred
> >      >
> >      >
> >      >
> >      >     which seems to indicate that there is perfect separation
> > in my logistic regression. I'm not sure this is the case in my
> > data; how could I check it, and correct for it if needed? Should it
> > always be corrected?
> >      >
> >      >
> >      >
> >      >     Thanks for your help,
> >      >
> >      >     David
> >      >
> > 
> 
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


