[R-sig-ME] Using nlme package in an information-theoretic framework, AIC

John Maindonald john.maindonald at anu.edu.au
Tue Jan 4 22:01:59 CET 2011


I welcome this expression of scepticism re AIC or AICc for multi-level models.

The Kullback-Leibler divergence that underlies one derivation of AIC has
to be computed by integrating over some reference distribution.  This will be,
ideally, the distribution in the "population" where results will be applied -
the "target" population.  If there are 6 plots per block in the data, and 3 
plots per block in the "target" population, the AIC that is relevant is the
one that is calculated with respect to the target population.  Thus, there
are as many different AICs as there are potentially different target
populations.  The software solves the problem by pretending that the
target population has the same distributional structure as that estimated
for the data used for the analysis.  I'm not aware of anyone anywhere
who actually says this in a paper, but it ought to be said with more
authority than my voice carries here!
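
To make that concrete, here is a minimal simulation sketch (all object
names hypothetical; the target-population evaluation is a crude stand-in
that deliberately ignores within-block correlation).  The AIC computed
from the training data implicitly assumes new data with the same
6-plots-per-block structure, while an out-of-sample deviance computed on
a 3-plots-per-block "target" sample answers a different question.

library(nlme)

set.seed(1)
sim_blocks <- function(n_block, plots_per_block, sd_block = 1, sd_resid = 1) {
  block <- factor(rep(seq_len(n_block), each = plots_per_block))
  x <- runif(n_block * plots_per_block)
  y <- 2 + 1.5 * x + rnorm(n_block, sd = sd_block)[block] +
       rnorm(length(x), sd = sd_resid)
  data.frame(y, x, block)
}

train  <- sim_blocks(n_block = 20, plots_per_block = 6)   # the data in hand
target <- sim_blocks(n_block = 20, plots_per_block = 3)   # the "target" population

fit <- lme(y ~ x, random = ~ 1 | block, data = train, method = "ML")
AIC(fit)   # criterion with respect to the training-data structure

## crude target-population evaluation: marginal (population-level) deviance
## on the target sample, treating observations as independent
mu  <- predict(fit, newdata = target, level = 0)
sig <- sqrt(as.numeric(VarCorr(fit)["(Intercept)", "Variance"]) + fit$sigma^2)
-2 * sum(dnorm(target$y, mean = mu, sd = sig, log = TRUE))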

Vaida and Blanchard address one special case, where the target
population is at one extreme of the range of possibilities.

One benefit of the Bayesian approach is that plots-per-block types
of issues arise as part of the wider question of what is an appropriate
distribution for the prior.

Actually, this is a wider issue than AIC.  Without some prior distributional
assumptions, t-like statistics either have to be conservative, or may not
be possible at all.  The Behrens-Fisher 'problem', which is really a
problem with a multi-level model, is one small facet of this.
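
A tiny concrete reminder of that facet, with made-up numbers: the
two-sample t statistic with unequal variances has no exact reference
distribution, and Welch's approximate degrees of freedom are the usual
compromise.

set.seed(2)
x <- rnorm(8,  mean = 0, sd = 1)
y <- rnorm(40, mean = 0, sd = 4)
t.test(x, y, var.equal = TRUE)    # pooled t: exact only if the variances really are equal
t.test(x, y, var.equal = FALSE)   # Welch/Satterthwaite approximation (R's default)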

Think of possible training/test set differences when models are evaluated
on test data.  If the training data are the data you have, and the test data
represent the target population more accurately (maybe they came along
later, after there'd been an initial analysis), then an evaluation based on
the training data (AIC or whatever) is an evaluation with respect to an
empirical distribution that differs from that of the target population.
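
A small sketch of that point (simulated data, ordinary linear models
only, names hypothetical): the AIC ranking answers "which model predicts
best for data like the training sample", which need not agree with the
ranking on later data from a somewhat different distribution.

set.seed(3)
train2 <- data.frame(x = runif(40, 0, 1))
train2$y <- sin(2 * train2$x) + rnorm(40, sd = 0.3)
test2 <- data.frame(x = runif(200, 0.5, 1.5))        # later data, shifted range
test2$y <- sin(2 * test2$x) + rnorm(200, sd = 0.3)

m1 <- lm(y ~ x, data = train2)
m2 <- lm(y ~ poly(x, 4), data = train2)
AIC(m1, m2)                                # ranking w.r.t. the training sample

loglik_new <- function(m, newdata)         # log-likelihood on the later sample
  sum(dnorm(newdata$y, mean = predict(m, newdata),
            sd = summary(m)$sigma, log = TRUE))
c(m1 = loglik_new(m1, test2), m2 = loglik_new(m2, test2))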

These sorts of issues are very widespread in statistics, and ought 
accordingly to get a lot more airing than they do.


John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm

On 05/01/2011, at 8:13 AM, Ben Bolker wrote:

> On 11-01-04 01:40 PM, Corey VanStratt wrote:
>> Hi Ben,
>> 
>> Thanks for the response; good to know that there are not any
>> immediate procedural issues.
>> 
>> I've looked back through some of the threads and personal
>> correspondences, and I think the main issue raised was that it can be
>> challenging to count the 'degrees of freedom'. For example, from a
>> personal communication: "A second point is that you should proceed a
>> bit cautiously using the AIC that's being outputted in R.  The reason
>> for this is that it's the fixed-effects version of AIC, and what
>> you're wanting is a random-effects version...  The fixed-effects
>> version only adds a penalty of 1 parameter for each type of random
>> effect, which is an underestimate of real complexity." However, from
>> your response, I may not have to worry about this given that I'm
>> comparing models with the same random effects?
> 
>  Yes, although you still need to count degrees of freedom ascribed to
> random effects in order to determine your 'residual degrees of freedom'
> for AICc; this leads to an even murkier corner of the literature,
> because with blocked and/or correlated data it's also not really clear
> (as I mentioned obliquely) how to define your 'total number of data
> points' either.  (AICc was defined for linear models in the first place,
> and has been stretched way beyond its original derivation, although it's
> not clear to me which is the lesser of two evils -- pretending
> you have an unlimited amount of data or using a poorly supported
> finite-size correction ...)
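
For what it's worth, the finite-size correction itself is only one line;
the murkiness is entirely in what to plug in for n (959 hourly
observations? 65 ducks? something in between?) and for k (how many
parameters the random effects and the correlation range "cost").  A
sketch using the counts that lme itself reports:

aicc <- function(fit, n = fit$dims$N, k = attr(logLik(fit), "df")) {
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}
aicc(fm15)           # n = 959 one-hour observations, k as counted by logLik()
aicc(fm15, n = 65)   # n = number of individuals instead: a much harsher penalty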
>  If you were comparing models with different numbers of random effects,
> I think Vaida and Blanchard's (2005?) paper on conditional AIC actually
> gives a pretty strong justification that, if you're interested in the
> population level (and not predictions for the individual random-effects
> units), counting variance parameters rather than something more
> complicated is the right way to go.
>   If this were really important and you were willing to go to a lot of
> trouble to make sure the answer was as correct as modern statistical
> technology could make it, I would probably try some sort of
> cross-validation approach.
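
One version of that, sketched with the object names used later in this
thread (data1, ID, MINh, exp2; the function name is hypothetical), would
be leave-one-duck-out cross-validation scored on population-level
predictions; convergence failures on some refits are likely and would
need handling.

cv_rmse <- function(fixed, data) {
  errs <- sapply(levels(factor(data$ID)), function(id) {
    hold <- data$ID == id
    fit  <- lme(fixed, data = data[!hold, ],
                random = ~ Date + DateSq | ID, corr = exp2, method = "ML")
    pred <- predict(fit, newdata = data[hold, ], level = 0)  # population-level
    mean((data$MINh[hold] - pred)^2)
  })
  sqrt(mean(errs))
}
## e.g. cv_rmse(MINh ~ (Date + DateSq) * Cohort + Tide + Temp + factor(Sea) + Time.start, data1)
## repeated for each candidate model in the set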
>   If it makes you feel any better, I think the problem with AIC[c] in
> this context is not that nlme does something wrong, but that nobody
> really knows what the right answer is.  (That might make you feel worse
> rather than better.)
> 
>> And to answer your other question: yes, we are using multi-model
>> inference.  We determined parameter likelihood values by summing the
>> weights of all models that a particular parameter of interest
>> occurred in.  We calculated model averaged parameter estimates and
>> unconditional standard errors for all predictors and we used these
>> values as our prediction values in the models. 
> 
>  Multi-model inference is definitely better than model selection, but
> there are a number of things I don't like about this approach (but again
> be aware that I am probably in the minority).
> 
>   1. "Parameter likelihood values" are not in themselves ridiculous,
> but they tend to get overinterpreted.  In my opinion (I'd be happy to be
> corrected) they don't have a coherent statistical meaning; you could
> consider them a quasi-pseudo-Bayesian estimate of the probability that a
> parameter 'belongs in the true model', but I don't even want to list how
> many problems there are with that definition ...  How are you
> interpreting them?
> 
>  2. It's tricky to combine parameters from different models whenever
> the parameters potentially have different meanings in the different
> models (e.g. when there are interactions in some of the models and not
> in others, or when some of the predictors are collinear).  It's best to
> compute model averages for things that have real biological meaning and
> interest, such as (in your case) foraging rates.  If you want to compute
> model-averaged effects of different environmental conditions on foraging
> rates, that may indeed mean that you want to model-average parameters,
> but you should be careful that the model parameters are really
> interpretable in similar ways across the models you're considering.
>  Just don't take your model-averaged parameters and attribute
> "significance" to whether the model-averaged confidence intervals
> overlap zero, because then you are truly doing hypothesis testing by the
> back door ...
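
A sketch of that suggestion, averaging predictions rather than
coefficients ('fits' being a hypothetical named list of the fitted
candidate models and 'newdat' a data frame of covariate values of
biological interest):

ic <- sapply(fits, AIC)              # or AICc values, once n and k are settled on
w  <- exp(-0.5 * (ic - min(ic)))
w  <- w / sum(w)                     # Akaike weights

preds   <- sapply(fits, predict, newdata = newdat, level = 0)  # population-level
avg_hat <- as.vector(preds %*% w)    # model-averaged predicted foraging effort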
> 
>> Ultimately, we'll be
>> comparing rates of foraging at this site to other sites along the
>> wintering latitudinal cline (these ducks winter from Alaska down to
>> Baja), but we're proceeding with the understanding that as far as
>> overall population inference goes, we're limited in our inference due
>> to limited sites (n=5).
> 
>  Is there geographic structure to your sites?  You might squeeze out a
> tiny bit more information by incorporating this in the model (rather
> than 'just' the exponential spatial correlation you're currently using).
> 
>> Thanks again for commenting on the overall approach that I've used
>> thus far.
>> 
>> Cheers, Corey
>> 
>> 
>> Message: 3
>> Date: Mon, 03 Jan 2011 20:34:26 -0500
>> From: Ben Bolker <bbolker at gmail.com>
>> To: r-sig-mixed-models at r-project.org
>> Subject: Re: [R-sig-ME] Using nlme package in an information-theoretic framework, AIC
>> Message-ID: <4D227922.6030206 at gmail.com>
>> Content-Type: text/plain; charset=UTF-8
>> 
>> On 11-01-03 06:51 PM, Corey VanStratt wrote:
>>> Dear Listers,
>>> 
>>> I am studying the foraging effort of surf scoters (a sea duck)
>>> during the winter period at a site in southeast Alaska.  This
>>> particular species only dives for its food, and therefore the
>>> response of interest is both minutes underwater per hour and
>>> minutes underwater per day.  These 2 metrics provide a reliable
>>> estimate of how often these birds are feeding during the winter,
>>> and they can be compared to estimates collected in the same manner
>>> at different winter sites. To collect these data, we attached VHF
>>> transmitters to 110 individuals; we monitored individuals during
>>> 1-hour foraging observations; we knew a bird was diving when a
>>> signal was lost, as the VHF radio signal does not transmit through
>>> water.  For each observation, we calculated total minutes
>>> underwater/hour and extrapolated to minutes underwater/day (these
>>> ducks are diurnal foragers, so we based daily rates on the number of
>>> daylight hours).  Of the 110 individuals marked, we were able to
>>> get at least 2 observations on 65 individuals. Over 2 seasons,
>>> there were a total of 959 1-hour observations on these 65
>>> individuals.
>>> 
>>> The data set was unbalanced and we observed the same individuals
>>> over time (repeated measures); the number of observation periods was
>>> not equal among individuals, and the time between observations within
>>> and among individuals was not equal.  We therefore required a mixed
>>> modeling procedure to account for these repeated measures, and we
>>> used the nlme package in R to specify individual as a random effect.
>>> 
>>> Foraging effort variation at all sites was evaluated using an 
>>> information-theoretic approach.  Therefore, rather than 'model 
>>> building', as is described in much of the nlme documentation and 
>>> resources, I am testing several pre-determined, biologically 
>>> plausible models. However, I have read in prior posts that AIC
>>> values may not be reliable from this package, and I have been using
>>> these values to compare the models. So, I have three main
>>> questions:
>>> 
>>> 1.) Does the process I describe below to analyze these data make 
>>> sense?
>>> 
>>> 2.) In the case of this data set and the candidate model set, is
>>> it appropriate to use the given AIC values from the nlme output
>>> for determining which model is most parsimonious?
>>> 
>>> AND
>>> 
>>> 3.) If it is not appropriate to use these AIC values, what is the 
>>> best alternative (e.g. how could I obtain AIC values that are 
>>> reliable)?  So, moving on to the process:
>>> 
>>> We evaluated the variation in response of both minutes underwater
>>> per hr and per day in relation to several covariates.  We
>>> considered 16 biologically plausible a priori candidate models (see
>>> below).  We explored date as an explanatory variable, and we
>>> included both date and date-squared terms (referred to as DATE)
>>> to allow foraging effort to vary non-linearly
>>> over time; we adjusted all date values in both years to correspond
>>> to 2 December = day 1.  We always considered sex (male or female)
>>> and age [hatch year (HY) or after hatch year (AHY)] together as
>>> COHORT, as these four groups often behave differently.  We included
>>> a YEAR term (2008/09 or 2009/10) to account for variation between
>>> years.  We considered the following environmental factors (ENVIRON)
>>> in each model except for the true null: tide height (m),
>>> temperature (°C), time of day of observation, and wind speed
>>> (categorical with three levels: low, mid, high).  We grouped these
>>> both to limit the number of models in our set and because we merely
>>> wanted to account for environmental effects.  Finally, we included
>>> the interaction effect terms COHORT*DATE and COHORT*YEAR, because
>>> we believed it quite feasible that the different cohorts would
>>> respond to environmental pressures differently both across time
>>> within a winter season and between the winter seasons.
>>> 
>>> MODEL   EXPLANATORY VARIABLES
>> 
>> fmN    Null
>> fm1    Environ
>> fm2    Date + environ
>> fm3    Cohort + environ
>> fm4    Year + environ
>> fm5    Date + cohort + environ
>> fm6    Date + year + environ
>> fm7    Cohort + year + environ
>> fm8    Date + cohort + year + environ
>> fm9    Date + environ + cohort x date
>> fm10   Year + environ + cohort x date
>> fm11   Date + cohort + environ + cohort x date
>> fm12   Date + cohort + year + environ + cohort x date
>> fm13   Cohort + environ + cohort x year
>> fm14   Date + cohort + environ + cohort x year
>> fm15   Date + cohort + year + environ + cohort x date + cohort x year
>> 
>>> For each response variable, we first ran the most global model
>>> (fm15) in our set using a model with no random effects, a model
>>> with a random intercept (allowed intercept to vary among
>>> individuals), and a model with a random intercept and slope
>>> (allowed both intercept to vary among individuals and slope to vary
>>> within each individual over DATE).  'ID' in the random effects
>>> refers to each of the 65 uniquely identified ducks.  Using an AIC
>>> approach, we found that the random intercept and slope model was
>>> the most parsimonious.  Proceeding with this global model and
>>> continuing with an AIC information-theoretic approach, we ran the
>>> global model (inc. the random intercept/slope) against several
>>> different feasible covariance structures, and found that for both
>>> response variables (minutes underwater hr-1 and day-1), an
>>> exponential covariance structure was the best-fitting based on AIC 
>>> values.  Therefore, we applied this covariance structure to all of 
>>> the models in the candidate model set and proceeded with the
>>> overall model selection process.  The final, most heavily
>>> parameterized model, as specified for R, is as follows:
>>> 
>>> fm15 <- lme(fixed = MINh ~ ((Date + DateSq) * Cohort) + (Cohort * Year) +
>>>                     Tide + Temp + factor(Sea) + Time.start,
>>>             data = data1, random = ~ Date + DateSq | ID,
>>>             corr = exp2, method = "ML")
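
A sketch of the structure-selection step described above, using the same
object names from the thread (data1, ID, MINh; g0, g1, g2 etc. are
hypothetical).  REML is often recommended when comparing random-effects
or correlation structures that share the same fixed effects, but the
comparison below mirrors the ML fits used in the thread.

fixed15 <- MINh ~ ((Date + DateSq) * Cohort) + (Cohort * Year) + Tide +
  Temp + factor(Sea) + Time.start

g0 <- gls(fixed15, data = data1, method = "ML")                       # no random effects
g1 <- lme(fixed15, data = data1, random = ~ 1 | ID, method = "ML")    # random intercept
g2 <- lme(fixed15, data = data1, random = ~ Date + DateSq | ID,
          method = "ML")                                              # intercept + slopes
AIC(g0, g1, g2)

g2exp <- update(g2, corr = corExp(form = ~ Date | ID))                # as in fm15
g2ar1 <- update(g2, corr = corAR1(form = ~ 1 | ID))                   # one alternative
AIC(g2, g2exp, g2ar1)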
>>> 
>>> We calculated Akaike's Information Criterion (AICc), adjusted for
>>> small sample size, for all of the models in the candidate model
>>> set from the AIC values given in the lme output.  We then examined
>>> relative support for each model using ΔAICc and AICc weights
>>> (wi).
>>> 
>>> So, again, being relatively new to both R and mixed modeling, I
>>> would be much appreciative if those with a greater understanding of
>>> the process could inform me if the process I have used indeed is 
>>> appropriate/makes sense (as I have not used a model building
>>> exercise that is often described when using this package), and if
>>> the AIC values can in fact be used with relative confidence.
>>> 
>>> Thanks so much for any insight, Corey VanStratt
>>> 
>>> 
>> Example of output from fm15:
>> 
>> Linear mixed-effects model fit by maximum likelihood
>>  Data: data1
>>        AIC      BIC    logLik
>>   5723.309 5835.224 -2838.654
>> 
>> Random effects:
>>  Formula: ~DateB + DateSq | ID
>>  Structure: General positive-definite, Log-Cholesky parametrization
>>             StdDev      Corr
>> (Intercept) 3.500360472 (Intr)  DateB
>> DateB       0.175541462 -0.680
>> DateSq      0.001982993  0.576  -0.967
>> Residual    4.185115132
>> 
>> Correlation Structure: Exponential spatial correlation
>>  Formula: ~DateB | ID
>>  Parameter estimate(s):
>>    range
>> 0.674106
>> 
>> Fixed effects: MINh ~ DateB + DateSq + Cohort + (Cohort):(Season) + Tide +
>>                Temp + factor(Sea) + Time.start
>>                        Value Std.Error  DF   t-value p-value
>> (Intercept)         9.972296  2.624219 887  3.800100  0.0002
>> DateB               0.025724  0.031821 887  0.808423  0.4191
>> DateSq              0.000128  0.000336 887  0.379635  0.7043
>> CohortAHY M        -7.337342  3.060351  57 -2.397549  0.0198
>> CohortHY F          4.634536  4.220465  57  1.098110  0.2768
>> CohortHY M         -7.870449  3.346530  57 -2.351824  0.0222
>> Tide                0.157180  0.088498 887  1.776074  0.0761
>> Temp               -0.009083  0.039280 887 -0.231239  0.8172
>> factor(Sea)6        0.111870  2.194239 887  0.050983  0.9594
>> factor(Sea)7-8      3.294262  1.961388 887  1.679556  0.0934
>> Time.start          0.087318  0.068847 887  1.268298  0.2050
>> CohortAHY F:Season -1.925349  1.412226  57 -1.363344  0.1781
>> CohortAHY M:Season  1.804896  1.235439  57  1.460935  0.1495
>> CohortHY F:Season  -4.522203  1.914241  57 -2.362400  0.0216
>> CohortHY M:Season   2.323518  1.469479  57  1.581185  0.1194
>>  Correlation:
>> 
>> Standardized Within-Group Residuals:
>>        Min         Q1        Med         Q3        Max
>> -2.5466867 -0.6254431 -0.1261256  0.5150277  3.4233514
>> 
>> ===================================
>> 
>> As far as I can tell, this all looks pretty reasonable.
>> 
>> General questions/comments:
>> 
>> 
>> - what are your concerns about unreliable AIC values?  Can you quote
>>   a specific post to this list?  There are various concerns with AICs
>>   in the mixed model context, but I don't know that any of them seems
>>   particularly awful in the context you describe:
>>     - there is a general issue when testing against null hypotheses on
>>       the edge of their feasible space (e.g. testing whether variances
>>       are equal to zero or not), which also affects AICs (because the
>>       approximations on which AICs are based rely on expanding *around*
>>       the model)
>>     - it can be hard to count 'degrees of freedom' for random effects,
>>       both for comparing models with different random effects (here
>>       Vaida and Blanchard's 'conditional AIC' is useful, but I think
>>       you're interested in population-level predictions, so the
>>       relevant number of parameters would be approx. 1) and for
>>       calculating residual degrees of freedom for finite-size
>>       corrections such as AICc.  Your spatial correlation term also
>>       affects this consideration, because your data points are not
>>       entirely independent (and hence you have fewer 'effective'
>>       residual degrees of freedom).
>> - What kind of inference are you making *after* you go through the
>>   model selection procedure?  Are you planning to use multi-model
>>   inference?  The biggest mistake that you might be making (can't tell
>>   from your description) would be to take the selected model and
>>   calculate confidence intervals for it, disregarding the model
>>   selection uncertainty.  Or are you using the AICc values to
>>   (effectively) test hypotheses about which variables are 'important'?
>>   I have a bit of a problem with this approach (I think if you're going
>>   to test hypotheses you should do it explicitly, not use an approach
>>   that is based on maximizing expected predictive accuracy), but I'm
>>   quite in the minority in that opinion.
>> 
>> Bottom line: I would strongly encourage you to supplement what
>> you've done by exploring the data, and the predictions/residuals from
>> the model graphically, and I am curious what kind of inferences you
>> plan to make from these data, but overall your approach seems to make
>> sense -- no big red flags going up for me.
>> 
>> Others may see something I missed, or have different opinions.
>> 
>> Ben Bolker
>> 
>> 
>> 