[R-sig-ME] Maximal specification of random effects

Malcolm Fairbrother M.Fairbrother at bristol.ac.uk
Wed Jun 5 00:49:39 CEST 2013


Hi Han-Gyol,

I personally have yet to see any context in which it makes sense to
include a variable on *both* sides of the "|", so I for one am
sceptical of "(1 + item | subject) + (1 + trial + environment |
item)".

(Note that, for parsimony's sake, this is equivalent to "(item |
subject) + (trial + environment | item)", since random intercepts are
included by default.)

Others may disagree, but I don't see much reason to include random
intercepts for category, context, and batch (given the small number of
levels of each). If anything, I'd fit the model as:

outcome ~ trial * environment * category * context * batch + (1 | subject)

Check to see whether you need all the interactions -- maybe not. And
the model may not even converge, though with 6 * 80 = 480 observations
per subject, I think it has a good chance of doing so.
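
For concreteness, here is a minimal sketch of that fit. I'm assuming a
data frame called "dat", and since the outcome is binary
(correct/incorrect), I've used glmer() with a binomial family rather
than lmer():

library(lme4)

## All fixed-effect interactions, with only a random intercept by
## subject. "dat" and the binomial family are assumptions on my part.
m1 <- glmer(outcome ~ trial * environment * category * context * batch
            + (1 | subject),
            data = dat, family = binomial)
summary(m1)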

If you want to keep it maximal, my view would be:

outcome ~ trial * environment * category * context * batch + (trial *
category * context * batch | subject)

Though you may have to simplify further, again for convergence's sake.
Relevant substantive theory, setting REML to FALSE, and comparing
nested models with likelihood ratio tests should together help you
decide how complex a model to keep.
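
To illustrate the model-comparison step, a sketch (model names and
"dat" are placeholders; note that the REML argument applies only to
lmer() -- glmer() fits by maximum likelihood already, so for a binary
outcome there is nothing to set):

## Fit a simpler and a richer random-effects structure, then compare
## them with a likelihood ratio test. With lmer() you would add
## REML = FALSE to both calls.
m_simple <- glmer(outcome ~ trial * environment + (1 | subject),
                  data = dat, family = binomial)
m_slopes <- glmer(outcome ~ trial * environment +
                  (1 + category | subject),
                  data = dat, family = binomial)
anova(m_simple, m_slopes)  # LRT on the extra random-effect terms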

Cheers,
Malcolm




> Date: Tue, 4 Jun 2013 15:45:03 -0500
> From: Han-Gyol Yi <han.yi.query at gmail.com>
> To: r-sig-mixed-models at r-project.org
> Subject: [R-sig-ME] Maximal specification of random effects
>
> Hello,
>
> We have an experiment whose results have been submitted for review at a
> journal. We are using lmer to analyze the data. In specifying the random
> effects, it has been suggested that we "keep it maximal", per Barr et al.
> (2013). However, given our design, constructing such a model has not been
> trivial for me.
>
> The experiment is a learning experiment in which each subject is presented
> with a random succession of stimuli, which must be categorized. The
> subjects do not know the category structure beforehand. For instance, at
> trial number 1, subject #999 receives stimulus #32, which is in category
> X. At trial number 2, stimulus #41 is presented, which is in category Y,
> and so forth, for 480 trials per subject.
>
> There are four categories of stimuli in total (e.g., W,X,Y,Z). Each
> category is embedded in five contexts (e.g., O,P,Q,R,S), which in turn are
> from four separate batches (e.g., A,B,C,D). This is relevant because
> although the target of learning is the category, each instance will vary
> according to the context and batch it comes from. In other words, the
> subjects must be able to abstract the categorical information from the
> variance caused by each context and batch; otherwise they will not be
> successful in the learning task. This gives us 80 stimuli in total:
>
> WOA, WOB, WOC, WOD, WPA, WPB, ... ZSA, ZSB, ZSC, ZSD.
>
>
> Each random sequence of all 80 stimuli is repeated six times, and for each
> trial we have the correct/incorrect outcome of the subject's response.
>
> Additionally -- and most importantly -- each subject is assigned to only
> one of the two learning environments, making environment a between-subjects
> factor. What we are interested in is how the two learning environments
> affect the rate of learning and/or the final level of achievement across
> succeeding trials. Consequently, the data are constructed as follows, with
> columns delimited by commas:
>
> subject, environment, trial, category, context, batch, outcome
> ...
> 999, conducive, 180, W, R, C, incorrect
> 999, conducive, 181, X, O, D, correct
> ...
> 333, adverse, 4, Z, O, C, correct
> ...
>
>
> and so forth, with (n of subjects) x 480 rows in total.
>
> To recap, I have one between-subjects variable ("environment": conducive
> vs. adverse) and one within-subjects variable ("trial": 1 to 480,
> mean-centered to 0) that I am interested in as fixed effects. Category,
> context, and batch are random effects, and outcome is my dependent
> variable.
>
> The formula I have been using is this:
>
> outcome ~ trial * environment + (1 | category) + (1 | context) +
>     (1 | batch) + (1 | subject)
>
>
> However, per the suggestion, I want to specify a design-driven maximal
> model of random effects. One option I have considered is to treat each
> stimulus as one of 80 items in total, disregarding the systematic
> variation coming from context and batch, so something like the following:
>
> outcome ~ trial * environment + (1 + item | subject) +
>     (1 + trial + environment | item)
>
>
> I am not allowing the trial slope to vary by subject because I think that
> would be confounded with the between-subjects environment variable. This
> makes some sense to me, but I cannot be sure it is valid. Of course, if it
> is invalid, then I have four random effects, and specifying all
> permutations of them sounds either daunting or absurd to me.
>
> I am aware that the "keep it maximal" approach may not be everyone's
> favorite, but in this case I certainly want to consider such a perspective.
> My questions are as follows, given the limited information regarding our
> study design that I have provided here:
>
>    1. Is there a reason to discard either the data-driven or the
>    design-driven model specification approach?
>    2. If the design-driven approach is to be used, what would make "the
>    most sense" in terms of specifying a maximal model?
>    3. If the data-driven approach is to be used, how extensive should my
>    search be before I can conclude that all possibilities have been
>    exhausted and that I have the most complex model justified by the data?
>
> I appreciate your reading this long note. Please let me know if any
> necessary details are missing, or if I have unintentionally rephrased a
> question that has already been asked and answered here before.
>
> Best regards,
> Han-Gyol Yi


