Chapter 7 Random and Mixed Effects Models

In this chapter we use a new “philosophy”. Up to now, treatment effects (the \(\alpha_i\)’s) were fixed, unknown quantities that we tried to estimate. This means we were making a statement about a specific, fixed set of treatments (e.g., some specific fertilizers). Such models are also called fixed effects models.

7.1 Random Effects Models

7.1.1 One-Way ANOVA

Now we use another point of view: we consider situations where treatments are random samples from a large population of treatments. This might seem quite special at first sight, but it is actually very natural in many situations. Think for example of a random sample of school classes that were drawn from all school classes in a country. Another example could be machines that were randomly sampled from a large population of machines. Typically, we are interested in making a statement about some properties of the whole population and not of the observed individuals (here: school classes or machines).

Going further with the machine example: assume that every machine produces some samples whose quality we assess. We denote the quality of the \(j\)th sample on the \(i\)th machine with \(Y_{ij}\). We can model such data with the model \[\begin{equation} Y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \tag{7.1} \end{equation}\] where \(\alpha_i\) is the effect of the \(i\)th machine. As the machines were drawn randomly from a large population, we assume \[ \alpha_i \; \textrm{i.i.d.} \sim N(0, \sigma_{\alpha}^2). \] We also call \(\alpha_i\) a random effect. Hence, (7.1) is a so-called random effects model. For the error term we have the ususal assumption \(\epsilon_{ij}\) i.i.d. \(\sim N(0, \sigma^2)\). This looks very similar to the “old” fixed effects model (3.4) at first sight. However, note that the \(\alpha_i\)’s in Equation (7.1) are random variables (reflecting the sampling mechanism of the data) and not fixed unknown parameters. This “small” change will have a large impact on the properties of the model. In addition, we have a new parameter \(\sigma_{\alpha}^2\) which is the variance of the random effect (here the variance between different machines). Sometimes, such models are also called variance components models because of the different variances \(\sigma_{\alpha}^2\) and \(\sigma^2\).

Let us now inspect some properties of model (7.1). The expected value of \(Y_{ij}\) is \[ E[Y_{ij}] = \mu. \] For the variance of \(Y_{ij}\) we have \[ \Var(Y_{ij}) = \sigma_{\alpha}^2 + \sigma^2. \] For the correlation structure we get \[ \Cor(Y_{ij}, Y_{kl}) = \left\{ \begin{array}{cc} 0 & i \neq k \\ \sigma_{\alpha}^2 / (\sigma_{\alpha}^2 + \sigma^2) & i = k, j \neq l \\ 1 & i = k, j = l \end{array} \right. \] Observations from different machines are uncorrelated (here: independent) while observations from the same machine are correlated. We also call \(\sigma_{\alpha}^2 / (\sigma_{\alpha}^2 + \sigma^2)\) intraclass correlation (ICC). It is large if \(\sigma_{\alpha}^2 \gg \sigma^2\). A large value means that observations from the same “group” (here: machine) are very similar to each other.

Hence, values “sharing” the same \(\alpha_i\) are correlated, while in the fixed effects model (3.4) all values would be independent (because there the \(\alpha_i\)’s are parameters, that is fixed, unknown quantities).

Parameter estimation for the variance components \(\sigma_{\alpha}^2\) and \(\sigma^2\) is typically being done with a technique called restricted maximum likelihood (REML). We could also use “classical” maximum-likelihood estimators here, but REML estimates are less biased. The parameter \(\mu\) is estimated with maximum-likelihood assuming that the variances are known.

In R there are many packages that can fit such models. We will consider lme4 (Bates et al. 2020) and later also lmerTest (Kuznetsova, Bruun Brockhoff, and Haubo Bojesen Christensen 2020).

Let us now consider Exercise 5.1 from Kuehl (2000) about an inheritance study with beef animals. Five sires (male animals) were each mated to a separate group of dams (female animals). Birth weight of eight male calves (from different dams) in each of the five sire groups were recorded. We want to find out what part of the total variation is due to variation between different sires.

We first create the data set and visualize it.

## Create data set ####
weight <- c(61, 100,  56, 113,  99, 103,  75,  62,  ## sire 1
            75, 102,  95, 103,  98, 115,  98,  94,  ## sire 2
            58,  60,  60,  57,  57,  59,  54, 100,  ## sire 3
            57,  56,  67,  59,  58, 121, 101, 101,  ## sire 4
            59,  46, 120, 115, 115,  93, 105,  75)  ## sire 5
sire    <- factor(rep(1:5, each = 8))
animals <- data.frame(weight, sire)
str(animals)
## 'data.frame':    40 obs. of  2 variables:
##  $ weight: num  61 100 56 113 99 103 75 62 75 102 ...
##  $ sire  : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...
## Visualize data ####
stripchart(weight ~ sire, vertical = TRUE, pch = 1, xlab = "sire", data = animals)

At first sight it looks like the variation between different sires is rather small.

Now we fit the random effects model with the lmer function in package lme4. We want to have a random effect per sire. This can be specified with the notation (1 | sire) in the model formula. This means that the “granularity” of the random effect is specified after the vertical bar “|”. All observations sharing the same level of sire will get the same random effect \(\alpha_i\). The 1 means that we want to have a random intercept per sire. In the regression setup, we could also have a random slope for which we would use the notation (x | sire) for a continuous predictor x. However, this is not of relevance here.

library(lme4)
fit.sire <- lmer(weight ~ (1 | sire), data = animals)

As usual, the function summary gives an overview of the fitted model.

summary(fit.sire)
## Linear mixed model fit by REML ['lmerMod']
## Formula: weight ~ (1 | sire)
##    Data: animals
## 
## REML criterion at convergence: 358.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9593 -0.7459 -0.1581  0.8143  1.9421 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  sire     (Intercept) 116.7    10.81   
##  Residual             463.8    21.54   
## Number of obs: 40, groups:  sire, 5
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   82.550      5.911   13.96

In the summary we can read off the table labelled Random Effects that \(\widehat{\sigma}_{\alpha}^2 = 116.7\) and \(\widehat{\sigma}^2 = 463.8\). Note that the column Std.Dev is nothing else than the square root of the variance and not the standard error (accuracy) of the variance estimate. The variance of \(Y_{ij}\) is therefore estimated as \(116.7 + 463.8 = 580.5\). Hence, only about \(116.7 / 580.5 = 20\%\) of the total variance of the birth weight is due to sire (this is the intraclass correlation). This confirms our visual inspection from above.

Under Fixed effects we find the estimate \(\widehat{\mu} = 82.55\). It is an estimate for the expected birthweight of a male calve of a randomly selected sire (randomly selected from the whole population of all sires).

Approximate confidence intervals can be obtained with the function confint.

confint(fit.sire, oldNames = FALSE)
##                        2.5 %   97.5 %
## sd_(Intercept)|sire  0.00000 24.61580
## sigma               17.32943 27.76544
## (Intercept)         69.83802 95.26197

Hence, an approximate confidence interval for \(\sigma_{\alpha}\) is given by \([0, \, 24.62]\). We see that the estimate \(\widehat{\sigma}_{\alpha}\) is therefore quite imprecise. The reason is that we only have five sires to estimate the variance.

What would happen if we used the “ordinary” aov function here?

options(contrasts = c("contr.sum", "contr.poly"))
fit.aov <- aov(weight ~ sire, data = animals)
confint(fit.aov)
##                 2.5 %   97.5 %
## (Intercept)  75.63725 89.46275
## sire1       -12.75051 14.90051
## sire2         1.12449 28.77551
## sire3       -33.25051 -5.59949
## sire4       -18.87551  8.77551

As we have used contr.sum the confidence interval for \(\mu\) which can be found under (Intercept) is a confidence interval for the average of the expected birthweight of the male offsprings of these five specific sires. More precisely: Each of the five sires has its expected birthweight \(\mu_i, i = 1, \ldots, 5\) (of a male offspring). With contr.sum we have \[ \mu = \frac{1}{5} \sum_{i=1}^g \mu_i, \] i.e., \(\mu\) is the average of the expected value of these five specific sires. We see that this confidence interval is shorter than the one from the random effects model. The reason is the different interpretation. The random effects model allows to make inference about the population of all sires (whereof we have seen five so far) while the fixed effects model allows to make inference about these five specific sires. Hence, we are facing a more difficult problem with the random effects model, this is why we are less confident in our estimate resulting in wider confidence intervals compared to the fixed effects model.

We can also get “estimates” (more precisely: conditional means) of the random effects \(\alpha_i\) with the function ranef.

ranef(fit.sire) ## "estimated" (better: conditional means of) random effects
## $sire
##   (Intercept)
## 1   0.7183096
## 2   9.9895155
## 3 -12.9796882
## 4  -3.3743848
## 5   5.6462479
## 
## with conditional variances for "sire"

We should of course also check the model assumptions. Here, this includes in addition normality of the random effects (though this is hard to check with only five observations). We get the Tukey-Anscombe plot with the plot function.

plot(fit.sire) ## TA-plot

To get QQ-plots of the random effects and the residuals we need to extract them first and then use qqnorm as usual.

par(mfrow = c(1, 2))
qqnorm(ranef(fit.sire)$sire[, "(Intercept)"], main = "Random effects")
qqnorm(resid(fit.sire), main = "Residuals")

Sometimes you also see tests of the form \(H_0: \sigma_{\alpha} = 0\) vs.  \(H_A: \sigma_{\alpha} > 0\). For model (7.1) we could actually get exact p-values, while for complex models later this is not the case anymore. There, approximate or simulation based methods have to be used. Some can be found in package RLRsim (Scheipl 2020).

7.1.2 More Than One Factor

So far this was a one-way ANOVA model with random effects. We can extend this to the two-way ANOVA situation. For this reason we consider Example 7.1 in Kuehl (2000). A manufacturer was developing a new spectrophotometer for medical labs. A critical issue is consistency of measurements from day to day among different machines. 4 machines were randomly selected from the production process and tested on 4 randomly selected days. Per day 8 serum samples were randomly assigned to the 4 machines (2 samples per machine). Response is the triglyceride level [mg/dl] of a sample.

## Create data-set ####
##     machine 1     machine 2     machine 3     machine 4
y <- c(142.3, 144.0, 148.6, 146.9, 142.9, 147.4, 133.8, 133.2, ## day 1
       134.9, 146.3, 145.2, 146.3, 125.9, 127.6, 108.9, 107.5, ## day 2
       148.6, 156.5, 148.6, 153.1, 135.5, 138.9, 132.1, 149.7, ## day 3
       152.0, 151.4, 149.7, 152.0, 142.9, 142.3, 141.7, 141.2) ## day 4

trigly <- data.frame(y = y, day = factor(rep(1:4, each = 8)),
                     machine = factor(rep(rep(1:4, each = 2), 2)))
str(trigly)
## 'data.frame':    32 obs. of  3 variables:
##  $ y      : num  142 144 149 147 143 ...
##  $ day    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 2 2 ...
##  $ machine: Factor w/ 4 levels "1","2","3","4": 1 1 2 2 3 3 4 4 1 1 ...
## verify number of observations                                  
xtabs(~ day + machine, data = trigly)
##    machine
## day 1 2 3 4
##   1 2 2 2 2
##   2 2 2 2 2
##   3 2 2 2 2
##   4 2 2 2 2

We can for example visualize this data set with an interaction plot.

## Plot data ####
with(trigly, interaction.plot(x.factor = day, trace.factor = machine, response = y))

We denote the triglyceride level of the \(k\)th sample on day \(i\) and machine \(j\) by \(Y_{ijk}\). We use the model \[\begin{equation} Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \tag{7.2} \end{equation}\] where \(\alpha_i\) is the random effect of day, \(\beta_j\) is the random effect of machine and \((\alpha\beta)_{ij}\) is the random interaction term between day and machine. For the random effects we have the usual assumptions \[ \alpha_i \, \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha}^2), \quad \beta_j \, \textrm{ i.i.d.} \sim N(0, \sigma_{\beta}^2), \quad (\alpha\beta)_{ij} \, \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha\beta}^2). \] Let us fit this in R. We want to have a random effect per day (= (1 | day)), a random effect per machine (= (1 | machine)) and a random effect per combination of machine and day (= (1 | machine:day)).

fit.trigly <- lmer(y ~ (1 | day) + (1 | machine) + (1 | machine:day), data = trigly)
summary(fit.trigly)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ (1 | day) + (1 | machine) + (1 | machine:day)
##    Data: trigly
## 
## REML criterion at convergence: 215
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.84283 -0.35581  0.03485  0.20700  2.31766 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  machine:day (Intercept) 34.72    5.892   
##  machine     (Intercept) 57.72    7.597   
##  day         (Intercept) 44.69    6.685   
##  Residual                17.90    4.230   
## Number of obs: 32, groups:  machine:day, 16; machine, 4; day, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  141.184      5.323   26.52

From the output we get \(\widehat{\sigma}_{\alpha}^2 = 44.7\) (day), \(\widehat{\sigma}_{\beta}^2 = 57.7\) (machine), \(\widehat{\sigma}_{\alpha\beta}^2 = 34.7\) (interaction of day and machine) and \(\widehat{\sigma}^2 = 17.9\).

Hence, total variance is \(44.7 + 57.7 + 34.7 + 17.9 = 155\). We see that the largest contribution to the variance is variability between different machines.

7.1.3 Nesting

To introduce a new concept we consider the Pastes data set in lme4. The strength of a chemical paste product was measured for a total of 60 samples coming from 10 randomly selected delivery batches each containing 3 randomly selected casks. Hence, two samples were taken from each cask. We want to check what part of the variability of strength is due to batch and cask.

We first load the data.

data("Pastes", package = "lme4")
str(Pastes)
## 'data.frame':    60 obs. of  4 variables:
##  $ strength: num  62.8 62.6 60.1 62.3 62.7 63.1 60 61.4 57.5 56.9 ...
##  $ batch   : Factor w/ 10 levels "A","B","C","D",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ cask    : Factor w/ 3 levels "a","b","c": 1 1 2 2 3 3 1 1 2 2 ...
##  $ sample  : Factor w/ 30 levels "A:a","A:b","A:c",..: 1 1 2 2 3 3 4 4 5 5 ...

If we carefully think about the data structure, we have just discovered a new way of combining factors. Cask 1 in batch 1 has nothing to do with cask 1 in batch 2 and so on. The “1” of cask has a different meaning for every batch. Hence, cask and batch are not crossed. We say cask is nested in batch.

We use ggplot2 to visualize the data set (R code for interested readers only).

ggplot(Pastes, aes(y = cask, x = strength)) + geom_point() + facet_grid(batch ~ .)

The batch effect doesn’t seem to be very pronounced. On the other side, casks within the same batch can be substantially different. Let us now model this with appropriate random effects.

Let \(Y_{ijk}\) be the strength of the \(k\)th sample of cask \(j\) in batch \(i\). We can then use the model \[\begin{equation} Y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{k(ij)}, \tag{7.3} \end{equation}\] where \(\alpha_i\) is the (random) effect of batch and \(\beta_{j(i)}\) is the (random) effect of cask within batch. Note the special notation \(\beta_{j(i)}\) which emphasizes that cask is nested in batch. We have also used this nesting notation for the error term here (the error is always nested, we have just ignored this so far). We have the “usual” assumptions on the random effects \[ \alpha_i \, \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha}^2), \quad \beta_{j(i)} \, \textrm{ i.i.d.} \sim N(0, \sigma_{\beta}^2), \] similarly for the error term.

We have to tell lmer about the nesting structure. There are multiple ways to do so. We can use the notation (1 | batch/cask) which means that we want to have a random effect per batch and per cask within batch. We could also use (1 | batch) + (1 | cask:batch) which means that we want to have a random effect per batch and a random effect per combination of batch and cask (which is the same as a nested effect). Last but not least, here we could also use (1 | batch) + (1 | sample). Why? Because sample is the combination of batch and cask. Note that we cannot use the notation (1 | batch) + (1 | cask) here because then all casks with the same number (across different batches) would share the same effect. This is not what we want.

fit.paste <- lmer(strength ~ (1 | batch/cask), data = Pastes)
summary(fit.paste)
## Linear mixed model fit by REML ['lmerMod']
## Formula: strength ~ (1 | batch/cask)
##    Data: Pastes
## 
## REML criterion at convergence: 247
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4798 -0.5156  0.0095  0.4720  1.3897 
## 
## Random effects:
##  Groups     Name        Variance Std.Dev.
##  cask:batch (Intercept) 8.434    2.9041  
##  batch      (Intercept) 1.657    1.2874  
##  Residual               0.678    0.8234  
## Number of obs: 60, groups:  cask:batch, 30; batch, 10
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  60.0533     0.6769   88.72

We see that most variation is in fact due to cask within batch.

A nested effect has also some associated degrees of freedom. In the previous example cask has 3 levels in each of the 10 batches. The reasoning is now as follows. As cask has 2 degrees of freedom in each level of the 10 batches, the degrees of freedom of cask are \(10 \cdot 2 = 20\).

7.2 Mixed Effects Models

In practice we often encounter models which contain both random and fixed effects. We call them mixed effects models.

7.2.1 Example: Machines Data

We start with the data set Machines from the package nlme. As stated in the help file: “Data on an experiment to compare three brands of machines used in an industrial process are presented in Milliken and Johnson (p. 285, 1992). Six workers were chosen randomly among the employees of a factory to operate each machine three times. The response is an overall productivity score taking into account the number and quality of components produced.”

data("Machines", package = "nlme")
## technical detail for nicer output:
Machines[, "Worker"] <- factor(Machines[, "Worker"], levels = 1:6, ordered = FALSE)
str(Machines, give.attr = FALSE) ## give.attr in order to shorten output
## Classes 'nffGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   54 obs. of  3 variables:
##  $ Worker : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ Machine: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
##  $ score  : num  52 52.8 53.1 51.8 52.8 53.1 60 60.2 58.4 51.1 ...

Let us first visualize the data (R code for interested readers only).

ggplot(Machines, aes(x = Machine, y = score, group = Worker, col = Worker)) + 
  geom_point() + stat_summary(fun.y = mean, geom = "line")

## classical interaction plot would be 
with(Machines, interaction.plot(x.factor = Machine, trace.factor = Worker, 
                                response = score))

We observe that productivity is largest on machine \(C\), followed by \(B\) and \(A\). Most workers show a similar “profile”, with the exception of worker 6 who performs badly on machine \(B\).

Let us now model this data. We assume that there is a population machine effect (think of an average “profile”), but each worker is allowed to have its own (random) deviation. With \(Y_{ijk}\) we denote the \(k\)th productivity score of worker \(j\) on machine \(i\). We use the model \[ Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \] where \(\alpha_i\) is the fixed effect of machine \(i\) (with the usual side constraint), \(\beta_j\) is the random effect of worker \(j\) and \((\alpha\beta)_{ij}\) is the corresponding (random) interaction. What is the interpretation of this model? The “average” (over the whole population) productivity “profile” with respect to the 3 different machines is given by the fixed effect \(\alpha_i\). The (random) deviation of an individual worker consists of a general shift \(\beta_j\) (worker specific general productivity level) and a worker specific “preference” \((\alpha\beta)_{ij}\) with respect to the 3 machines.

We assume that all random effects are normally distributed, this means \[ \beta_j \, \textrm{ i.i.d.} \sim N(0, \sigma_{\beta}^2), \quad (\alpha\beta)_{ij} \, \textrm{ i.i.d.} \sim N(0, \sigma_{\alpha\beta}^2). \] For the error term we have the usual assumptions.

We visualize the model in the following plot. The solid line is the population average of the machine effect \((= \mu + \alpha_i)\), the dashed line is the population average including the worker specific general productivity level \((= \mu + \alpha_i + \beta_j)\), i.e. a parallel shift of the solid line. The dotted line in addition includes the worker specific machine preference \((= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij})\). Observations of worker \(j\) would fluctuate around these values.

We could use the lme4 package to fit such a model. We want to have

  • a random effect \(\beta_j\) per worker: (1 | Worker)
  • a random effect \((\alpha\beta)_{ij}\) per combination of worker and machine: (1 | Worker:Machine)

Hence, the lmer call would look like this

fit <- lmer(score ~ Machine + (1 | Worker) + (1 | Worker:Machine), data = Machines)

As lme4 does not calculate p-values for the fixed effects, we use the package lmerTest instead. Technically speaking, it is nothing else than a wrapper for the same function in package lme4 but with modified outputs which include p-values. There are still many open issues regarding statistical inference in mixed effects model, see for example http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html. However, for “nice” designs (as the current one), we get proper inference using the lmerTest package.

options(contrasts = c("contr.treatment", "contr.poly"))
library(lmerTest)
fit <- lmer(score ~ Machine + (1 | Worker) + (1 | Worker:Machine), data = Machines)

We get the ANOVA table with the function anova.

anova(fit)
## Type III Analysis of Variance Table with Satterthwaite's method
##         Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## Machine 38.051  19.025     2    10  20.576 0.0002855 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The (fixed) effect of machine is significant. If we closely inspect the output, we see that an \(F\)-distrubtion with 2 and 10 degrees of freedom is being used. Where does the 10 come from? It seems to be a rather small number given the sample size. Let us remind ourselves what the fixed effect actually means. It is the average machine effect, where the average is taken over the whole population of workers. We know that every worker has its own random deviation of this effect. Hence, the relevant question is whether the worker profiles just fluctuate around a constant machine effect (all \(\alpha_i = 0\)) or whether the machine effect is substantially larger than this worker specific machine variation. Technically speaking, the worker specific machine variation is nothing else than the interaction between worker and machine. Hence, this boils down to comparing the variation between different machines (having 2 degrees of freedom) to the variation due to the interaction between machines and workers (having \(2 \cdot 5 = 10\) degrees of freedom). The lmerTest package automatically detects this because of the structure of the random effects. This way of thinking also allows another insight. If we want to get a more precise picture (smaller confidence intervals, more powerful tests) about the population average of the machine effect, we need to increase the number of workers as this will increase the denominator degrees of freedom. Increasing the number of observations for the available workers will not help.

If we call summary on the fitted object we get the individual \(\alpha_i\)’s under Fixed effects.

summary(fit)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: score ~ Machine + (1 | Worker) + (1 | Worker:Machine)
##    Data: Machines
## 
## REML criterion at convergence: 215.7
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.26959 -0.54847 -0.01071  0.43937  2.54006 
## 
## Random effects:
##  Groups         Name        Variance Std.Dev.
##  Worker:Machine (Intercept) 13.9095  3.7295  
##  Worker         (Intercept) 22.8584  4.7811  
##  Residual                    0.9246  0.9616  
## Number of obs: 54, groups:  Worker:Machine, 18; Worker, 6
## 
## Fixed effects:
##             Estimate Std. Error     df t value Pr(>|t|)    
## (Intercept)   52.356      2.486  8.522  21.062 1.20e-08 ***
## MachineB       7.967      2.177 10.000   3.660  0.00439 ** 
## MachineC      13.917      2.177 10.000   6.393 7.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) MachnB
## MachineB -0.438       
## MachineC -0.438  0.500

Again, be careful with the interpretation because it depends on the side constraint that is being used. For example, here we can read off the output that the productity score on machine \(B\) is 7.97 larger than on machine \(A\) on average. We can also get the parameter estimates of the fixed effects only by calling fixef(fit).

Estimates of the different variance components can be found under Random effects. Often, we are not very much interested in the actual values. We use the random effects as a “tool” to model correlated data and to tell the fitting function that we are interested in the population average and not the worker specific machine effects. In that sense, including random effects in a model changes the interpretation of the fixed effects. Hence, it really depends on the research question whether we treat a factor as fixed or random. There is no right or wrong here.

Approximate confidence intervals can be obtained with confint.

confint(fit, oldNames = FALSE)
##                                    2.5 %    97.5 %
## sd_(Intercept)|Worker:Machine  2.3528037  5.431503
## sd_(Intercept)|Worker          1.9514581  9.410584
## sigma                          0.7759507  1.234966
## (Intercept)                   47.3951611 57.315949
## MachineB                       3.7380904 12.195243
## MachineC                       9.6880904 18.145243

For example, a confidence interval for the expected value of the difference between machine \(A\) and \(B\) is given by \([3.74, 12.2]\).

We do the residual analysis as usual.

## Tukey-Anscombe plot:
plot(fit)

## QQ-plots:
par(mfrow = c(1, 3))
qqnorm(ranef(fit)$Worker[, 1], main = "Random effects of worker")
qqnorm(ranef(fit)$'Worker:Machine'[, 1], main = "Random interaction")
qqnorm(resid(fit), main = "Residuals")

## could also use the following code for generating QQ-plots:
## library(lattice)
## qqmath(ranef(fit))

The Tukey-Anscombe plot looks good. The QQ-plots could look better, however we don’t have a lot of observations.

Again, in order to better understand the mixed effects model we check what happens if we would fit a purely fixed effects model here.

fit.fixed <- aov(score ~ Machine * Worker, data = Machines)
summary(fit.fixed)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Machine         2 1755.3   877.6  949.17 <2e-16 ***
## Worker          5 1241.9   248.4  268.62 <2e-16 ***
## Machine:Worker 10  426.5    42.7   46.13 <2e-16 ***
## Residuals      36   33.3     0.9                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The machine effect is much more significant. This is because the fixed effects model makes a statement about the machine effect of these 6 specific workers and not about the population average (in the same spirit as in the sire example above).

Some technical detail: We can actually get the correct p-value for the mixed effects model from the above fixed effects model output. If we divide the machine mean square by the mean square of the interaction effect we get 20.58. Hence, the p-value of machine is given by

pf(20.58, df1 = 2, df2 = 10, lower.tail = FALSE)
## [1] 0.0002853299

This is exctly the value as in the output of the mixed model from above (up to some rounding issues).

7.2.2 Example: Cheese Data

We continue with Example 12.2 from Oehlert (2010) (with some simulated data). A group of 10 “rural” and 10 “urban” raters rated 4 different cheese types \((A, B, C, D)\). Every rater got to eat two samples from the same cheese type in random order. Hence, we have a total of \(20 \cdot 4 \cdot 2 = 160\) observations.

cheese <- read.table("http://stat.ethz.ch/~meier/teaching/data/cheese.dat", header = TRUE)
cheese[, "rater"] <- factor(cheese[, "rater"])
str(cheese)
## 'data.frame':    160 obs. of  4 variables:
##  $ cheese    : chr  "A" "A" "B" "B" ...
##  $ rater     : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ background: chr  "rural" "rural" "rural" "rural" ...
##  $ y         : int  67 66 51 53 75 70 68 66 76 76 ...

Note that the factor rater has only 10 levels. We have to be careful here and should not forget that rater is actually nested in background (so, the raters labelled “1” in the urban and the rural group have nothing in common).

We can easily visualize this data with an interaction plot. We use the package ggplot2 to get a more appealing plot compared to the function interaction (R code for interested readers only).

ggplot(cheese, aes(x = cheese, y = y, group = interaction(background, rater), 
                   color = background)) + stat_summary(fun.y = mean, geom = "line")

Questions that could arise:

  • Is there a background effect (urban / rural)?
  • Is there a cheese type effect?
  • Does the effect of cheese type depend on background (interaction)?
  • How large is the variability between different raters regarding general cheese liking level?
  • How large is the variability between different raters regarding cheese type preference?

Here, background and cheese type are fixed effects as we want to make a statement about these specific levels. However, we treat rater as random as we want the fixed effects to be population averages and as we are interested in the variation between different raters.

We can use the model \[ Y_{ijkl} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \delta_{k(i)} + (\beta\delta)_{jk(i)} + \epsilon_{ijkl}, \] where

  • \(\alpha_i\) is the (fixed) effect of background,
  • \(\beta_j\) is the (fixed) effect of cheese type
  • \((\alpha\beta)_{ij}\) is the corresponding interaction (fixed effect)
  • \(\delta_{k(i)}\) is the random effect of rater (“general cheese liking level of rater \(k(i)\)”)
  • \((\beta\delta)_{jk(i)}\) is the (random) interaction of rater and cheese type (“specific cheese type preference of rater \(k(i)\)”).

In this setup, there is an “average” (over the whole population) preference “profile” with respect to the 4 different cheese types but every rater can have its own random deviation of this “profile”. This deviation consists of a general shift \(\delta_{k(i)}\) (general cheese liking of a rater) and a rater specific preference \((\beta\delta)_{jk(i)}\) with respect to the 4 cheese types. This is visualized in the following picture (for urban raters only, \(i = 2\)). The situation is basically very similar to the previous example about machines. The solid line is the population average \((= \mu + \alpha_2 + \beta_j + (\alpha\beta)_{2j})\). The dashed line is a parallel shift including the rater specific general cheese liking level \((= \mu + \alpha_2 + \beta_j + (\alpha\beta)_{2j} + \delta_{k(2)})\). The dotted line in addition includes the rater specific cheese type preference \((\mu + \alpha_2 + \beta_j + (\alpha\beta)_{2j} + \delta_{k(2)} + (\beta\delta)_{jk(2)})\). Therefore, the fixed effects have to be interpreted as population averages.

Let us now try to fit this model in R.

We want to have

  • main effects and the interaction for the fixed effects of background and cheese type: background * cheese
  • a random effect per rater (nested in background): (1 | rater:background)
  • a random effect per cheese type and rater (nested in background): (1 | rater:background:cheese).

We always write rater:background to get a unique rater ID. An alternative would be to define another factor in the data set which enumerates the raters from 1 to 20 (instead from 1 to 10).

Hence, the lmer call looks as follows.

library(lmerTest)
fit <- lmer(y ~ background * cheese + (1 | rater:background) + 
              (1 | rater:background:cheese), data = cheese)

We get an ANOVA table with p-values for the fixed effects with the function anova.

anova(fit)
## Type III Analysis of Variance Table with Satterthwaite's method
##                    Sum Sq Mean Sq NumDF DenDF  F value Pr(>F)    
## background          82.13   82.13     1    18   8.2590 0.0101 *  
## cheese            3144.66 1048.22     3    54 105.4149 <2e-16 ***
## background:cheese   15.42    5.14     3    54   0.5168 0.6725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the interaction is not significant but there is a significant effect of background and cheese type. This is what we basically already observed in the interaction plot. There, the profiles were quite parallel, but raters with urban background rated higher on average than those with rural background. In addition, there was a clear difference between different cheese types.

Can we get an intuitive idea about the denominator degrees of freedom that are being used in these tests?

  • The main effect of background can be thought of as a two-sample \(t\)-test with two groups having 10 observations each. Think of taking one (average) value per rater. Hence, we get \(2 \cdot 10 - 2 = 18\) degrees of freedom.
  • As we allow for a rater specific cheese type preference, we have to check whether the effect of cheese is substantially larger than this rater specific variation. The rater specific variation can be thought of as the interaction between rater and cheese type. As rater is nested in background, rater has 9 degrees of freedom in each background group. Hence, the interaction has (\((9 + 9) \cdot 3 = 54\)) degrees of freedom.
  • The same argument as above holds true for the interaction between background and cheese type.

Hence, if we want to get a more precise view about these population average effects, we need to increase the number of raters, which seems pretty natural.

Approximate confidence intervals for the individual coefficients of the fixed effects and the variance components can again be obtained with confint.

confint(fit, oldNames = FALSE)
##                                              2.5 %     97.5 %
## sd_(Intercept)|rater:background:cheese   2.5977830   4.458964
## sd_(Intercept)|rater:background          3.5472552   7.425635
## sigma                                    2.7213541   3.713053
## (Intercept)                             66.4559409  74.744059
## backgroundurban                          2.5394154  14.260584
## cheeseB                                -20.6130977 -13.286902
## cheeseC                                  0.9869023   8.313098
## cheeseD                                 -5.1630977   2.163098
## backgroundurban:cheeseB                 -7.6304024   2.730402
## backgroundurban:cheeseC                 -4.8304024   5.530402
## backgroundurban:cheeseD                 -7.1304024   3.230402

Bibliography

Bates, Douglas, Martin Maechler, Ben Bolker, and Steven Walker. 2020. Lme4: Linear Mixed-Effects Models Using Eigen and S4. https://github.com/lme4/lme4/.

Kuehl, R. O. 2000. Design of Experiments: Statistical Principles of Research Design and Analysis. Duxbury/Thomson Learning.

Kuznetsova, Alexandra, Per Bruun Brockhoff, and Rune Haubo Bojesen Christensen. 2020. LmerTest: Tests in Linear Mixed Effects Models. https://github.com/runehaubo/lmerTestR.

Oehlert, Gary. 2010. A First Course in Design and Analysis of Experiments. http://users.stat.umn.edu/~gary/book/fcdae.pdf.

Scheipl, Fabian. 2020. RLRsim: Exact (Restricted) Likelihood Ratio Tests for Mixed and Additive Models. https://github.com/fabian-s/RLRsim.