# Chapter 10 Power

## 10.1 Introduction

By construction, a statistical test controls the so-called type I error
rate with the significance level \(\alpha\). This means the probability
that we *falsely* reject the null hypothesis \(H_0\) is less than \(\alpha\).
Besides the type I error, there is also a type II error. It occurs if we
fail to reject the null hypothesis even though the alternative
hypothesis \(H_A\) holds. The probability of a type II error is typically
denoted by \(\beta\) (and we are *not* controlling it). Note that there is no
“universal” \(\beta\), it actually depends on the specific alternative \(H_A\) that
we assume. There
is no such thing as “the” alternative \(H_A\), we have to do a decision
here. An easy example is the one-sample \(t\)-test where we could have
\[\begin{align*}
H_0: \mu & = 0 \\
H_A: \mu & \neq 0
\end{align*}\]
We can for example take \(\mu = 1 \in H_A\) and try to calculate
\(\beta\). Intuitively, it seems clear that the further away we choose
\(\mu\) from \(H_0\), the *smaller* will be the probability of a type II
error.

The **power** of a statistical test (for a certain
alternative) is defined as
\[
P(\textrm{reject $H_0$ given that a certain setting under the alternative $H_A$ holds}) = 1 - \beta.
\]
It translates as “what is the probability to get a significant result,
assuming a certain setting under the alternative \(H_A\) holds”. In the
previous example: if we believe that \(\mu = 1\) is true, then power
is the probability for getting a significant result for this specific
setting. Again: The further away \(\mu\) from \(H_0\), the *larger* will be
power.

Calculating power is like a “thought experiment”. We do *not* need data
but a *precise* specification of the parameter setting under the
alternative that we believe in (“what would happen if …”).

## 10.2 Calculating Power for a Certain Design

Why should we be interested in such a concept when planning an experiment? Power can be thought of as the probability of “success” (= getting a significant result). If we plan an experiment with low power, it means that we waste time and money because with high probability we are not getting a significant result. A “rule of thumb” is that power should be larger than 80%.

What does power actually depend on? Power depends on

- design of the experiment (balanced, unbalanced, without blocking, with blocking, …)
- significance level \(\alpha\)
- parameter setting under the alternative (incl. error variance \(\sigma^2\))
- sample size \(n\)

We can mainly maximize power using the first (design) and the last item (sample size) from above.

For “easy” designs like a completely randomized design, there are closed-form formulas to calculate power. As soon as the design is getting more complex, things are typically not tractable anymore. However, this is no problem. What we can always do is to simulate a lot of data sets under the alternative that we believe in and see how often we are rejecting the corresponding null hypothesis.

We have a look at both approaches using a one-way ANOVA model with five groups (that is we have \(\mu_1, \mu_2, \ldots, \mu_5\)). The null hypothesis is \[ H_0: \mu_1 = \ldots = \mu_5 \] while the alternative hypothesis is \[ H_A: \mu_k \neq \mu_l \textrm{ for at least one pair } k \neq l. \] For example we could assume that under \(H_A\) we have \[ \mu_1 = 57, \, \mu_2 = 63, \, \mu_3 = 60, \, \mu_4 = 60, \, \mu_5 = 60. \] In addition we have to specify the error variance. We assume it to be \(\sigma^2 = 7.5\).

For such an easy design, we can use the function `power.anova.test`

to
calculate the power. It needs the number of groups as input (here: five),
the variance between the group means \(\mu_i\), the error variance
\(\sigma^2\) and the sample size within each group (assuming a balanced
design). By default it uses a 5% significance level. If we have \(n = 4\)
observations in each group we get:

```
mu <- c(57, 63, rep(60, 3))
sigma2 <- 7.5
power.anova.test(groups = length(mu), n = 4, between.var = var(mu), within.var = sigma2)
```

```
##
## Balanced one-way analysis of variance power calculation
##
## groups = 5
## n = 4
## between.var = 4.5
## within.var = 7.5
## sig.level = 0.05
## power = 0.5452079
##
## NOTE: n is number in each group
```

This means: We have a 54% chance to get a significant result under the above settings.

We can also leave away the argument `n`

and use the argument `power`

to
get the required sample size (per group) for a certain power (here:
80%).

```
##
## Balanced one-way analysis of variance power calculation
##
## groups = 5
## n = 6.002296
## between.var = 4.5
## within.var = 7.5
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
```

This means for getting a power of 80% we need 6 observations per group.

What happens if we are facing a situation where there is no dedicated
function like `power.anova.test`

? We can easily calculate power using
simulations. Based on our specific setting under the alternative we are
simulating many times a new data set, fit the one-way ANOVA model and
check whether the corresponding \(F\)-test is significant or not.

```
n.sim <- 1000 ## number of simulations
mu <- c(57, 63, rep(60, 3)) ## group means
sigma2 <- 7.5 ## error variance
n <- 4 ## number of observations per group
g <- length(mu)
group <- factor(rep(LETTERS[1:g], each = n))
results <- numeric(n.sim) ## vector to store results in
for(i in 1:n.sim){
## Simulate new response, build data set
y <- rnorm(n * g, mean = rep(mu, each = n), sd = sqrt(sigma2))
data <- data.frame(y = y, group = group)
## Fit one-way ANOVA model
fit <- aov(y ~ group, data = data)
## Extract result of global F-test
results[i] <- summary(fit)[[1]][1, "Pr(>F)"] < 0.05 ## 1 = reject
}
mean(results) ## proportion of simulation runs that rejected H_0
```

`## [1] 0.558`

We get the result that in 56% of the
cases we get a significant result. This is very close to the result from
above. If we wanted to increase the precision of our calculations we
could easily do this by using a larger value for the number of
simulation runs `n.sim`

.

A nice side-effect of doing a power analysis is that you actually do the whole data analysis on simulated data and you immediately see whether this works as intended.

From a conceptual point of view we can use such a simulation-based procedure for any design. However, the number of parameters grows rather quickly with increasing model complexity and some of them are hard to specify in practice (for example the error variance, or the variances of different random effects). If we are lucky we have data from a pre-study. Unfortunately, the variance estimates are quite imprecise if we only have very limited amount of data.

In that sense, the results of a power analysis are typically not very
precise. However, they should still give us a *rough* idea about the
required sample size in the sense whether we rather need 6 or 60
observations per group.