Chapter 10 Power

10.1 Introduction

By construction, a statistical test controls the so-called type I error rate at the significance level \(\alpha\). This means the probability that we falsely reject the null hypothesis \(H_0\) is at most \(\alpha\). Besides the type I error, there is also a type II error. It occurs if we fail to reject the null hypothesis even though the alternative hypothesis \(H_A\) holds. The probability of a type II error is typically denoted by \(\beta\) (and we are not controlling it). Note that there is no “universal” \(\beta\); it depends on the specific alternative \(H_A\) that we assume. There is no such thing as “the” alternative \(H_A\); we have to make a decision here. An easy example is the one-sample \(t\)-test, where we could have \[\begin{align*} H_0: \mu & = 0 \\ H_A: \mu & \neq 0 \end{align*}\] We can for example take \(\mu = 1 \in H_A\) and try to calculate \(\beta\). Intuitively, it seems clear that the further away from \(H_0\) we choose \(\mu\), the smaller the probability of a type II error will be.
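To make this concrete, here is a sketch with hypothetical values (the alternative above does not fix \(\sigma\) or \(n\), so these are assumptions for illustration): taking \(\sigma = 1\) and \(n = 10\), the function power.t.test computes the power of the one-sample \(t\)-test for the alternative \(\mu = 1\), and \(\beta\) is then simply one minus the reported power.

## Illustration only: sigma = 1 and n = 10 are assumed values
power.t.test(n = 10, delta = 1, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")

The argument delta is the difference between the mean under the alternative and the mean under \(H_0\), here \(1 - 0 = 1\).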

The power of a statistical test (for a certain alternative) is defined as \[ P(\textrm{reject $H_0$ given that a certain setting under the alternative $H_A$ holds}) = 1 - \beta. \] It translates as: “What is the probability of getting a significant result, assuming a certain setting under the alternative \(H_A\) holds?” In the previous example: if we believe that \(\mu = 1\) is true, then the power is the probability of getting a significant result for this specific setting. Again: the further \(\mu\) is from \(H_0\), the larger the power will be.

Calculating power is like a “thought experiment.” We do not need data but a precise specification of the parameter setting under the alternative that we believe in (“what would happen if …”).

10.2 Calculating Power for a Certain Design

Why should we be interested in such a concept when planning an experiment? Power can be thought of as the probability of “success” (= getting a significant result). If we plan an experiment with low power, we waste time and money because with high probability we will not get a significant result. A “rule of thumb” is that power should be larger than 80%.

What does power actually depend on? Power depends on

  • design of the experiment (balanced, unbalanced, without blocking, with blocking, …)
  • significance level \(\alpha\)
  • parameter setting under the alternative (incl. error variance \(\sigma^2\))
  • sample size \(n\)

In practice, we can mainly influence power through the first item (the design) and the last (the sample size).

For “easy” designs like a completely randomized design, there are closed-form formulas to calculate power. As soon as the design gets more complex, things are typically not tractable anymore. However, this is no problem: we can always simulate a lot of data sets under the alternative that we believe in and check how often we reject the corresponding null hypothesis.

We will look at both approaches using a one-way ANOVA model with five groups (that is, we have \(\mu_1, \mu_2, \ldots, \mu_5\)). The null hypothesis is \[ H_0: \mu_1 = \ldots = \mu_5 \] while the alternative hypothesis is \[ H_A: \mu_k \neq \mu_l \textrm{ for at least one pair } k \neq l. \] For example, we could assume that under \(H_A\) we have \[ \mu_1 = 57, \, \mu_2 = 63, \, \mu_3 = 60, \, \mu_4 = 60, \, \mu_5 = 60. \] In addition, we have to specify the error variance. We assume it to be \(\sigma^2 = 7.5\).

For such an easy design, we can use the function power.anova.test to calculate the power. It needs the number of groups as input (here: five), the variance between the group means \(\mu_i\), the error variance \(\sigma^2\) and the sample size within each group (assuming a balanced design). By default it uses a 5% significance level. If we have \(n = 4\) observations in each group we get:

mu     <- c(57, 63, rep(60, 3))
sigma2 <- 7.5
power.anova.test(groups = length(mu), n = 4, between.var = var(mu), within.var = sigma2)
##      Balanced one-way analysis of variance power calculation 
##          groups = 5
##               n = 4
##     between.var = 4.5
##      within.var = 7.5
##       sig.level = 0.05
##           power = 0.5452079
## NOTE: n is number in each group

This means we have a 54% chance of getting a significant result under the above settings.

We can also omit the argument n and instead use the argument power to get the required sample size (per group) for a certain power (here: 80%).

power.anova.test(groups = length(mu), between.var = var(mu), within.var = sigma2, 
                 power = 0.8)
##      Balanced one-way analysis of variance power calculation 
##          groups = 5
##               n = 6.002296
##     between.var = 4.5
##      within.var = 7.5
##       sig.level = 0.05
##           power = 0.8
## NOTE: n is number in each group

This means that to achieve a power of 80%, we need 6 observations per group (we round 6.002 up to the next integer).
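As a sanity check, we can plug \(n = 6\) back into power.anova.test; because 6 is slightly below the exact value 6.002, the resulting power will be marginally below 80%.

power.anova.test(groups = length(mu), n = 6, between.var = var(mu), 
                 within.var = sigma2)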

What happens if we face a situation where there is no dedicated function like power.anova.test? We can easily calculate power using simulation. Based on our specific setting under the alternative, we repeatedly simulate a new data set, fit the one-way ANOVA model, and check whether the corresponding \(F\)-test is significant or not.

n.sim  <- 1000                  ## number of simulations
mu     <- c(57, 63, rep(60, 3)) ## group means
sigma2 <- 7.5                   ## error variance
n      <- 4                     ## number of observations per group

g     <- length(mu) 
group <- factor(rep(LETTERS[1:g], each = n))

results <- numeric(n.sim) ## vector to store results in

for(i in 1:n.sim){
  ## Simulate new response, build data set
  y <- rnorm(n * g, mean = rep(mu, each = n), sd = sqrt(sigma2))
  data <- data.frame(y = y, group = group)
  ## Fit one-way ANOVA model
  fit  <- aov(y ~ group, data = data)
  ## Extract result of global F-test
  results[i] <- summary(fit)[[1]][1, "Pr(>F)"] < 0.05 ## 1 = reject
}

mean(results) ## proportion of simulation runs that rejected H_0
## [1] 0.558

In about 56% of the simulation runs we get a significant result, which is very close to the result from above. If we wanted to increase the precision of our calculations, we could easily do so by using a larger number of simulation runs n.sim.
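Because the estimated power is simply a proportion of rejections, we can also quantify the simulation accuracy: the Monte Carlo standard error of the estimate \(\hat{p}\) is \(\sqrt{\hat{p}(1-\hat{p})/\texttt{n.sim}}\). For \(\hat{p} = 0.558\) and n.sim = 1000, this is about 0.016.

p.hat <- mean(results)                 ## estimated power
sqrt(p.hat * (1 - p.hat) / n.sim)      ## Monte Carlo standard error, here about 0.016

This tells us how many simulation runs we need for a given precision: quadrupling n.sim halves the standard error.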

A nice side-effect of doing a power analysis is that you actually perform the whole data analysis on simulated data, so you immediately see whether the analysis works as intended.

From a conceptual point of view, we can use such a simulation-based procedure for any design. However, the number of parameters grows rather quickly with increasing model complexity, and some of them are hard to specify in practice (for example, the error variance or the variances of different random effects). If we are lucky, we have data from a pre-study. Unfortunately, variance estimates are quite imprecise if we only have a very limited amount of data.

In that sense, the results of a power analysis are typically not very precise. However, they should still give us a rough idea of the required sample size, for example whether we need roughly 6 or 60 observations per group.