Chapter 2 Learning from Data
2.1 Cause-Effect Relationships
Why do people like to collect (nowadays: big) data? Typically, the goal is to find “relationships” between different “variables”: What fertilizer combination makes my plants grow tall? What medication reduces headache most?
This means that we are in the (abstract) situation where we have a “system” or a “process” with many input variables (predictors) and an output (response). In the previous example the ingredients of the fertilizer are the predictors and the output could be the biomass of a plant.
Ideally, we would like to find cause-effect relationships, meaning that when we actively change one of the inputs (i.e., we make an intervention), this will cause the output to change. If we can just “observe” a system under different settings (so-called observational studies or observational data), it is much harder to make a statement about causal effects, as can be seen in the following examples:
Is the seatbelt sign on an airplane causing a plane to shake? If we could switch it on ourselves, would the plane start shaking? (see Family Circus: “I wish they didn’t turn on that seatbelt sign so much! Every time they do, it gets bumpy.”)
Are ice cream sales causing people to drown? If we stopped selling ice cream, would the number of drownings decrease (or even drop to zero)?
With observational data, we can typically just make a statement about an association between two variables. One potential danger is the existence of confounding variables (short: confounders). A confounder is a common cause for two variables. In the previous example we had the following situation:
Turbulent weather makes the pilot switch on the seatbelt sign and, at the same time, makes the plane shake. What we observe is an association between the appearance of the seatbelt sign and a shaking plane. The seatbelt sign is not a cause of the shaking plane.
Hot weather makes people want to go swimming (perhaps after drinking alcohol) and at the same time is beneficial for ice cream sales. What we observe is an association between ice cream sales and the number of drowning incidents. Ice cream sales are not a cause of the drowning incidents.
A more “classical” example is (was!) the question of whether smoking causes lung cancer, or whether “bad genes” make people both smoke and develop lung cancer. More examples of spurious associations can be found on http://www.tylervigen.com/spurious-correlations.
To find cause-effect relationships, we should ideally be able to make an intervention on a system ourselves. If we occasionally switched on the seatbelt sign (ignoring the weather conditions), we would see that the plane does not start shaking. This is what we do in experimental studies: We actively change the inputs of the process and observe what happens to the output.
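To make the difference between observing and intervening concrete, the following minimal sketch simulates the seatbelt sign example. All numbers (the turbulence threshold of 0.5, the noise levels) and the names `simulate` and `correlation` are assumptions chosen for illustration; nothing here is based on real flight data.

```python
import random
from statistics import fmean


def correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n, intervene, rng):
    """Simulate n flights and return (sign_on, shaking) as two lists."""
    sign_on, shaking = [], []
    for _ in range(n):
        turbulence = rng.gauss(0, 1)  # the confounder (assumed scale)
        if intervene:
            # Experiment: we flip a coin, ignoring the weather.
            sign = 1 if rng.random() < 0.5 else 0
        else:
            # Observation: the pilot reacts to turbulence.
            sign = 1 if turbulence > 0.5 else 0
        # Shaking depends on turbulence only, never on the sign.
        shaking.append(turbulence + rng.gauss(0, 0.5))
        sign_on.append(sign)
    return sign_on, shaking


rng = random.Random(1)
r_obs = correlation(*simulate(5000, intervene=False, rng=rng))
r_int = correlation(*simulate(5000, intervene=True, rng=rng))
print(f"observational correlation: {r_obs:.2f}")
print(f"interventional correlation: {r_int:.2f}")
```

In the observational setting the sign and the shaking are clearly associated, although the sign has no effect by construction. Once the sign is switched on by a coin flip, the association essentially vanishes.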
Remark: It is also possible to make a statement about causal effects using observational data. To do so, we would need to know the underlying “causal diagram” (typically unknown in practice) where direct causal effects are visualized by arrows. A set of rules would then tell us which variables we have to consider in our analysis (“conditioning”). In the seatbelt sign example, if we also consider the weather condition, we would see that there is no causal effect of the seatbelt sign on the movements of the plane. The corresponding causal diagram would look as follows.
Causal relationships are represented by arrows. A more complicated example could look like this:
If we are interested in the total causal effect of “In” on “Out” using observational data, we would need to believe that this causal diagram really represents the truth (in particular, that we did not forget any important variables), and we would have to derive the correct set of variables to condition on (here: only “D”). On the other hand, if we can do an experiment, we simply make an intervention on variable “In” and see what happens with the output “Out”; we don’t have to know the underlying causal diagram.
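The idea of conditioning on a confounder can be sketched with the ice cream example. The code below is only an illustration under strong assumptions: weather is taken to be the only confounder, it is coded as a simple hot/not-hot indicator, and all numerical values (sales levels, the “drowning index,” noise levels) are invented.

```python
import random
from statistics import fmean


def correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2)
days = []
for _ in range(4000):
    hot = rng.random() < 0.5  # confounder: hot day or not
    ice_cream = (100 if hot else 40) + rng.gauss(0, 10)  # sales (made up)
    drowning = (5 if hot else 2) + rng.gauss(0, 1)  # drowning index (made up)
    days.append((hot, ice_cream, drowning))


def corr_of(rows):
    """Correlation between ice cream sales and the drowning index."""
    return correlation([r[1] for r in rows], [r[2] for r in rows])


r_all = corr_of(days)  # ignoring the weather
r_hot = corr_of([d for d in days if d[0]])  # conditioning: hot days only
r_cold = corr_of([d for d in days if not d[0]])  # conditioning: other days

print(f"all days: {r_all:.2f}, hot days: {r_hot:.2f}, other days: {r_cold:.2f}")
```

Across all days the two variables are strongly associated, while within each weather stratum the association is close to zero. This only works because the simulated world follows the assumed causal diagram exactly; with real data we would not know whether weather is the only variable that matters.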
2.2 Experimental Studies
Before designing an experimental study, we must have a focused and precise research question that we want to answer with experimental data, e.g., “How does fertilizer \(A\) compare to fertilizer \(B\) with respect to biomass of plants after 10 weeks?” Quite often, people collect large amounts of data and afterwards think “let’s see whether we find some interesting patterns in our data!” Such an approach is permissible for generating research questions. However, we focus on the part where we want to confirm a specific conjecture. We have to make sure that it is actually testable, i.e., that we can do the appropriate interventions and that we can measure the right response (see below).
An experimental study consists of different “ingredients”:
the different interventions which we perform on the system: the different “treatments,” e.g. different fertilizer combinations that we are interested in, but also other predictors of the system.
experimental units: the “things” (“subjects,” “objects,” …) to which we apply the treatments, e.g. plots of land receiving fertilizer.
a method that assigns treatments to experimental units, typically randomization and restricted randomization (see later).
response(s): the output that we measure, e.g. biomass of plants.
In addition, when designing an experimental study, the analysis of the resulting data should already be considered. E.g., we need an idea of the experimental error (see below), otherwise we cannot do any statistical inference. It is always a good idea to try to analyze some simulated data before performing the experiment, as this can potentially already reveal some serious flaws of the design.
2.2.1 Predictors / Treatments
We distinguish between the following types of predictors:
Predictors that are of primary interest and that can (ideally) be varied according to our “wishes”: the conditions we want to compare, or the “treatments.” E.g., fertilizer type.
Predictors that are systematically recorded such that potential effects can later be eliminated in our calculations (“controlling for …”). E.g., weather conditions.
Predictors that can be kept constant and whose effects can therefore be eliminated. E.g., always using the same measurement device.
Predictors that we can neither record nor keep constant. E.g., some special soil properties that we cannot measure.
2.2.2 Randomization and Blocking
Randomization, i.e., the random allocation of experimental units to the different treatments, ensures that the only systematic difference between the different treatment “groups” is the treatment. This protects us from confounders and is the reason why a properly randomized experiment allows us to make a statement about causality between treatment and response.
“Randomization generally costs little in time and trouble, but it can save us from disaster” (Oehlert 2010).
Typically, we should also randomize
the order in which experimental units are used (if not used simultaneously), as this will protect us from time being a confounder
locations at which experimental units are used (if not all at the same location)
if using multiple measuring instruments: randomize which experimental units are measured on which instruments.
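A completely randomized allocation, together with a randomized run order, can be sketched as follows. The plot names, the two treatments `"A"` and `"B"`, the equal group sizes, and the helper `randomize` are assumptions made for this example.

```python
import random


def randomize(units, treatments, rng):
    """Randomly allocate treatments to units with equal group sizes."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    reps = len(units) // len(treatments)
    # Build the list of treatment labels, then shuffle it.
    labels = [t for t in treatments for _ in range(reps)]
    rng.shuffle(labels)
    return dict(zip(units, labels))


rng = random.Random(42)  # fixed seed so that the allocation is reproducible
plots = [f"plot{i}" for i in range(1, 9)]  # hypothetical experimental units

assignment = randomize(plots, ["A", "B"], rng)

# Also randomize the order in which the plots are processed,
# so that time cannot act as a confounder.
run_order = plots[:]
rng.shuffle(run_order)

for plot in run_order:
    print(plot, assignment[plot])
```

Fixing the seed is a design choice: it documents the allocation and makes it reproducible, while the allocation itself is still produced by a random mechanism rather than by our own judgment.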
Quite often, we already know that some experimental units are more alike than others before doing the experiment. Think for example of different locations for an agricultural experiment. Typically, we then do a randomization “within” homogeneous blocks (here: at each location). This is called blocking. A block is a subset of experimental units that is more homogeneous than the entire set. Blocking (typically) increases the precision of an experiment.
According to George Box: “Block what you can; randomize what you cannot.”
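Randomization within blocks can be sketched in the same spirit. The two locations, the plot names, and the helper `randomize_within_blocks` are hypothetical; the sketch assumes that every block can hold each treatment equally often.

```python
import random


def randomize_within_blocks(blocks, treatments, rng):
    """Allocate treatments separately within each block.

    blocks maps a block name to the list of its experimental units.
    Returns a dict mapping each unit to (block, treatment).
    """
    assignment = {}
    for block, units in blocks.items():
        if len(units) % len(treatments) != 0:
            raise ValueError(f"block {block!r} cannot be balanced")
        reps = len(units) // len(treatments)
        labels = [t for t in treatments for _ in range(reps)]
        rng.shuffle(labels)  # a fresh shuffle for every block
        for unit, label in zip(units, labels):
            assignment[unit] = (block, label)
    return assignment


rng = random.Random(7)
locations = {  # hypothetical blocks: two locations with four plots each
    "north": ["n1", "n2", "n3", "n4"],
    "south": ["s1", "s2", "s3", "s4"],
}
design = randomize_within_blocks(locations, ["A", "B"], rng)
for unit, (block, treatment) in design.items():
    print(block, unit, treatment)
```

Each location receives both fertilizers equally often, so differences between locations cannot be mistaken for differences between treatments.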
2.2.3 Experimental and Measurement Units
An experimental unit is defined as the “thing” to which we apply the treatments by randomization. The general rule is (Oehlert 2010): “An experimental unit should be able to receive any treatment independently of the other units.” A measurement unit, on the other hand, is the unit on which the response is measured. There are (many) cases where experimental units and measurement units are not the same. This can have (severe) consequences for the analysis if not treated appropriately.
If we randomize different food supplies to cages of animals, the experimental unit is given by a cage and not an individual animal. However, the measurement unit will be given by an individual animal of the cage (e.g., we could take as response the body weight of an animal after 5 weeks).
Typically, we aggregate the values of the measurement units such that we get one value per experimental unit (e.g., take the average body weight per cage). These values will typically be the basis for the statistical analysis.
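The aggregation step can be sketched as follows. The body weights, cage names, and diet labels are invented for illustration, and `aggregate_by_unit` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import fmean

# (cage, diet, body weight in grams): one row per animal (measurement unit).
measurements = [
    ("cage1", "supplement", 310), ("cage1", "supplement", 305), ("cage1", "supplement", 315),
    ("cage2", "supplement", 300), ("cage2", "supplement", 298), ("cage2", "supplement", 302),
    ("cage3", "control", 290), ("cage3", "control", 295), ("cage3", "control", 285),
    ("cage4", "control", 280), ("cage4", "control", 284), ("cage4", "control", 282),
]


def aggregate_by_unit(rows):
    """Average the measurement units within each experimental unit (cage)."""
    weights = defaultdict(list)
    diet_of = {}
    for cage, diet, weight in rows:
        weights[cage].append(weight)
        diet_of[cage] = diet
    return {cage: (diet_of[cage], fmean(w)) for cage, w in weights.items()}


per_cage = aggregate_by_unit(measurements)
for cage, (diet, mean_weight) in per_cage.items():
    print(cage, diet, mean_weight)
```

After aggregation there are only two values per diet (one per cage), not six. This is the sample size that the statistical analysis has to work with.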
As we want our results to have broad validity, the experimental units should ideally be a random sample from the population of interest. If the experimental units do not represent the population well, extrapolation from the experimental results to the population will be challenging.
2.2.4 Response
The response should be chosen such that it reflects useful information about the process under study. The response is what you measure to judge what happened in the process. It is your responsibility to ensure that the response is a reasonable quantity for studying your research question. If the quantity of interest is not directly measurable, use a so-called surrogate response (e.g., some specific cell counts as a surrogate for disease progression).
2.2.5 Experimental Error
Consider the following hypothetical example: We run an experiment using two plants. One gets fertilizer \(A\) and the other one \(B\). After 4 weeks we observe that the plant receiving fertilizer \(A\) has twice the biomass of the plant receiving fertilizer \(B\). Can we conclude that fertilizer \(A\) is causing larger biomass? Unfortunately, we cannot say so in this situation, even if we randomized the plants to the two fertilizers. The experiment doesn’t give us any information about whether the difference that we observe is larger than the natural variation from plant to plant. It could very well be that there is no difference between the two fertilizers, meaning that the difference that we observe is just natural variation from plant to plant.
Different experimental units will (always) give different responses to the same treatment. We should design our experiment such that we get an idea of this so-called experimental error. Here, this means we would need multiple plants (replicates) receiving the same treatment. If the difference between the treatments is (much) larger than the experimental error, we can conclude that there is a treatment effect (caused by the fertilizer). You have learned to analyze such data with a two-sample \(t\)-test (which would not work with only one observation in each treatment group).
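The following sketch computes the two-sample \(t\)-statistic with pooled variance by hand. The biomass values are invented, and `pooled_t` is a hypothetical helper, not a library function. The critical value 2.306 is the two-sided 5% quantile of the \(t\)-distribution with 8 degrees of freedom.

```python
from math import sqrt
from statistics import fmean


def sum_sq(x):
    """Sum of squared deviations from the mean."""
    m = fmean(x)
    return sum((v - m) ** 2 for v in x)


def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance; returns (t, df)."""
    df = len(a) + len(b) - 2
    if df < 1:
        # One unit per group: nothing is left to estimate the experimental error.
        raise ValueError("no replicates: experimental error cannot be estimated")
    pooled_var = (sum_sq(a) + sum_sq(b)) / df
    se = sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (fmean(a) - fmean(b)) / se, df


# Hypothetical biomass values, five plants (replicates) per fertilizer.
biomass_a = [12.1, 11.4, 13.0, 12.6, 11.9]
biomass_b = [10.2, 10.9, 9.8, 11.1, 10.5]

t_stat, df = pooled_t(biomass_a, biomass_b)
T_CRIT_DF8 = 2.306  # two-sided 5% critical value for 8 degrees of freedom
print(f"t = {t_stat:.2f} on {df} df, significant: {abs(t_stat) > T_CRIT_DF8}")

# With a single plant per fertilizer the test cannot be carried out.
try:
    pooled_t([12.1], [10.2])
except ValueError as err:
    print("one plant per group:", err)
```

With replicates, the difference in means is compared to the plant-to-plant variation within each group. With one plant per group there are zero degrees of freedom left, which is the formal counterpart of the statement that the experiment gives no information about the experimental error.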
A “true” replicate should be given by another (independent) experimental unit. If we measured each of the above plants 10 times, we would also have multiple measurements per treatment group, but the error that we observe would simply be the measurement error of our measurement device; hence we would still have no clue about the experimental error. We call such measurements pseudoreplicates. From a technical point of view, we could do a two-sample \(t\)-test. However, we could only conclude that these two specific plants are different (for whatever reason). This is not what you typically want to do.
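How misleading pseudoreplicates can be is easy to check with a small simulation in which the two fertilizers have no effect at all. All settings are assumptions for illustration: plant-to-plant standard deviation 1, measurement standard deviation 0.1, 10 observations per group, and the two-sided 5% critical value 2.101 of the \(t\)-distribution with 18 degrees of freedom.

```python
import random
from math import sqrt
from statistics import fmean

T_CRIT_DF18 = 2.101  # two-sided 5% critical value, 18 degrees of freedom


def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance."""
    ma, mb = fmean(a), fmean(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (ma - mb) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def rejects(a, b):
    """True if the t-test declares a difference at the 5% level."""
    return abs(pooled_t(a, b)) > T_CRIT_DF18


def false_positive_rates(n_sim, rng, plant_sd=1.0, meas_sd=0.1, n=10):
    """Share of simulated experiments that wrongly find a fertilizer effect."""
    pseudo = true = 0
    for _ in range(n_sim):
        # Pseudoreplicates: ONE plant per fertilizer, each measured n times.
        plant_a, plant_b = rng.gauss(10, plant_sd), rng.gauss(10, plant_sd)
        a = [plant_a + rng.gauss(0, meas_sd) for _ in range(n)]
        b = [plant_b + rng.gauss(0, meas_sd) for _ in range(n)]
        pseudo += rejects(a, b)
        # True replicates: n independent plants per fertilizer, measured once.
        a = [rng.gauss(10, plant_sd) + rng.gauss(0, meas_sd) for _ in range(n)]
        b = [rng.gauss(10, plant_sd) + rng.gauss(0, meas_sd) for _ in range(n)]
        true += rejects(a, b)
    return pseudo / n_sim, true / n_sim


pseudo_rate, true_rate = false_positive_rates(2000, random.Random(0))
print(f"false positives with pseudoreplicates: {pseudo_rate:.2f}")
print(f"false positives with true replicates:  {true_rate:.2f}")
```

With true replicates the test wrongly reports an effect in roughly 5% of the simulated experiments, as intended. With pseudoreplicates it does so in the large majority of cases, because it only detects that the two specific plants differ.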
2.2.6 More Terminology
Blinding means that the evaluators don’t know which treatment is given to which experimental unit. With humans it is common to use double-blinding: Neither the evaluators nor the patients know the assignment. This protects us from (unintentional) bias due to expectations.
A control treatment is a “standard” treatment used as a baseline for comparison with other treatments. It can also be no treatment at all (“null treatment”). You should always ask yourself: “How does it compare to the null or the standard treatment?”
A placebo is a “null treatment” for situations where the mere act of applying a treatment (whatever it is) potentially has an effect: “I feel better because I got a pill!”
2.2.7 A Few Examples
What’s wrong with the following examples?
Mike is interested in the difference between two teaching techniques. He randomly selects 10 ETH lecturers who will apply technique \(A\) and 10 EPFL lecturers who will apply technique \(B\). Each lecturer reports the average grade of their class.
Gian has two cages of rats. Rats in cage 1 get a special food supply while rats in cage 2 get “ordinary” food (control treatment).
Melanie offers a new exam preparation course. She claims: On average, only 20% of my students fail the exam.