- LDA and QDA
- Walkthrough of a 1D example
- Application to the iris dataset
- Bootstrap generalization error
29 April 2016
Say we have a one-dimensional vector \(x\), our predictor variable, and an associated vector of class labels \(y \in \{1,2\}\).
| x    | y  |
|------|----|
| -1.1 | 1  |
| -0.4 | 1  |
| 0.9  | 2  |
| 0.1  | ?? |
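To follow along in R, here is a minimal sketch that sets up this toy data set (the names `train` and `x_new` are mine):

```r
# The three labelled observations and the new point to classify
train <- data.frame(x = c(-1.1, -0.4, 0.9),
                    y = factor(c(1, 1, 2)))
x_new <- 0.1

# Quick look at the data on the number line
plot(train$x, rep(0, 3), col = train$y, pch = 19,
     xlab = "x", ylab = "", yaxt = "n")
points(x_new, 0, pch = 4, cex = 2)   # the point we want to classify
```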
(Plot: the labelled points and the new point along the \(x\)-axis.)

Task: predict the most likely label \(y\). Where do we put the boundary between the two classes?
\[ \begin{align} X |\; (Y=j) & \sim \mathcal{N}(\mu_j, \sigma_j^2)\\ P(Y=j) &= p_j\\ p_1 + p_2 &= 1 \end{align} \]
Within class \(j\), \(X\) comes from a normal distribution with mean \(\mu_j\) and variance \(\sigma_j^2\). Each individual normal distribution looks like this:
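As a quick illustration, the following sketch plots the two class-conditional densities; the parameter values for \(\mu_j\) and \(\sigma_j\) are made-up illustrative numbers, not estimates from the toy table:

```r
# Assumed illustrative parameters for the two classes
mu    <- c(-1, 1)      # class means mu_1, mu_2
sigma <- c(0.7, 0.7)   # class standard deviations

xx <- seq(-3, 3, length.out = 200)
plot(xx, dnorm(xx, mu[1], sigma[1]), type = "l", col = 2,
     xlab = "x", ylab = "P(X | Y = j)")
lines(xx, dnorm(xx, mu[2], sigma[2]), col = 4)
legend("topright", legend = c("class 1", "class 2"), col = c(2, 4), lty = 1)
```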
You are tempted to just draw the separation like this, right?
(Plot: the data with a boundary drawn naively between the two groups of points.)

To predict the most likely label \(y\) means to pick the label \(j\) for which \(P(Y=j | X=x)=\pi_j(x)\) is the highest. Our previous attempt was wrong because in general:
\[ P(Y | X) \neq P(X | Y) \]
\[ P(Y=j | X ) = \frac{P(X|Y=j) \cdot p_j} {\sum_i P(X|Y=i) \cdot p_i} \]
Caution: Here \(P(X|Y=j)\) should be interpreted as a pdf!
Here we go: plot of \(P(X|Y=j) \cdot p_j\) for both classes
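Numerically, Bayes' formula is just a few lines of R. A sketch, reusing the made-up parameters from above together with priors \(p_1 = 2/3\), \(p_2 = 1/3\) (the class frequencies in the toy table):

```r
# Assumed parameters (same illustrative values as above)
mu <- c(-1, 1); sigma <- c(0.7, 0.7); p <- c(2/3, 1/3)

posterior <- function(x) {
  num <- dnorm(x, mu, sigma) * p   # P(X | Y = j) * p_j for j = 1, 2
  num / sum(num)                   # divide by the denominator of Bayes' formula
}
posterior(0.1)                     # P(Y = 1 | X = 0.1) and P(Y = 2 | X = 0.1)
```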
Ok, but how do we find this decision boundary exactly? Let's do it for our simple example, and you will do it for the general case in the exercise.
Step 1 is quite straightforward.
Step 2 gives the following (dropping the denominator of Bayes' formula, which is the same for every class):
\[ \pi_j(x) \propto \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp \Big( -\frac{1}{2} \frac{(x-\mu_j)^2}{\sigma_j^2} \Big) \; p_j \]
For Step 3 you have different options… You can either solve directly for \(\pi_1(x)=\pi_2(x)\) or first construct the discriminant function \(\delta_j(x)\) and then compare them. This is really the same thing, but it helps to understand how the method works.
Discriminant functions
We only need to find which of the posteriors has the maximum value at \(x\): \(\text{argmax}_j\, \pi_j(x)\).
Realize that, for any constant \(c\) that does not depend on \(j\): \[ \text{argmax}_j\, \pi_j(x) = \text{argmax}_j\, \log( \pi_j(x) ) = \text{argmax}_j\, (c + \log(\pi_j(x))). \]
So we are going to take the log of \(\pi_j(x)\) and subtract all the terms that do not depend on \(j\).
Discriminant functions \[ \begin{align} \delta_j(x) &\propto \pi_j(x) \\ &\propto \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp \Big( -\frac{1}{2} \frac{(x-\mu_j)^2}{\sigma_j^2} \Big) \; p_j\\ &\propto - \log(\sqrt{2 \pi \sigma_j^2 }) - \frac{1}{2} (x-\mu_j)^2 / \sigma_j^2 + \log(p_j) \\ &\propto - \log(\sigma_j) - \frac{1}{2} (x-\mu_j)^2 / \sigma_j^2 + \log(p_j) \\ \end{align} \]
Let's plot these discriminant functions \(\delta_j(x)\):
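A sketch of such a plot, again with the made-up parameters from above:

```r
# Discriminant functions delta_j(x): log of the unnormalised posterior,
# with all terms that do not depend on j dropped
mu <- c(-1, 1); sigma <- c(0.7, 0.7); p <- c(2/3, 1/3)

delta <- function(x, j) {
  -log(sigma[j]) - 0.5 * (x - mu[j])^2 / sigma[j]^2 + log(p[j])
}

xx <- seq(-3, 3, length.out = 200)
plot(xx, delta(xx, 1), type = "l", col = 2,
     xlab = "x", ylab = expression(delta[j](x)))
lines(xx, delta(xx, 2), col = 4)
# the decision boundary is where the two curves cross
abline(v = xx[which.min(abs(delta(xx, 1) - delta(xx, 2)))], lty = 2)
```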
LDA: if we assume that \(\sigma_1=\sigma_2=\sigma\), we get a discriminant function that is linear in \(x\)!
\[ \begin{align} \delta_j(x) &\propto - \log(\sigma) - \frac{1}{2} (x-\mu_j)^2 / \sigma^2 + \log(p_j) \\ &\propto -\frac{1}{2} (x^2 - 2 x\mu_j + \mu_j^2) / \sigma^2 + \log(p_j) \\ &\propto -\frac{1}{2} (- 2 x\mu_j + \mu_j^2) / \sigma^2 + \log(p_j) \end{align} \]
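To finish Step 3 for our 1D example (and as a preview of the characterization you will need in 1c), one way is to set the two LDA discriminants equal and solve for \(x\); with \(\sigma_1=\sigma_2=\sigma\):

\[ \begin{align} \delta_1(x) &= \delta_2(x) \\ \frac{x\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} + \log(p_1) &= \frac{x\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} + \log(p_2) \\ x &= \frac{\mu_1+\mu_2}{2} + \frac{\sigma^2}{\mu_1-\mu_2} \log\Big(\frac{p_2}{p_1}\Big) \end{align} \]

For equal priors \(p_1=p_2\) the boundary is simply the midpoint of the two class means.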
LDA discriminant functions:
Your task in exercise 1 a) and b) is to find these discriminant functions for the general p-dimensional case. Just apply the same principle but be careful with the linear algebra.
For exercise 1 c) you need to characterize the boundary of LDA.
For the application to the iris dataset, use the functions `lda` and `qda` from the `MASS` package.

Remark: In step 3, plotting the decision boundary manually in the case of LDA is relatively easy: you can use the characterization of the boundary that we found in task 1c). However, a general strategy to plot more complex decision boundaries is to predict the class \(y\) for all combinations of the predictors \(x_1\) and \(x_2\) on a fine grid, and then do a contour plot. The function `predplot` given to you in the skeleton does exactly this!
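As an illustration, here is a minimal sketch of the same grid-plus-contour idea (this is not the skeleton's `predplot`; the choice of predictors and grid size are my assumptions):

```r
library(MASS)

# Two predictors of the iris data, to keep the plot two-dimensional
d <- data.frame(x1 = iris$Sepal.Length, x2 = iris$Petal.Length, y = iris$Species)

fit_lda <- lda(y ~ x1 + x2, data = d)
fit_qda <- qda(y ~ x1 + x2, data = d)   # repeat the steps below with fit_qda for QDA

# Predict the class on a fine grid and draw the boundaries as contour lines
grid <- expand.grid(x1 = seq(min(d$x1), max(d$x1), length.out = 200),
                    x2 = seq(min(d$x2), max(d$x2), length.out = 200))
pred <- predict(fit_lda, grid)$class

plot(d$x1, d$x2, col = d$y, pch = 19,
     xlab = "Sepal.Length", ylab = "Petal.Length")
contour(unique(grid$x1), unique(grid$x2),
        matrix(as.numeric(pred), nrow = 200),
        levels = c(1.5, 2.5), add = TRUE, drawlabels = FALSE)
```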
Bootstrap the estimates of the \(\mu_j\). You can either:

- compute the group means of each bootstrap sample yourself, or
- fit `lda` on each bootstrap sample and read off the estimated means (they are stored in the fitted `lda` object).

Just choose whichever method you find most intuitive.
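A minimal sketch of the second option (the predictors, `B`, and the plotted variable are my choices):

```r
library(MASS)

set.seed(1)
B <- 1000
n <- nrow(iris)

# Row b: the three bootstrapped class means of Sepal.Length
boot_means <- t(replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  fit <- lda(Species ~ Sepal.Length + Petal.Length, data = iris[idx, ])
  fit$means[, "Sepal.Length"]   # the estimated mu_j are stored in fit$means
}))

boxplot(boot_means, ylab = "bootstrapped class means of Sepal.Length")
```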
Bootstrap the decision boundaries of LDA and QDA (use `lines.only=TRUE` and a transparent color for the lines).

Goal: Estimate \(Err= E[ \rho(Y^{new}, \hat{m}(X^{new}))]\)
Idea: use the bootstrap to approximate this expectation!
Procedure:

1. Draw a bootstrap sample \(Z_1^{*}, \dots, Z_n^{*}\) by sampling \(n\) observations with replacement from the data.
2. Fit the classifier on the bootstrap sample: \(\hat{m}^{*b}\).
3. Repeat for \(b = 1, \dots, B\).
4. Evaluate every bootstrap fit on all original observations and average: \(\hat{Err} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{n} \sum_{i=1}^{n} \rho(Y_i, \hat{m}^{*b}(X_i))\).
What is the probability that the original observation \((X_i, Y_i)\) is included in the bootstrap sample \(b\)?
\[ b^{*}= \{Z_1^{*}, \dots, Z_n^{*} \}, \quad Z_i^{*}=(X_i^{*}, Y_i^{*}) \] \[ \begin{align} P( Z_i \in b^{*}) &= 1- P(Z_i \notin b^{*})\\ &= 1- P( Z_i \neq Z_1^{*}) \cdot P( Z_i \neq Z_2^{*}) \cdots P( Z_i \neq Z_n^{*}) \quad \text{(the draws are independent)}\\ &= 1- P( Z_i \neq Z_1^{*})^n \\ &= 1- \Big(1-\frac{1}{n}\Big)^n\\ &\approx 1 - e^{-1} \approx 0.632 \end{align} \]
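A quick numerical check of this approximation (using \(n = 150\), the size of the iris data, as an example):

```r
n <- 150            # e.g. the number of observations in iris
1 - (1 - 1/n)^n     # exact probability for this n: about 0.633
1 - exp(-1)         # limiting value: about 0.632
```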
Solution: Out-of-bootstrap estimate (OOB)
Replace Step 4 in the procedure above by: evaluate each bootstrap fit only on the \(n^{OOB}\) observations that are not contained in the bootstrap sample \(b\).
What is \(n^{OOB}\) on average?
In every bootstrap sample there are on average only \(0.632 \cdot n\) distinct data points, so on average \(n^{OOB} \approx 0.368 \cdot n\).
Each bootstrap fit therefore sees less data than a model trained on the full data set and performs a bit worse. Therefore \(\hat{Err}^{OOB} > Err\): we overestimate the error…
Solution: combine the overoptimistic in-sample bootstrap error \(\hat{Err}\) with the pessimistic OOB estimate \(\hat{Err}^{OOB}\):
\[ \hat{Err}^{.632} = 0.368 \cdot \hat{Err} + 0.632 \cdot \hat{Err}^{OOB} \]
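Putting everything together, here is a sketch of the in-sample bootstrap, OOB, and .632 error estimates for LDA on the iris data with 0-1 loss (the model formula, the loss, and `B = 200` are my assumptions):

```r
library(MASS)

set.seed(1)
B <- 200
n <- nrow(iris)

err_in  <- numeric(B)   # in-sample bootstrap errors
err_oob <- numeric(B)   # out-of-bootstrap errors

for (b in 1:B) {
  idx  <- sample(n, n, replace = TRUE)               # bootstrap sample b
  fit  <- lda(Species ~ ., data = iris[idx, ])       # fit on the bootstrap sample
  pred <- predict(fit, iris)$class                   # predict all original observations
  err_in[b]  <- mean(pred != iris$Species)           # error on all original points
  oob <- setdiff(1:n, idx)                           # points not in bootstrap sample b
  err_oob[b] <- mean(pred[oob] != iris$Species[oob]) # error on the OOB points only
}

err_boot <- mean(err_in)    # overoptimistic in-sample bootstrap estimate
err_OOB  <- mean(err_oob)   # pessimistic OOB estimate
0.368 * err_boot + 0.632 * err_OOB   # the .632 estimate
```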