29 April 2016

Outline

  • LDA and QDA
  • walkthrough 1d example
  • Application to the iris dataset
  • Bootstrap generalization error

Task 1: derivation of LDA and QDA

1 dimensional example

Say we have a one-dimensional predictor variable \(x\) and an associated vector of class labels \(y \in \{1,2\}\).

   x      y
  -1.1    1
  -0.4    1
   0.9    2
   0.1   ??

Goal: given a new x predict the most likely label y.

1 dimensional example

Where should we put the boundary between the two classes?


1 dimensional example

The model:

\[ \begin{align} X \mid (Y=j) & \sim \mathcal{N}(\mu_j, \sigma_j^2)\\ P(Y=j) &= p_j\\ p_1 + p_2 &= 1 \end{align} \]

In words:

  • A priori there is a probability \(p_1\) that you come from class 1, and a probability \(p_2 = 1-p_1\) that you come from class 2.
  • If you belong to class \(j\), then your value \(x\) comes from a normal distribution with mean \(\mu_j\) and variance \(\sigma_j^2\).
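To make the model concrete, here is a minimal R sketch that simulates data from it (the prior, means and standard deviations below are arbitrary illustrative choices, not values from the slides):

    ## Simulate n observations from the two-class model
    set.seed(1)
    n     <- 100
    p1    <- 0.5                 # prior probability of class 1 (illustrative choice)
    mu    <- c(-0.8, 0.7)        # class means mu_1, mu_2
    sigma <- c(0.4, 0.5)         # class standard deviations (so variances sigma_j^2)
    y <- sample(1:2, n, replace = TRUE, prob = c(p1, 1 - p1))  # first draw the label Y
    x <- rnorm(n, mean = mu[y], sd = sigma[y])                 # then draw X | Y = j from N(mu_j, sigma_j^2)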

1 dimensional example

Each individual normal distribution:

1 dimensional example

You are tempted to just draw the separation like this, right?

1 dimensional example

Goal: given a new x predict the most likely label y.

  • The most likely \(y\) is the label \(j\) for which the posterior probability \(\pi_j(x) = P(Y=j \mid X=x)\) is the highest.

How do we compute \(\pi_j(x)\) for each possible class \(j\)?

Our previous attempt was wrong because in general:

\[ P(Y \mid X) \neq P(X \mid Y) \]

1 dimensional example

Bayes theorem

\[ P(Y=j | X ) = \frac{P(X|Y=j) \cdot p_j} {\sum_i P(X|Y=i) \cdot p_i} \]

Caution: Here \(P(X|Y=j)\) should be interpreted as a pdf!

1 dimensional example

Here we go: plot of \(P(X|Y=j) \cdot p_j\) for both classes

1 dimensional example

Ok, but how do we find this decision boundary exactly? Let's do it for our simple example, and you will do it for the general case in the exercise.

  1. Estimate \(p_j\), \(\mu_j\) and \(\sigma_j\) from the data.
  2. Compute the posterior probabilities \(\pi_j(x)\).
  3. Find the value (or set of values for more dimensions) for which \(\pi_1(x)=\pi_2(x)\).

1 dimensional example

Step 1 is quite straightforward.

Step 2 gives the following (dropping the denominator of Bayes' theorem, which is the same for every class and therefore does not change the comparison):

\[ \pi_j(x) \propto \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp \Big( -\frac{1}{2} \frac{(x-\mu_j)^2}{\sigma_j^2} \Big) \; p_j \]

For Step 3 you have different options… You can either solve directly for \(\pi_1(x)=\pi_2(x)\) or first construct the discriminant function \(\delta_j(x)\) and then compare them. This is really the same thing, but it helps to understand how the method works.
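As a small R sketch of the three steps, using simulated data as before (the three-point toy table is too small to estimate a class-specific variance); all numerical values are illustrative assumptions:

    ## Illustrative 1-d data: predictor x with class labels y in {1, 2}
    set.seed(1)
    y <- rep(1:2, each = 50)
    x <- rnorm(100, mean = c(-0.8, 0.7)[y], sd = c(0.4, 0.5)[y])

    ## Step 1: estimate p_j, mu_j and sigma_j from the data
    p_hat     <- table(y) / length(y)
    mu_hat    <- tapply(x, y, mean)
    sigma_hat <- tapply(x, y, sd)

    ## Step 2: (unnormalized) posteriors pi_j(x) = dnorm(x, mu_j, sigma_j) * p_j
    post <- function(x, j) dnorm(x, mu_hat[j], sigma_hat[j]) * p_hat[j]

    ## Step 3: find the x where pi_1(x) = pi_2(x), numerically with uniroot
    ## (with unequal variances there can be a second crossing far from the means;
    ##  here we look for the one between the two class means)
    uniroot(function(x) post(x, 1) - post(x, 2),
            interval = c(mu_hat[1], mu_hat[2]))$root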

1 dimensional example

Discriminant functions

We only need to find which of the posteriors has the maximum value at \(x\): \(\text{argmax}_j \, \pi_j(x)\).

Realize that: \[ \text{argmax}_j \, \pi_j(x) = \text{argmax}_j \, \log( \pi_j(x) ) = \text{argmax}_j \, \big(c + \log(\pi_j(x))\big), \] for any constant \(c\) that does not depend on \(j\).

So we are going to take the log of \(\pi_j(x)\) and drop all the terms that do not depend on \(j\).

1 dimensional example

Discriminant functions \[ \begin{align} \delta_j(x) &\propto \pi_j(x) \\ &\propto \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp \Big( -\frac{1}{2} \frac{(x-\mu_j)^2}{\sigma_j^2} \Big) \; p_j\\ &\propto - \log(\sqrt{2 \pi \sigma_j^2 }) - \frac{1}{2} (x-\mu_j)^2 / \sigma_j^2 + \log(p_j) \\ &\propto - \log(\sigma_j) - \frac{1}{2} (x-\mu_j)^2 / \sigma_j^2 + \log(p_j) \\ \end{align} \]
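The same idea written as a small R function (a sketch; mu, sigma and p are vectors with one entry per class, and the values in the example call are illustrative):

    ## delta_j(x) = -log(sigma_j) - 0.5 * (x - mu_j)^2 / sigma_j^2 + log(p_j)
    delta <- function(x, mu, sigma, p) {
      -log(sigma) - 0.5 * (x - mu)^2 / sigma^2 + log(p)
    }

    ## Classify x as the class j with the largest discriminant value
    classify <- function(x, mu, sigma, p) which.max(delta(x, mu, sigma, p))

    classify(0.1, mu = c(-0.8, 0.7), sigma = c(0.4, 0.5), p = c(0.5, 0.5))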

1 dimensional example

Let's plot these discriminant functions \(\delta_j(x)\):

1 dimensional example

LDA: if we assume that \(\sigma_1=\sigma_2=\sigma\), the discriminant function becomes linear in \(x\)!

\[ \begin{align} \delta_j(x) &\propto - \log(\sigma_j) - \frac{1}{2} (x-\mu_j)^2 / \sigma_j^2 + \log(p_j) \\ &\propto -\frac{1}{2} (x^2 - 2 x\mu_j + \mu_j^2) / \sigma^2 + \log(p_j) \\ &\propto -\frac{1}{2} (- 2 x\mu_j + \mu_j^2) / \sigma^2 + \log(p_j) \end{align} \]

(In the second line \(-\log(\sigma)\) is dropped because with a common \(\sigma\) it no longer depends on \(j\); in the third line \(x^2/\sigma^2\) is dropped for the same reason.)
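In the 1-dimensional case this makes the boundary easy to write down: setting \(\delta_1(x) = \delta_2(x)\) with a common \(\sigma\) and solving for \(x\) gives

\[ x^{*} = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2 \log(p_2/p_1)}{\mu_1 - \mu_2}, \]

so with equal priors (\(p_1 = p_2\)) the boundary is simply the midpoint between the two class means.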

1 dimensional example

LDA discriminant functions:

Extension to \(p\) dimensions

Your task in exercise 1 a) and b) is to find these discriminant functions for the general p-dimensional case. Just apply the same principle but be careful with the linear algebra.

For exercise 1 c) you need to characterize the boundary of LDA:

  • In our 1-dimensional example the boundary was just one point, but in \(p\) dimensions the boundary is a \((p-1)\)-dimensional hyperplane.
  • You can find it by following the same principle we used here: write down \(\delta_1(x) = \delta_2(x)\) and give the solution as a function of \(x\).
  • It should be of the form: \(B=\{x \; |\; x'w = a\}\).

Additional remarks

  • We usually estimate \(p_j\) as \(n_j/n\), but it does not have to be that way. You could also choose something else, say \(p_j=1/2\) for two classes, or try to find the \(p_j\) that produce the smallest CV error (see the short example after this list).
  • For those interested, there is an interesting connection between LDA and PCA: look at section 4.3.3 of the reference book.
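As a quick illustration of the first remark (previewing the iris data from the next task), MASS::lda lets you override the default prior \(n_j/n\) through its prior argument:

    library(MASS)
    ## Fit LDA with equal priors instead of the default class proportions n_j / n
    fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris,
               prior = c(1, 1, 1) / 3)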

Task 2: application to the iris dataset

The data

Class label: Iris Setosa, Iris Versicolor or Iris Virginica?

Predictors: petal/sepal length and width:

Task a)

  1. load the data and select 2 predictors.
  2. fit with lda and qda from the MASS package.
  3. plot the resulting decision boundary.

Remark: In step 3, plotting the decision boundary manually in the case of LDA is relatively easy: you can use the characterization of the boundary that we found in task 1c). However, a general strategy to plot more complex decision boundaries is to predict the class \(y\) for all combinations of predictors \(x_1\) and \(x_2\) on a fine grid, and then do a contour plot. The function predplot given to you in the skeleton does exactly this!
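A minimal sketch of task a), assuming we pick Petal.Length and Petal.Width as the two predictors; instead of predplot it draws the boundary with the generic grid-and-contour strategy described above:

    library(MASS)

    d <- iris[, c("Petal.Length", "Petal.Width", "Species")]

    fit_lda <- lda(Species ~ Petal.Length + Petal.Width, data = d)
    fit_qda <- qda(Species ~ Petal.Length + Petal.Width, data = d)

    ## Predict the class on a fine grid and draw the class boundaries as contours
    gx <- seq(min(d$Petal.Length), max(d$Petal.Length), length.out = 200)
    gy <- seq(min(d$Petal.Width),  max(d$Petal.Width),  length.out = 200)
    grid <- expand.grid(Petal.Length = gx, Petal.Width = gy)
    z <- matrix(as.numeric(predict(fit_lda, newdata = grid)$class), nrow = length(gx))

    plot(d$Petal.Length, d$Petal.Width, col = as.numeric(d$Species),
         xlab = "Petal.Length", ylab = "Petal.Width")
    contour(gx, gy, z, levels = c(1.5, 2.5), add = TRUE, drawlabels = FALSE)

The same plotting code works for fit_qda by swapping the fitted object in predict().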

Task b)

Bootstrap the estimates of the \(\mu_j\). You can either:

  • Follow the hints in the R skeleton which uses the estimates from lda.
  • Directly compute the bootstrap estimates of each individual mean (no need for lda).

Just choose whichever method you find most intuitive.
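A minimal sketch of the second option, bootstrapping the class means of one predictor directly (Petal.Length and B = 1000 are illustrative choices):

    set.seed(1)
    B <- 1000
    boot_means <- replicate(B, {
      boot <- iris[sample(nrow(iris), replace = TRUE), ]   # resample rows with replacement
      tapply(boot$Petal.Length, boot$Species, mean)        # mean of Petal.Length per species
    })
    ## boot_means is a 3 x B matrix: rows = species, columns = bootstrap replicates
    apply(boot_means, 1, sd)                                 # bootstrap standard errors
    apply(boot_means, 1, quantile, probs = c(0.025, 0.975))  # 95% bootstrap intervals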

Task c)

Bootstrap the decision boundaries of LDA and QDA:

  • In a for loop, fit the model using the bootstrapped data.
  • At each iteration, plot the resulting decision boundaries (using the arguments lines.only=TRUE and a transparent color for the lines).
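A sketch of the loop for LDA (QDA is analogous); predplot's exact interface is defined in the skeleton, so the boundaries are drawn here with the same grid-and-contour idea and a transparent line color:

    library(MASS)
    d  <- iris[, c("Petal.Length", "Petal.Width", "Species")]
    gx <- seq(min(d$Petal.Length), max(d$Petal.Length), length.out = 100)
    gy <- seq(min(d$Petal.Width),  max(d$Petal.Width),  length.out = 100)
    grid <- expand.grid(Petal.Length = gx, Petal.Width = gy)

    plot(d$Petal.Length, d$Petal.Width, col = as.numeric(d$Species))
    for (b in 1:100) {
      boot <- d[sample(nrow(d), replace = TRUE), ]                   # bootstrap sample
      fit  <- lda(Species ~ Petal.Length + Petal.Width, data = boot)
      z    <- matrix(as.numeric(predict(fit, newdata = grid)$class), nrow = length(gx))
      contour(gx, gy, z, levels = c(1.5, 2.5), add = TRUE,
              drawlabels = FALSE, col = rgb(0, 0, 0, alpha = 0.1))   # transparent lines
    }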

Bootstrap generalization error

Goal: Estimate \(Err= E[ \rho(Y^{new}, \hat{m}(X^{new}))]\)

  1. \(\hat{m}(x)\) is our LDA or QDA classifier.
  2. \(\rho(Y^{new}, Y^{pred})\) is the 0-1 loss, which is 0 when the prediction is correct and 1 otherwise.
  3. \(E[]\) indicates that we want the expectation over the data generating process.

Idea: use the bootstrap to approximate the expectation in point 3 above!

Bootstrap generalization error

Procedure:

  1. Generate bootstrapped data: \((X_1^{*}, Y_1^{*}), \dots, (X_n^{*}, Y_n^{*})\)
  2. Train the classifier \(\hat{m}^{*}(x)\) on the bootstrapped data.
  3. Predict on the original dataset \(Y_i^{pred}=\hat{m}^{*}(X_i)\).
  4. Evaluate mean loss over the \(n\) data points: \(\frac{1}{n} \sum_{i=1}^n \rho(Y_i, Y_i^{pred})\).
  5. Repeat steps 1-4 \(B\) times and estimate \(\hat{Err}\) as the average of the \(B\) mean losses.
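A minimal sketch of this procedure for LDA on the iris data (B = 200 and the two predictors are illustrative choices):

    library(MASS)
    d <- iris[, c("Petal.Length", "Petal.Width", "Species")]
    B <- 200
    err <- numeric(B)
    for (b in 1:B) {
      idx  <- sample(nrow(d), replace = TRUE)                             # step 1: bootstrap indices
      fit  <- lda(Species ~ Petal.Length + Petal.Width, data = d[idx, ])  # step 2: train on bootstrap data
      pred <- predict(fit, newdata = d)$class                             # step 3: predict on the ORIGINAL data
      err[b] <- mean(pred != d$Species)                                   # step 4: mean 0-1 loss
    }
    err_hat <- mean(err)                                                  # step 5: average over B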

Any problem with that?

Bootstrap generalization error

What is the probability that the original observation \((X_i, Y_i)\) is included in the bootstrap sample \(b\)?

\[ b^{*}= \{Z_1^{*}, \dots, Z_n^{*} \}, \quad Z_i^{*}=(X_i^{*}, Y_i^{*}) \] \[ \begin{align} P( Z_i \in b^{*}) &= 1- P(Z_i \notin b^{*})\\ &= 1- P( Z_i \neq Z_1^{*}) \cdot P( Z_i \neq Z_2^{*}) \cdots P( Z_i \neq Z_n^{*})\\ &= 1- P( Z_i \neq Z_1^{*})^n \\ &= 1- \Big(1-\frac{1}{n}\Big)^n\\ &\approx 1 - e^{-1} \approx 0.632 \end{align} \]
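A quick numerical check of this limit in R:

    n <- 100
    1 - (1 - 1/n)^n   # ~ 0.634 for n = 100
    1 - exp(-1)       # limiting value, ~ 0.632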

Danger of over-optimism (\(\hat{Err} < Err\))

Bootstrap generalization error

Solution: Out-of-bootstrap estimate (OOB)

Replace Step 4 in the procedure above by:

  4. Evaluate the mean loss over the \(n^{OOB}\) data points which are not in \(b^{*}\): \[ \frac{1}{n^{OOB}} \sum_{i=1}^n 1_{(Z_i \notin b^{*})} \; \rho(Y_i, Y_i^{pred}). \]

What is \(n^{OOB}\) on average?

Bootstrap generalization error

In every bootstrap sample there are on average only \(0.632 \cdot n\) distinct data points, so on average \(n^{OOB} \approx 0.368 \cdot n\).

Because each bootstrap classifier is trained on effectively fewer distinct observations than \(\hat{m}\), it performs a bit worse. Therefore \(\hat{Err}^{OOB} > Err\): we overestimate the error…

Solution: combine the overoptimistic in-sample bootstrap error \(\hat{Err}\) with the pessimistic OOB estimate \(\hat{Err}^{OOB}\):

\[ \hat{Err}^{.632} = 0.368 \cdot \hat{Err} + 0.632 \cdot \hat{Err}^{OOB} \]
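A sketch of the OOB and .632 estimates, extending the in-sample bootstrap loop above (same illustrative setup):

    library(MASS)
    d <- iris[, c("Petal.Length", "Petal.Width", "Species")]
    B <- 200
    err_in <- err_oob <- numeric(B)
    for (b in 1:B) {
      idx <- sample(nrow(d), replace = TRUE)
      oob <- setdiff(seq_len(nrow(d)), idx)                  # observations not in the bootstrap sample
      fit <- lda(Species ~ Petal.Length + Petal.Width, data = d[idx, ])
      err_in[b]  <- mean(predict(fit, newdata = d)$class != d$Species)
      err_oob[b] <- mean(predict(fit, newdata = d[oob, ])$class != d$Species[oob])
    }
    err_632 <- 0.368 * mean(err_in) + 0.632 * mean(err_oob)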

Thank you! Questions?