Principal Components Analysis (PCA) is a well-established method of dimension reduction. It is often used as a means of gaining insight into the “hidden meanings” in a dataset. But in prediction contexts–ML–it is mainly a technique for avoiding overfitting and excess computation.

This tutorial will give a more concrete, more real-world-oriented, overview of PCA than those given in most treatments.

A number of “nonlinear versons” of PCA have been developed, including Uniform Manifold Approximation and Projection (UMAP), which we will also discuss briefly here.

Facility in matrix algebra is needed to fully understand the treatment here, but those without such background should be able to follow most of it. The contents of the Overfitting vignette are also helpful here.

**All PCA does is form new columns from the original columns
of our data.** Those new columns form our new feature set.
It’s that simple.

Each new column is some linear combination of our original columns. Moreover, the new columns are uncorrelated with each other, the importance of which we will also discuss.

Let’s see concretely what all that really means.

Consider **mlb**, a dataset included with **qeML**. We will look at
heights, weights and ages of American professional baseball players.
(We will delete the first column, which records position played.)

```
> data(mlb1)
> mlb <- mlb1[,-1]
> head(mlb)
Height Weight Age
1 74 180 22.99
2 74 215 34.69
3 72 210 30.78
4 72 210 35.43
5 73 188 35.71
6 69 176 29.39
> dim(mlb)
[1] 1015 3
# 1015 players, 3 measurements each
# PCA requires matrix format
> mlb <- as.matrix(mlb)
```

The standard PCA function in R is **prcomp**:

```
# defaults center=TRUE, scale.=FALSE)
> z <- prcomp(mlb)
```

We will look at the contents of **z** shortly. But first, it is key to
note that all that is happening is that *we started with 3 variables*,
Height, Weight and Age, and now *have created 3 new variables*, **PC1**,
**PC2** and **PC3**. Those new variables are stored in the matrix
**z$x**. Again, we will see the details below, but the salient point
is: 3 new features.

We originally had 3 measurements on each of 1015 people, and now we have 3 new measurements on each of those people:

```
> dim(z$x)
[1] 1015 3
```

Well then, what *is* in there in **z**?

```
> str(z)
List of 5
$ sdev : num [1:3] 20.87 4.29 1.91
$ rotation: num [1:3, 1:3] -0.0593 -0.9978 -0.0308 -0.1101 -0.0241 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "Height" "Weight" "Age"
.. ..$ : chr [1:3] "PC1" "PC2" "PC3"
$ center : Named num [1:3] 73.7 201.3 28.7
..- attr(*, "names")= chr [1:3] "Height" "Weight" "Age"
$ scale : logi FALSE
$ x : num [1:1015, 1:3] 21.46 -13.82 -8.6 -8.74 13.14 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:1015] "1" "2" "3" "4" ...
.. ..$ : chr [1:3] "PC1" "PC2" "PC3"
- attr(*, "class")= chr "prcomp"
```

Let’s look at **rotation** first.

As noted, each principal component (PC) is a linear combination
of the input features. These coefficients are stored in **rotation**:

```
> z$rotation
PC1 PC2 PC3
Height -0.05934555 -0.11013610 -0.99214321
Weight -0.99776279 -0.02410352 0.06235738
Age -0.03078194 0.99362420 -0.10845927
```

For instance,

PC2 = -0.11 Height - 0.02 Weight + 0.99 Age

As noted, those 3 new variables are stored in the **x** component of
**z**. For instance, consider the first person in the dataset:

```
> mlb[1,]
Height Weight Age
74.00 180.00 22.99
```

In the new data, his numbers are:

```
> z$x[1,]
PC1 PC2 PC3
21.458611 -5.201495 -1.018951
```

Let’s check! Since the PCi are linear combinations of the original columns, we can compute them via matrix multiplication. Let’s do so for PC2, say for the first row of the data.

```
# remember, prcomp did centering, so we need it here
> mlbc <- center(mlb)
> mlbc[1,] %*% z$rotation[,2]
[,1]
[1,] -5.201495
```

Ah yes, same as above.

The key properties of PCA are that the PCs

are arranged in order of decreasing variances, and

they are uncorrelated.

The variances (actually standard deviations) are reported in the return
object from **prcomp**:

```
> z$sdev
[1] 20.869206 4.288663 1.911677
```

Yes, (a) holds. Let’s double check, say for PC2:

```
> sd(z$x[,2])
[1] 4.288663
```

Yes.

What about (b)?

```
> cor(z$x)
PC1 PC2 PC3
PC1 1.000000e+00 -1.295182e-16 2.318554e-15
PC2 -1.295182e-16 1.000000e+00 2.341867e-16
PC3 2.318554e-15 2.341867e-16 1.000000e+00
```

Yes indeed, those new columns are uncorrelated.

The reader of this document has probably seen properties (a) and (b)
before. *But why are they so important?*

Many data analysts, e.g. social scientists, use PCA to search for patterns in the data. In the ML context, though, our main interest is prediction.

Our focus:

If we have a large number of predictor variables, we would like to reduce that number, in order to avoid overfitting, reduce computation and so on. PCA can help us do that. Properties (a) and (b) will play a central role in this.

Toward that end, we will first introduce another dataset, and then
discuss *dimension reduction*–reducing the number of predictor
variables–in the context of that data. That will lead us to the
importance of properties (a) and (b).

Here we will use another built-in dataset in **qeML**, a song database
named **fiftyksongs**. It is a 50,000-row random subset of the famous Million
Song Dataset.

The first column of the data set is the year of release of the song, while the other 90 are various audio measurements. The goal is to predict the year.

```
> dim(fiftyksongs)
[1] 50000 91
> w <- prcomp(fiftyksongs[,-1]) # first column is "Y", to be predicted
> w$sdev
[1] 2127.234604 1168.717654 939.840843 698.575319 546.683262 464.683454
[7] 409.785038 395.928095 380.594444 349.489142 333.322277 302.017413
[13] 282.819445 260.362550 255.472674 248.401464 235.939740 231.404983
[19] 220.682026 194.828458 193.645669 189.074051 187.455170 180.727969
[25] 173.956554 166.733909 156.612298 151.194556 144.547790 138.820897
[31] 133.966493 124.514162 122.785528 115.486330 112.819657 110.379903
[37] 109.347994 106.551231 104.787668 99.726851 99.510556 97.599960
[43] 93.161508 88.559160 87.453436 86.870468 82.452985 80.058511
[49] 79.177031 75.105451 72.542646 67.696172 64.079955 63.601079
[55] 61.105579 60.104226 56.656737 53.166604 52.150838 50.515730
[61] 47.954210 47.406341 44.272814 39.914361 39.536682 38.653450
[67] 37.228741 36.007748 34.192456 29.523751 29.085855 28.387604
[73] 26.325406 24.763188 22.192984 20.203667 19.739706 18.453111
[79] 14.238237 13.935897 10.813426 9.659868 8.938295 7.725284
[85] 6.935969 6.306459 4.931680 3.433850 3.041469 1.892496
```

One hopes to substantially reduce the number of predictors from 90. But how many should we retain? And which ones?

There are 2^{90} possible sets of predictors to use. It of
And given the randomness and the huge number of sets, the odds are
high that the “best”-predicting set is just an accident, not the actual
best and maybe not even close to the best.
course would be out of the question to check them all. So, we might
consider the below predictor sets, say following the order of the
features (which are named **V2**, **V3**, **V4**,…)

**V2** alone;
**V2** and **V3**;
**V2** and **V3** and **V4**;
**V2** and **V3** and **V4** and **V5**;
etc.

Now we have only 90 predictor sets to check–a lot, but far better than
2^{90}. Yet there are two problems that would arise:

Presumably the

**Vi**are not arranged in order of importance as While the set**V12**,**V13**,…,**V88**would include these three variables, we may risk overfitting, masking the value of these three.. predictors. What if, say,**V12**,**V28**and**V88**make for an especially powerful predictor set? The scheme considered here would never pick that up.Possibility of substantial duplication: What if, say,

**V2**and**V3**are very highly correlated? Then once**V2**is in our predictor set, we probably would not want to include**V3**; we are trying to find a parsimonious predictor set, and inclusion of (near-)duplicates would defeat the purpose. Our second candidate set above would be**V2**and**V4**, the third would be**V2**and**V4**and**V5**; and so on. We may wish to skip some other**Vi**as well. Checking for such correlation at every step would be cumbersome and time-consuming.

Both problems are addressed by using the PCs **Pi** instead of the
original variables **Vi**. We then consider these predictor sets,

**P1** alone;
**P1** and **P2**;
**P1** and **P2** and **P3**;
**P1** and **P2** and **P3** and **P4**;
etc.

What does this buy us?

Recall that Var(P

_{i}) is decreasing in i (technically nonincreasing). For large i, Var(P_{i}) is typically tiny; in the above example, for instance, Var(P49) / Var(P1) is only about 0.0014. And a random variable with small variance is essentially constant, thus of no value as a predictor.By virtue of their uncorrelated nature, the

**Pi**basically do not duplicate each other. While it is true that uncorrelatedness does not necessarily imply independence, again we have a reasonable solution to the duplication problem raised earlier.

Again, PCA forms new variables that are *linear* functions of the
original ones. That can be quite useful, but possibly constraining.
In recent years, other dimension reduction method have become popular,
notably t-SNE and UMAP. Let’s take a very brief look at the latter.

```
library(umap)
mypars <- umap.defaults
mypars$n_components <- 6
umOut <- umap(fiftyksongs[,-1],
config=mypars)
```

The new variables will then be returned in **umOut\(layout**, analogous to our **z\)x** above. We will now have a 50000 x 6 matrix, replacing
our original 50000 x 90 data.

So, what does UMAP actually do? The math is quite arcane; even the basic assumption, “uniform distribution on a manifold,” is beyond the scope of this tutorial.

But roughly speaking, the goal is to transform the original data,
dimension 90 here, to a lower-dimensional data set (6 here) in such a
way that “local” structure is retained. The latter condition means that
rows in the data that were neighbors of each other in the original data
are likely still neighbors in the new data, subject to the
hyperparameter **n_neighbors**: a data point **v** counts as a neighbor
of **u** only if **v** is among the **n_neighbors** points to **u**.

In terms of the Bias-Variance Tradeoff, smaller values of
**n_neighbors** reduce bias while increasing variance, in a similar
manner to the value of **k** in the k-Nearest Neighbors predictive
method.