
Section 7.4 Principal Component Analysis

We are sometimes presented with a dataset that includes many variables for each observational unit. When this happens, the case vectors (rows of the column-variate data matrix or columns of the row-variate data matrix) live in a high dimensional space. For instance, we looked at a dataset describing body fat index (BFI) in Activity 6.5.4 where each case vector is six-dimensional. Developing an intuitive understanding of such data is hampered by the fact that it is challenging to visualize.
This section explores a technique called principal component analysis, which enables us to reduce the dimension of a dataset so that it may be visualized or studied in a way that makes interesting features more readily stand out. Our previous work with variance and the orthogonal diagonalization of symmetric matrices provides the key ideas.

Preview Activity 7.4.1.

We will begin by recalling our earlier discussion of variance. Suppose we have a dataset that leads to the covariance matrix
\begin{equation*} S = \begin{bmatrix} 7 \amp -4 \\ -4 \amp 13 \end{bmatrix}. \end{equation*}
  1. Suppose that \(\uvec\) is a unit eigenvector of \(S\) with eigenvalue \(\lambda\text{.}\) What is the variance \(V_{\uvec}\) in the \(\uvec\) direction?
  2. Find an orthogonal diagonalization of \(S\text{.}\)
  3. What is the total variance?
  4. In which direction is the variance greatest and what is the variance in this direction? If we project the data onto this line, how much variance is lost?
  5. In which direction is the variance smallest and how is this direction related to the direction of maximum variance?
Solution.
  1. \(V_{\uvec} = \uvec\cdot(S\uvec) = \lambda\uvec\cdot\uvec = \lambda\text{.}\)
  2. We can write \(S=QDQ^{\transpose}\) where
    \begin{equation*} D=\begin{bmatrix} 15 \amp 0 \\ 0 \amp 5 \\ \end{bmatrix},~~~ Q = \begin{bmatrix} \frac1{\sqrt{5}} \amp \frac2{\sqrt{5}} \\ -\frac2{\sqrt{5}} \amp \frac1{\sqrt{5}} \\ \end{bmatrix}. \end{equation*}
  3. The total variance is the sum of the eigenvalues, \(V=\lambda_1 + \lambda_2 = 15 + 5 = 20\text{.}\)
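  4. The variance is greatest in the direction of the eigenvector associated to the largest eigenvalue. This direction is defined by \(\twovec{\frac{1}{\sqrt{5}}}{-\frac2{\sqrt{5}}}\text{,}\) and the variance is 15 in this direction. Projecting the data onto this line retains a variance of 15 out of the total variance of 20, so a variance of 5 is lost.
  5. The variance is smallest in the direction defined by \(\twovec{\frac2{\sqrt{5}}}{\frac1{\sqrt{5}}}\text{,}\) where it equals 5. This direction is orthogonal to the direction of maximum variance.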
Here are some ideas we’ve seen previously that will be particularly useful for us in this section. Remember that the covariance matrix of a dataset is \(S=\frac 1{n-1} X^{\transpose}X\) where \(X\) is the column-variate matrix of \(n\) demeaned data.
  • When \(\uvec\) is a unit vector, the variance of the demeaned data after projecting onto the line defined by \(\uvec\) is given by the quadratic form \(V_{\uvec} = \uvec\cdot(S\uvec)\text{.}\)
  • In particular, if \(\uvec\) is a unit eigenvector of \(S\) with associated eigenvalue \(\lambda\text{,}\) then \(V_{\uvec} = \lambda\text{.}\)
  • Moreover, variance is additive, as we recorded in Proposition 7.1.10: if \(W\) is a subspace having an orthonormal basis \(\uvec_1,\uvec_2,\ldots,\uvec_n\text{,}\) then the variance
    \begin{equation*} V_W = V_{\uvec_1} + V_{\uvec_2} + \ldots + V_{\uvec_n}\text{.} \end{equation*}
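The three facts above can be checked numerically on the covariance matrix from the preview activity. The following is a minimal sketch rather than a prescribed method: the helper name `variance_in_direction` is ours, and we assume `numpy.linalg.eigh`, a routine meant for symmetric matrices that returns eigenvalues in increasing order.

```python
import numpy as np

# Covariance matrix from Preview Activity 7.4.1
S = np.array([[7.0, -4.0],
              [-4.0, 13.0]])

def variance_in_direction(S, u):
    """Variance of the demeaned data projected onto the line spanned by u.

    (Hypothetical helper for illustration; u is normalized first because
    the quadratic form u . (S u) assumes a unit vector.)
    """
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    return u @ S @ u

# eigh is designed for symmetric matrices; eigenvalues come back in increasing order
eigenvalues, eigenvectors = np.linalg.eigh(S)
u_small, u_large = eigenvectors[:, 0], eigenvectors[:, 1]

V_small = variance_in_direction(S, u_small)   # variance along the smaller eigenvalue's direction
V_large = variance_in_direction(S, u_large)   # variance along the larger eigenvalue's direction
total = np.trace(S)                           # total variance is the sum of the diagonal entries

print("eigenvalues:", eigenvalues)
print("variances along eigenvectors:", V_small, V_large)
print("sum of the two variances:", V_small + V_large, "total variance:", total)
```

Because the two unit eigenvectors form an orthonormal basis of \(\real^2\text{,}\) additivity says the two directional variances should add up to the total variance.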

Subsection 7.4.1 Themes of Principal Component Analysis

Let’s begin by looking at an example that illustrates the central theme of this technique.

Activity 7.4.2.

Suppose that we work with a dataset having 100 observations of 5 variables. The demeaned column-variate data matrix \(X\) is therefore \(100 \by 5\) and leads to the covariance matrix \(S=\frac1{99}~X^{\transpose}X\text{,}\) which is a \(5\by5\) matrix. Because \(S\) is symmetric, the Spectral Theorem tells us it is orthogonally diagonalizable so suppose that \(S = QDQ^{\transpose}\) where
\begin{equation*} Q = \begin{bmatrix} \uvec_1 \amp \uvec_2 \amp \uvec_3 \amp \uvec_4 \amp \uvec_5 \end{bmatrix},\hspace{24pt} D = \begin{bmatrix} 13 \amp 0 \amp 0 \amp 0 \amp 0 \\ 0 \amp 10 \amp 0 \amp 0 \amp 0 \\ 0 \amp 0 \amp 2 \amp 0 \amp 0 \\ 0 \amp 0 \amp 0 \amp 0 \amp 0 \\ 0 \amp 0 \amp 0 \amp 0 \amp 0 \end{bmatrix}. \end{equation*}
  1. What is \(V_{\uvec_2}\text{,}\) the variance in the \(\uvec_2\) direction?
  2. Find the variance of the data projected onto the line defined by \(\uvec_4\text{.}\) What does this say about the data?
  3. What is the total variance of the data?
  4. Consider the 2-dimensional subspace spanned by \(\uvec_1\) and \(\uvec_2\text{.}\) If we project the data onto this subspace, what fraction of the total variance is represented by the variance of the projected data?
  5. How does this question change if we project onto the 3-dimensional subspace spanned by \(\uvec_1\text{,}\) \(\uvec_2\text{,}\) and \(\uvec_3\text{?}\)
  6. What does this tell us about the data?
Answer.
  1. \(\displaystyle 10\)
  2. \(0\text{,}\) which tells us every case vector is in the orthogonal complement of \(\uvec_4\text{.}\)
  3. \(\displaystyle 25\)
  4. \(92\%\) of the variance
  5. \(100\%\) of the variance.
  6. All of the data lies in the \(3\)-dimensional subspace spanned by \(\uvec_1\text{,}\) \(\uvec_2\text{,}\) and \(\uvec_3\text{.}\)
Solution.
  1. \(\displaystyle V_{\uvec_2} = \lambda_2 = 10\)
  2. \(V_{\uvec_4} = \lambda_4 = 0\text{,}\) which tells us there is no variance in the \(\uvec_4\) direction. Therefore, when we project onto the line defined by \(\uvec_4\text{,}\) every case vector projects to \(\zerovec\) so every case vector is in the orthogonal complement of \(\uvec_4\text{.}\)
  3. \(V = V_{\uvec_1} + V_{\uvec_2} + V_{\uvec_3} + V_{\uvec_4} + V_{\uvec_5} = 13+10+2+0+0 = 25\text{.}\)
  4. The variance of the data projected onto this subspace is \(13+10=23\text{,}\) which represents \(23/25=92\%\) of the variance.
  5. Projecting onto this 3-dimensional subspace retains all of the variance.
  6. All of the data lies in the \(3\)-dimensional subspace spanned by \(\uvec_1\text{,}\) \(\uvec_2\text{,}\) and \(\uvec_3\text{.}\)
This activity demonstrates how the eigenvalues of the covariance matrix can tell us when data are clustered around, or even wholly contained within, a smaller dimensional subspace. In particular, the original data is 5-dimensional, but we see that it actually lies in a 3-dimensional subspace of \(\real^5\text{.}\) Later in this section, we’ll see how to use this observation to work with the data as if it were three-dimensional, an idea known as dimensional reduction.
The eigenvectors \(\uvec_j\) of the covariance matrix are called principal components, and we will order them so that their associated eigenvalues decrease. Generally speaking, we hope that the first few principal components retain most of the variance, as the example in the activity demonstrates. In that example, we have the sequence of subspaces
  • \(W_1\text{,}\) the 1-dimensional subspace spanned by \(\uvec_1\text{,}\) which retains \(13/25 = 52\%\) of the total variance,
  • \(W_2\text{,}\) the 2-dimensional subspace spanned by \(\uvec_1\) and \(\uvec_2\text{,}\) which retains \(23/25 = 92\%\) of the variance, and
  • \(W_3\text{,}\) the 3-dimensional subspace spanned by \(\uvec_1\text{,}\) \(\uvec_2\text{,}\) and \(\uvec_3\text{,}\) which retains all of the variance.
Notice how we retain more of the total variance as we increase the dimension of the subspace onto which the data are projected. Eventually, projecting the data onto \(W_3\) retains all the variance, which tells us the data must lie in \(W_3\text{,}\) a smaller dimensional subspace of \(\real^5\text{.}\)
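The retained fractions for \(W_1\text{,}\) \(W_2\text{,}\) and \(W_3\) are cumulative sums of the eigenvalues divided by the total variance. Here is a short sketch of that arithmetic using the eigenvalues from Activity 7.4.2; the variable names are our own.

```python
import numpy as np

# Diagonal entries of D in Activity 7.4.2
eigenvalues = np.array([13.0, 10.0, 2.0, 0.0, 0.0])

# Sort in decreasing order (already sorted here, but eigenvalue routines
# do not always return them this way)
eigenvalues = np.sort(eigenvalues)[::-1]

total_variance = eigenvalues.sum()
# Entry p-1 is the fraction of variance retained by projecting onto W_p
retained_fraction = np.cumsum(eigenvalues) / total_variance

for p, frac in enumerate(retained_fraction, start=1):
    print(f"W_{p} retains {frac:.0%} of the total variance")
```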
In fact, these subspaces are the best possible. We know that the first principal component \(\uvec_1\) is the eigenvector of \(S\) associated to the largest eigenvalue. This means that the variance is as large as possible in the \(\uvec_1\) direction. In other words, projecting onto any other line will retain a smaller amount of variance. Similarly, projecting onto any other 2-dimensional subspace besides \(W_2\) will retain less variance than projecting onto \(W_2\text{.}\) The principal components have the wonderful ability to pick out the best possible subspaces to retain as much variance as possible.
Of course, this is a contrived example. Typically, the presence of noise in a dataset means that we do not expect all the points to be wholly contained in a smaller dimensional subspace. One situation where this does occur, however, is when some of the variables in the data set are computed as linear combinations of other variables. This could happen, for example, if our data had measurements in two different units (say, Fahrenheit and Celsius temperatures) or if the data included both subtotals and totals in a sum (e.g., before tax bill, tax, and total with tax included).
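To illustrate the temperature example, the sketch below builds a small made-up dataset in which one variable is computed from another. All of the numbers are synthetic and the variable names are hypothetical. Although Fahrenheit is not a scalar multiple of Celsius, the demeaned Fahrenheit values are exactly \(1.8\) times the demeaned Celsius values, so we would expect one eigenvalue of the covariance matrix to be zero up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic measurements (made up for illustration)
celsius = rng.normal(loc=20.0, scale=5.0, size=100)
humidity = rng.normal(loc=50.0, scale=10.0, size=100)
fahrenheit = 1.8 * celsius + 32.0        # completely determined by celsius

data = np.column_stack([celsius, fahrenheit, humidity])
X = data - data.mean(axis=0)             # demean each column
S = X.T @ X / (X.shape[0] - 1)           # 3 x 3 covariance matrix

# eigvalsh returns the eigenvalues of a symmetric matrix in increasing order
eigenvalues = np.linalg.eigvalsh(S)[::-1]
print("eigenvalues in decreasing order:", eigenvalues)
```

The smallest eigenvalue should be negligible compared with the other two, which tells us the three-dimensional case vectors lie in a plane.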
In this example, the 2-dimensional subspace \(W_2\) retains \(92\%\) of the variance. Depending on the situation, we may want to write off the remaining \(8\%\) of the variance as noise in exchange for the convenience of working with a smaller dimensional subspace. As we’ll see later, we will seek a balance using a number of principal components large enough to retain most of the variance but small enough to be easy to work with.

Activity 7.4.3.

We will work here with a demeaned dataset having 100 observations of 3 variables. Evaluating the following cell will create the demeaned column-variate data matrix \(X\) and plot the data as a 3-d scatter plot.
Notice that the data appears to cluster around a plane though it does not seem to be wholly contained within that plane.
  1. Use the matrix X to construct the covariance matrix \(S\text{.}\) Then determine the variance in the direction of \(\uvec=\threevec{1/3}{2/3}{2/3}\text{.}\)
  2. Find the eigenvalues of \(S\) and determine the total variance.
    Notice that Python does not necessarily sort the eigenvalues in decreasing order.
  3. Use the numpy.linalg.eig() command to find the eigenvectors of \(S\text{.}\) Define vectors u1, u2, and u3 representing the three principal components in order of decreasing eigenvalues. How can you check if these vectors are an orthonormal basis for \(\real^3\text{?}\)
  4. What fraction of the total variance is retained by projecting the data onto \(W_1\text{,}\) the subspace spanned by \(\uvec_1\text{?}\) What fraction of the total variance is retained by projecting onto \(W_2\text{,}\) the subspace spanned by \(\uvec_1\) and \(\uvec_2\text{?}\) What fraction of the total variance do we lose by projecting onto \(W_2\text{?}\)
  5. Each column of \(X^{\transpose}\) (each row of \(X\)) represents one observational unit in our data. We will refer to these as case vectors. In a traditional scatter plot (or cloud plot in 3 dimensions), each case vector is represented by one of the dots. If we project each case vector \(\xvec\) onto \(W_2\text{,}\) the Projection Formula tells us we obtain
    \begin{equation*} \xhat = (\uvec_1\cdot\xvec) \uvec_1 + (\uvec_2\cdot\xvec) \uvec_2. \end{equation*}
    Rather than viewing the projected data in \(\real^3\text{,}\) we will record the coordinates of \(\xhat\) in the basis defined by \(\uvec_1\) and \(\uvec_2\text{;}\) that is, we will record the coordinates
    \begin{equation*} \twovec{\uvec_1\cdot\xvec}{\uvec_2\cdot\xvec}. \end{equation*}
    Construct the matrix \(Q\) so that \(Q^{\transpose}\xvec = \twovec{\uvec_1\cdot\xvec}{\uvec_2\cdot\xvec}\text{.}\)
  6. Since each column of \(X^{\transpose}\) represents one observational unit, the matrix \(Q^{\transpose}X^{\transpose}\) is a row-variate representation of the data projected onto a lower-dimensional subspace identified by PCA. We can transpose again to get the column-variate representation \(X Q\text{.}\)
    Notice how this plot enables us to view the data as if it were two-dimensional. Why is this plot wider than it is tall?
  7. The scikit-learn library provides another way to compute a specified number of the leading (highest-variance) principal components from a data matrix.
Answer.
  1. \(\displaystyle V_{\uvec} = 7885\)
  2. \(\displaystyle V=12195\)
  3. If \(P\) is the matrix of eigenvectors, evaluate \(P^{\transpose}P\text{.}\)
  4. \(W_1\) retains \(83\%\) of the total variance, and \(W_2\) retains \(98\%\text{.}\)
  5. \(\displaystyle Q=\begin{bmatrix}\uvec_1\amp\uvec_2\end{bmatrix}\)
  6. Because the variance in the \(\uvec_1\) direction is greater than the variance in the \(\uvec_2\) direction.
Solution.
  1. After constructing the covariance matrix \(S = \frac{1}{99}X^{\transpose}X\text{,}\) we find that \(V_{\uvec} = \uvec\cdot(S\uvec) \approx 7965\text{.}\)
  2. The total variance \(V\) is the sum of the eigenvalues of \(S\) so we obtain \(V=12195\text{.}\)
  3. If we obtain \(P\text{,}\) the matrix of eigenvectors, from Python, computing \(P^{\transpose}P\) evaluates the dot products between the columns. Since \(P^{\transpose}P=I\text{,}\) the basis provided by Python is orthonormal.
  4. Projecting onto \(W_1\text{,}\) we see that \(\lambda_1/V = 0.83\) so \(W_1\) retains about \(83\%\) of the total variance. The subspace \(W_2\) retains \((\lambda_1+\lambda_2)/V=0.98\) or \(98\%\) of the total variance. If we project onto \(W_2\) we lose less than \(2\%\) of the variance.
  5. \(\displaystyle Q=\begin{bmatrix}\uvec_1\amp\uvec_2\end{bmatrix}\)
  6. The plot is wider because the variance in the \(\uvec_1\) direction, which corresponds to the horizontal coordinate, is greater than the variance in the \(\uvec_2\) direction.
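The first few steps of this activity might be sketched as follows. The activity's data cell is not reproduced here, so the matrix \(X\) below is a hypothetical stand-in (synthetic points scattered near a plane), and the numbers it produces will not match the answers above. The sketch sorts the output of numpy.linalg.eig() and checks orthonormality by computing \(P^{\transpose}P\text{.}\)

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the activity's data: 100 points near a plane in R^3
basis = np.array([[2.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0]])
coeffs = rng.normal(size=(100, 2)) * np.array([10.0, 4.0])
noise = rng.normal(scale=0.5, size=(100, 3))
data = coeffs @ basis + noise
X = data - data.mean(axis=0)             # demeaned column-variate data matrix

n = X.shape[0]
S = X.T @ X / (n - 1)                    # covariance matrix

# Variance in the direction of the unit vector u = (1/3, 2/3, 2/3)
u = np.array([1.0, 2.0, 2.0]) / 3.0
V_u = u @ S @ u

# eig does not promise any particular ordering, so sort in decreasing order
eigenvalues, P = np.linalg.eig(S)        # columns of P are eigenvectors
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
P = P[:, order]
u1, u2, u3 = P[:, 0], P[:, 1], P[:, 2]   # principal components

# The columns are orthonormal exactly when P^T P is the identity matrix
print(np.round(P.T @ P, 6))
print("variance in the u direction:", V_u)
print("fraction retained by W_1, W_2, W_3:", np.cumsum(eigenvalues) / eigenvalues.sum())
```

When eigenvalues are repeated, numpy.linalg.eig() need not return orthogonal eigenvectors; numpy.linalg.eigh(), which is intended for symmetric matrices, is one alternative in that situation.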
This example is a more realistic illustration of principal component analysis. The plot of the 3-dimensional data appears to show that the data lies close to a plane, and the principal components will identify this plane. Starting with the \(100\by3\) matrix of demeaned data \(X\text{,}\) we construct the covariance matrix \(S=\frac{1}{99} ~X^{\transpose}X\) and study its eigenvalues. Notice that the first two principal components account for more than 98% of the variance, which means we can expect the case vectors to lie close to \(W_2\text{,}\) the two-dimensional subspace spanned by \(\uvec_1\) and \(\uvec_2\text{.}\)
Since \(W_2\) is a subspace of \(\real^3\text{,}\) projecting the case vectors onto \(W_2\) gives a list of 100 points in \(\real^3\text{.}\) In order to visualize them more easily, we instead consider the coordinates of the projections in the basis defined by \(\uvec_1\) and \(\uvec_2\text{.}\) For instance, we know that the projection of a case vector \(\xvec\) is
\begin{equation*} \xhat = (\uvec_1\cdot\xvec)\uvec_1 + (\uvec_2\cdot\xvec)\uvec_2, \end{equation*}
which is a three-dimensional vector. Instead, we can record the coordinates \(\twovec{\uvec_1\cdot\xvec}{\uvec_2\cdot\xvec}\) and plot them in the two-dimensional coordinate plane, as illustrated in Figure 7.4.1.
Figure 7.4.1. The projection \(\xhat\) of a case vector \(\xvec\) onto \(W_2\) is a three-dimensional vector, which may be represented by the two coordinates describing this vector as a linear combination of \(\uvec_1\) and \(\uvec_2\text{.}\)
If we form the matrix \(Q=\begin{bmatrix}\uvec_1 \amp \uvec_2 \end{bmatrix}\text{,}\) then we have
\begin{equation*} Q^{\transpose}\xvec = \twovec{\uvec_1\cdot\xvec}{\uvec_2\cdot\xvec}. \end{equation*}
This means that the columns of \(Q^{\transpose}X^{\transpose}\) represent the coordinates of the projected case vectors, which may now be plotted in the plane. Transposing again gives the column-variate version of the PCA-projected data: \(XQ\text{.}\)
In this plot, the first coordinate, represented by the horizontal coordinate, represents the projection of a case vector onto the line defined by \(\uvec_1\) while the second coordinate represents the projection onto the line defined by \(\uvec_2\text{.}\) Since \(\uvec_1\) is the first principal component, the variance in the \(\uvec_1\) direction is greater than the variance in the \(\uvec_2\) direction. For this reason, the plot will be more spread out in the horizontal direction than in the vertical.
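The coordinates \(XQ\) and the reason the plot is wider than it is tall can both be seen in a short computation. This is a sketch on synthetic data, not the activity's dataset, and it assumes numpy.linalg.eigh() in place of numpy.linalg.eig() so that the eigenvalues arrive in a known order. The covariance matrix of the coordinates works out to \(Q^{\transpose}SQ\text{,}\) which should be diagonal with \(\lambda_1\) and \(\lambda_2\) on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 100 observations of 3 variables (made up for illustration)
data = rng.normal(size=(100, 3)) @ np.array([[9.0, 2.0, 1.0],
                                              [0.0, 4.0, 1.0],
                                              [0.0, 0.0, 0.5]])
X = data - data.mean(axis=0)
n = X.shape[0]
S = X.T @ X / (n - 1)

# eigh returns increasing eigenvalues, so reverse to get decreasing order
eigenvalues, P = np.linalg.eigh(S)
eigenvalues, P = eigenvalues[::-1], P[:, ::-1]

Q = P[:, :2]             # 3 x 2 matrix whose columns are u1 and u2
coords = X @ Q           # 100 x 2; row i holds (u1 . x_i, u2 . x_i)
X_hat = coords @ Q.T     # the projections themselves, as vectors in R^3

# Covariance matrix of the two-dimensional coordinates
coords_cov = coords.T @ coords / (n - 1)
print(np.round(coords_cov, 6))
print("first two eigenvalues:", eigenvalues[:2])
```

Since the first diagonal entry is the larger one, the horizontal spread of the plotted coordinates exceeds the vertical spread.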

Subsection 7.4.2 Using Principal Component Analysis

Now that we’ve explored the ideas behind principal component analysis, we will look at a few examples that illustrate its use.

Activity 7.4.4.

The next cell will load a dataset describing the average consumption of various food groups for citizens in each of the four nations of the United Kingdom. The units for each entry are grams per person per week.
We will view this as a dataset consisting of four case vectors in \(\real^{17}\text{.}\) Since we have 17 variables measured for each observation, we can’t easily visualize this with a scatter plot, which would consist of four points in 17-dimensional space. Studying the numbers themselves doesn’t lead to much insight either.
In addition to loading the data, evaluating the cell above created a vector data_mean, which is the componentwise mean of the four case vectors, and FoodX, the \(4 \by 17\) column-variate matrix of demeaned data.
  1. What is the average consumption of Beverages across the four nations?
  2. Find the covariance matrix \(S\) and its eigenvalues. Because there are four points in \(\real^{17}\) whose mean is zero, there are only three nonzero eigenvalues.
  3. For what percentage of the total variance does the first principal component account?
  4. Find the first principal component \(\uvec_1\) and project the four demeaned case vectors onto the line defined by \(\uvec_1\text{.}\) Plot those vectors as points on Figure 7.4.2.
    Figure 7.4.2. A plot of the demeaned data projected onto the first principal component.
  5. For what percentage of the total variance do the first two principal components account?
  6. Find the coordinates of the demeaned case vectors projected onto \(W_2\text{,}\) the two-dimensional subspace of \(\real^{17}\) spanned by the first two principal components.
    Plot these coordinates in Figure 7.4.3.
    Figure 7.4.3. The coordinates of the demeaned case vectors projected onto the first two principal components.
  7. What information do these plots reveal that is not clear from consideration of the original case vectors?
  8. Study the first principal component \(\uvec_1\) and find the first component of \(\uvec_1\text{,}\) which corresponds to the dietary category Alcoholic Drinks. (To do this, you may wish to round the entries to two decimal places, for instance with numpy.round(u1, 2), for a result that’s easier to read.) If a case vector lies on the far right side of the plot in Figure 7.4.3, what does it mean about that nation’s consumption of Alcoholic Drinks?
Answer.
  1. \(\displaystyle 57.5\)
  2. \(78805\text{,}\) \(33946\text{,}\) and \(4093\)
  3. \(\displaystyle 67\%\)
  4. The coordinates are
    Nation Coordinate
    England \(-145\)
    Northern Ireland \(477\)
    Scotland \(-92\)
    Wales \(-241\)
  5. \(\displaystyle 96\%\)
  6. The coordinates are
    Nation Coordinates
    England \((-145, 3)\)
    Northern Ireland \((477, 59)\)
    Scotland \((-92, -286)\)
    Wales \((-241, 225)\)
  7. Northern Ireland appears to be significantly different from the other three nations.
  8. The average consumption of Alcoholic Drinks will be less than the mean.
Solution.
  1. Beverages is the second category so this would be the second component of the data_mean vector, which is \(57.5\text{.}\)
  2. The three nonzero eigenvalues are \(78805\text{,}\) \(33946\text{,}\) and \(4093\text{.}\)
  3. The total variance \(V=116844\) is the sum of the eigenvalues so the first principal component accounts for \(\lambda_1/V = 67\%\) of the total variance.
  4. The coordinates are
    Nation Coordinate
    England \(-145\)
    Northern Ireland \(477\)
    Scotland \(-92\)
    Wales \(-241\)
  5. The first two principal components account for \(96\%\) of the total variance.
  6. The coordinates are
    Nation Coordinates
    England \((-145, 3)\)
    Northern Ireland \((477, 59)\)
    Scotland \((-92, -286)\)
    Wales \((-241, 225)\)
  7. Northern Ireland appears to be significantly different from the other three nations. There are several possible reasons for this, both historical and geographical, that we might explore.
  8. The first component of \(\uvec_1\) is negative. Therefore, if a nation is on the right side of this plot, the average consumption of Alcoholic Drinks will be less than the mean. This can be confirmed by looking at the original data.
This activity demonstrates how principal component analysis enables us to extract information from a dataset that may not be easily obtained otherwise. As in our previous example, we see that the case vectors lie quite close to a two-dimensional subspace of \(\real^{17}\text{.}\) In fact, \(W_2\text{,}\) the subspace spanned by the first two principal components, accounts for more than 96% of the variance. More importantly, when we project the data onto \(W_2\text{,}\) it becomes apparent that Northern Ireland is fundamentally different from the other three nations.
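The claim that four demeaned points in \(\real^{17}\) produce at most three nonzero eigenvalues can be checked on random data. The sketch below uses made-up numbers rather than the food dataset. The four demeaned case vectors sum to zero, so they span a subspace of dimension at most three, and the \(17\by17\) covariance matrix has rank at most three.

```python
import numpy as np

rng = np.random.default_rng(3)

data = rng.normal(size=(4, 17))      # made-up numbers, not the food data
X = data - data.mean(axis=0)         # 4 x 17; each column now sums to zero
S = X.T @ X / (X.shape[0] - 1)       # 17 x 17 covariance matrix

eigenvalues = np.linalg.eigvalsh(S)[::-1]     # decreasing order

# Count eigenvalues that are not negligible relative to the largest one
nonzero = int(np.sum(eigenvalues > 1e-10 * eigenvalues[0]))
print("number of nonzero eigenvalues:", nonzero)
print("rank of X:", np.linalg.matrix_rank(X))
```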
With some additional thought, we can determine more specific ways in which Northern Ireland is different. On the \(2\)-dimensional plot, Northern Ireland lies far to the right compared to the other three nations. Since the data has been demeaned, the origin \((0,0)\) in this plot corresponds to the average of the four nations. The coordinates of the point representing Northern Ireland are about \((477, 59)\text{,}\) meaning that the projected data point differs from the mean by about \(477\uvec_1+59\uvec_2\text{.}\)
Let’s just focus on the contribution from \(\uvec_1\text{.}\) We see that the ninth component of \(\uvec_1\text{,}\) the one that describes Fresh Fruit, is about \(-0.63\text{.}\) This means that the ninth component of \(477\uvec_1\) differs from the mean by about \(477(-0.63) = -300\) grams per person per week. So roughly speaking, people in Northern Ireland are eating about 300 fewer grams of Fresh Fruit than the average across the four nations. This is borne out by looking at the original data, which show that the consumption of Fresh Fruit in Northern Ireland is significantly less than in the other nations. Examining the other components of \(\uvec_1\) shows other ways in which Northern Ireland differs from the other three nations.

Activity 7.4.5.

In this activity, we’ll look at a well-known dataset 1  that describes 150 irises representing three species of iris: iris setosa, iris versicolor, and iris virginica. For each flower, the length and width of its sepal and the length and width of its petal, all in centimeters, are recorded.
Figure 7.4.4. One of the three species, iris versicolor, represented in the dataset showing three shorter petals and three longer sepals. (Source: Wikipedia 2 , License: GNU Free Documentation License 3 )
Evaluating the following cell will load the dataset, which consists of four physical measurements for each of 150 iris plants. In addition, we compute a vector data_mean, a four-dimensional vector holding the means of the four measurements (which is the same as the mean of the 150 case vectors), and irisX, the \(150 \by 4\) demeaned data matrix.
Since the data is four-dimensional, we are not able to visualize it easily. Of course, we could forget about two of the measurements and plot a scatter plot of just two of the four variables, say, just the sepal length and sepal width.
  1. What is the mean sepal width?
  2. Find the covariance matrix \(S\) and its eigenvalues.
  3. Find the fraction of variance for which the first two principal components account.
  4. Construct the first two principal components \(\uvec_1\) and \(\uvec_2\) along with the matrix \(Q\) whose columns are \(\uvec_1\) and \(\uvec_2\text{.}\)
  5. As we have seen, the columns of the matrix \(Q^{\transpose}X^{\transpose}\) or rows of \(XQ\) hold the coordinates of the demeaned case vectors after projecting onto \(W_2\text{,}\) the subspace spanned by the first two principal components. Evaluating the following cell shows a plot of these coordinates.
    Suppose we have a flower whose coordinates in this plane are \((-2.5, -0.75)\text{.}\) To what species does this iris most likely belong? Find an estimate of the sepal length, sepal width, petal length, and petal width for this flower.
  6. Suppose you have an iris, but you only know that its sepal length is 5.65 cm and its sepal width is 2.75 cm. Knowing only these two measurements, determine the coordinates \((c_1, c_2)\) in the plane where this iris lies. To what species does this iris most likely belong? Now estimate the petal length and petal width of this iris.
  7. Suppose you find another iris whose sepal width is 3.2 cm and whose petal width is 2.2 cm. Find the coordinates \((c_1, c_2)\) of this iris and determine the species to which it most likely belongs. Also, estimate the sepal length and the petal length.
Answer.
  1. \(3.05\text{.}\)
  2. \(4.20\text{,}\) \(0.24\text{,}\) \(0.08\text{,}\) \(0.02\text{.}\)
  3. \(\displaystyle 97.8\%\)
  4. The columns of \(Q\) are the first two principal components.
  5. Iris setosa and the vector of measurements is \(\fourvec{5.43}{3.81}{1.49}{0.25}\text{.}\)
  6. The petal length is \(3.99\) and the petal width is \(1.29\text{.}\)
  7. The sepal length is \(7.23\) and the petal length is \(6.15\text{.}\)
Solution.
  1. The second component of data_mean, which is the one corresponding to sepal width, is \(3.05\text{.}\)
  2. The eigenvalues are \(4.20\text{,}\) \(0.24\text{,}\) \(0.08\text{,}\) and \(0.02\text{.}\)
  3. The first two principal components account for \(97.8\%\) of the variance.
  4. If \(P\) is the matrix whose columns are an orthonormal basis of eigenvectors, then \(Q\) is formed from the first two columns of \(P\text{.}\)
  5. This would most likely belong to iris setosa. To find its measurements, we evaluate \(-2.5\uvec_1 - 0.75\uvec_2 + \mvec\) where \(\mvec\) is the vector of means. This is the same as \(Q\twovec{-2.5}{-0.75} + \mvec\text{,}\) which gives the vector of measurements \(\fourvec{5.43}{3.81}{1.49}{0.25}\text{.}\)
  6. Subtracting the mean sepal length and sepal width from the two known measurements gives \((-0.19, -0.30)\text{.}\) The first two components of \(c_1\uvec_1+c_2\uvec_2 = Q\twovec{c_1}{c_2}\) must therefore equal \(\twovec{-0.19}{-0.30}\text{,}\) which is a system of two linear equations in \(c_1\) and \(c_2\text{.}\) Solving it gives \((c_1, c_2) = (0.18, 0.40)\text{.}\) This looks like an iris versicolor. As in the previous part, we can now find the petal length to be \(3.99\) and the petal width to be \(1.29\text{.}\)
  7. Using the same approach as the last part, we find \((c_1,c_2)=(2.90, -0.53)\text{,}\) which gives a sepal length of \(7.23\) and a petal length of \(6.15\text{.}\) Most likely, this flower belongs to iris virginica.
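The two computations used in the last three parts, recovering measurements from coordinates and recovering coordinates from two known measurements, can be sketched as below. The data are synthetic (we do not load the iris dataset here), and the function names `reconstruct` and `coords_from_partial` are hypothetical, so the printed numbers will differ from the answers above.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic 4-variable data lying near a 2-dimensional subspace (not the iris data)
latent = rng.normal(size=(150, 2)) * np.array([2.0, 0.5])
mixing = np.array([[0.4, -0.1, 0.85, 0.35],
                   [0.6, 0.75, -0.2, -0.1]])
offset = np.array([5.8, 3.1, 3.8, 1.2])
data = latent @ mixing + offset + rng.normal(scale=0.05, size=(150, 4))

mean = data.mean(axis=0)
X = data - mean
S = X.T @ X / (X.shape[0] - 1)
eigenvalues, P = np.linalg.eigh(S)       # increasing eigenvalues
Q = P[:, ::-1][:, :2]                    # columns are the first two principal components

def reconstruct(coords, Q, mean):
    """Estimate a full case vector from its coordinates (c1, c2): Q c + mean."""
    return Q @ np.asarray(coords, dtype=float) + mean

def coords_from_partial(known_idx, known_values, Q, mean):
    """Find (c1, c2) so that the known components of Q c + mean match.

    With two known measurements this is a 2 x 2 linear system built from
    the corresponding two rows of Q.
    """
    rhs = np.asarray(known_values, dtype=float) - mean[known_idx]
    return np.linalg.solve(Q[known_idx, :], rhs)

# Know only the first two measurements; estimate the other two
c = coords_from_partial([0, 1], [5.65, 2.75], Q, mean)
estimate = reconstruct(c, Q, mean)
print("coordinates:", c)
print("estimated case vector:", estimate)
```

The estimate reproduces the two known measurements exactly, while the other two entries are only as reliable as the assumption that the flower lies close to \(W_2\text{.}\)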

Subsection 7.4.3 Summary

This section has explored principal component analysis (PCA) as a technique to reduce the dimension of a dataset. From the demeaned column-variate data matrix \(X\text{,}\) we form the covariance matrix \(S_{XX} = \frac1{n-1} X^{\transpose}X\text{,}\) where \(n\) is the number of observational units.
  • The eigenvectors \(\uvec_1, \uvec_2, \ldots, \uvec_m\) of \(S_{XX}\) are called the principal components. We arrange them so that their corresponding eigenvalues are in decreasing order.
  • If \(W_p\) is the subspace spanned by the first \(p\) principal components, then the variance of the demeaned data projected onto \(W_p\) is the sum of the first \(p\) eigenvalues of \(S_{XX}\text{.}\) No other \(p\)-dimensional subspace retains more variance when the data are projected onto it.
  • If \(Q\) is the matrix whose columns are the first \(p\) principal components, then \(XQ\) is the column-variate data matrix of the data projected onto \(W_p\text{,}\) with each projected case vector expressed in the basis \(\uvec_1,\ldots,\uvec_p\text{.}\)
  • Our goal is to use a number of principal components that is large enough to retain most of the variance in the dataset but small enough to be manageable.
  • The advantage of principal component analysis is the resulting data reduction. The primary disadvantage is that the principal components may not be easily or naturally interpretable.

Exercises 7.4.4 Exercises

1.

Suppose that
\begin{equation*} Q = \begin{bmatrix} -1/\sqrt{2} \amp 1/\sqrt{2} \\ 1/\sqrt{2} \amp 1/\sqrt{2} \\ \end{bmatrix}, \hspace{24pt} D_1 = \begin{bmatrix} 75 \amp 0 \\ 0 \amp 74 \end{bmatrix}, \hspace{24pt} D_2 = \begin{bmatrix} 100 \amp 0 \\ 0 \amp 1 \end{bmatrix} \end{equation*}
and that we have two datasets, one whose covariance matrix is \(S_1 = QD_1Q^{\transpose}\) and one whose covariance matrix is \(S_2 = QD_2Q^{\transpose}\text{.}\) For each dataset, find
  1. the total variance.
  2. the fraction of variance represented by the first principal component.
  3. a verbal description of how the demeaned case vectors appear when plotted as points in the plane.

2.

Suppose that a dataset has mean \(\threevec{13}{5}{7}\) and that its associated covariance matrix is \(S=\begin{bmatrix} 275 \amp -206 \amp 251 \\ -206 \amp 320 \amp -206 \\ 251 \amp -206 \amp 275 \end{bmatrix} \text{.}\)
  1. What fraction of the variance is represented by the first two principal components?
  2. Suppose \(\threevec{30}{-3}{26}\) is one of the case vectors. Find the coordinates when the demeaned case vector is projected into the plane defined by the first two principal components.
  3. If a projected case vector has coordinates \(\twovec{12}{-25}\text{,}\) find an estimate for the original case vector. Why is it only an estimate? What factors determine how good the estimate is likely to be?

3.

Determine whether the following statements are true or false and explain your thinking.
  1. If the eigenvalues of the covariance matrix are \(\lambda_1\text{,}\) \(\lambda_2\text{,}\) and \(\lambda_3\text{,}\) then \(\lambda_3\) is the variance of the demeaned case vectors when projected on the third principal component \(\uvec_3\text{.}\)
  2. Principal component analysis always allows us to construct a smaller dimensional representation of a dataset without losing any information.
  3. If the eigenvalues of the covariance matrix are 56, 32, and 0, then the demeaned case vectors all lie on a line in \(\real^3\text{.}\)

4.

In Activity 7.4.5, we looked at a dataset consisting of four measurements of 150 irises. These measurements are sepal length, sepal width, petal length, and petal width.
  1. Find the first principal component \(\uvec_1\) and describe the meaning of its four components. Which component is most significant? What can you say about the relative importance of the four measurements? (See Exercise 7.4.4.5 for an important caveat about this line of reasoning.)
  2. When the dataset is plotted in the plane defined by \(\uvec_1\) and \(\uvec_2\text{,}\) the specimens from the species iris-setosa lie on the left side of the plot. What does this tell us about how iris-setosa differs from the other two species in the four measurements?
  3. In general, which species is closest to the “average iris”?

5.

This problem explores a dataset describing 333 penguins. There are three species, Adelie, Chinstrap, and Gentoo, as illustrated on the left of Figure 7.4.5, as well as both male and female penguins in the dataset.
Figure 7.4.5. Artwork by @allison_horst 4 
Evaluating the next cell will load and display the data. The meaning of the culmen length and width is contained in the illustration on the right of Figure 7.4.5.
This dataset is a bit different from others that we’ve looked at because the scale of the measurements is significantly different. For instance, the measurements for the body mass are roughly 100 times as large as those for the culmen length. For this reason, first standardize the data by demeaning it, as usual, and then by also rescaling each measurement by the reciprocal of its standard deviation. The result should be a \(333 \by 4\) matrix penguinsXstd with column means of 0 and column standard deviations of 1.
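The standardization step might look like the following sketch. It uses made-up measurements on very different scales rather than the penguin data, and the variable names are our own. We assume the sample standard deviation (`ddof=1`), which matches the factor \(\frac{1}{n-1}\) in the covariance matrix, so that each diagonal entry of the resulting covariance matrix is 1.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up measurements on very different scales (not the penguin data)
data = np.column_stack([
    rng.normal(44.0, 5.0, size=333),
    rng.normal(17.0, 2.0, size=333),
    rng.normal(200.0, 14.0, size=333),
    rng.normal(4200.0, 800.0, size=333),
])

means = data.mean(axis=0)
stds = data.std(axis=0, ddof=1)      # sample standard deviation of each column
X_std = (data - means) / stds        # demean, then rescale each column

S = X_std.T @ X_std / (X_std.shape[0] - 1)
print("column means:", np.round(X_std.mean(axis=0), 6))
print("column standard deviations:", np.round(X_std.std(axis=0, ddof=1), 6))
print("diagonal of the covariance matrix:", np.round(np.diag(S), 6))
```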
  1. Find the covariance matrix for penguinsXstd and its eigenvalues.
  2. What fraction of the total variance is explained by the first two principal components?
  3. Construct the \(333 \by 2\) matrix penguinsPCA that results from projecting onto the first two principal components. The following cell will create the plot.
  4. Examine the components of the first two principal component vectors. How does the body mass of Gentoo penguins compare to that of the other two species?
  5. What seems to be generally true about the culmen measurements for a Chinstrap penguin compared to an Adelie?
  6. You can separate the males from the females using the following cell.
    What seems to be generally true about the body mass measurements for a male Gentoo compared to a female Gentoo?
archive.ics.uci.edu
gvsu.edu/s/21D
gvsu.edu/s/21E
gvsu.edu/s/21G