Suppose we collect some data when performing an experiment and plot it as shown on the left of Figure 6.5.1. Notice that there is no line on which all the points lie; in fact, it would be surprising if there were since we can expect some uncertainty in the measurements recorded. There does, however, appear to be a line, as shown on the right, on which the points almost lie.
In this section, we’ll explore how the techniques developed in this chapter enable us to find the line that best approximates the data.
More generally, whenever \(A\xvec=\bvec\) is inconsistent, we can instead seek an approximate solution -- a solution to \(A\xvec=\bhat\) where \(\bhat\) is as close as possible to \(\bvec\text{.}\) Orthogonal projection gives us just the right tool for doing this.
Preview Activity 6.5.1.
Is there a solution to the equation \(A\xvec=\bvec\) where \(A\) and \(\bvec\) are such that
Applying Gram-Schmidt, we find an orthogonal basis consisting of \(\wvec_1=\cthreevec12{-1}\) and \(\wvec_2=\threevec012\text{.}\)
The projection formula gives \(\bhat =
\cthreevec0{-1}{-2}\text{.}\)
The equation is consistent because \(\bhat\) is in \(\col(A)\text{.}\) We find the solution \(\xvec=\twovec2{-1}\text{.}\)
Subsection 6.5.1 A first example
When we’ve encountered inconsistent systems in the past, we’ve simply said there is no solution and moved on. The preview activity, however, shows how we can find approximate solutions to an inconsistent system: if there are no solutions to \(A\xvec = \bvec\text{,}\) we instead solve the consistent system \(A\xvec = \bhat\text{,}\) where \(\bhat\) is the orthogonal projection of \(\bvec\) onto \(\col(A)\text{.}\) As we’ll see, this solution is, in a specific sense, the best possible.
Activity 6.5.2.
Suppose we have three data points \((1,1)\text{,}\)\((2,1)\text{,}\) and \((3,3)\) and that we would like to find a line passing through them.
Plot these three points in Figure 6.5.2. Are you able to draw a line that passes through all three points?
Remember that the equation of a line can be written as
\begin{equation*}
y = mx + b
\end{equation*}
where \(m\) is the slope and \(b\) is the \(y\)-intercept. Statisticians prefer the notation
\begin{equation*}
y = \beta_0 + \beta_1 x\text{,}
\end{equation*}
and we’re going to adopt this statistical notation for most of the remainder of this chapter since least squares is such an important method in statistics.
To begin, we will try to find \(\beta_0\) and \(\beta_1\) so that the three points lie on the line with equation \(y = \beta_0 + \beta_1 x\text{.}\) The first data point \((1,1)\) gives an equation for \(\beta_0\) and \(\beta_1\text{.}\) In particular, we know that when \(x=1\text{,}\) then \(y=1\) so we have \(\beta_0 + \beta_1(1) = 1\) or \(\beta_0 + \beta_1 = 1\text{.}\)
Use the other two data points to create a linear system describing \(\beta_0\) and \(\beta_1\text{.}\)
We have obtained a linear system having three equations, one from each data point, for the two unknowns \(\beta_0\) and \(\beta_1\text{.}\) Identify a matrix \(X\) and vector \(\yvec\) so that the system has the form \(\yvec = X \betavec\text{,}\) where \(\betavec=\ctwovec{\beta_0}{\beta_1}\text{.}\)
Notice that the unknown vector \(\betavec=\ctwovec {\beta_0}{\beta_1}\) specifies the intercept and slope of the line that we seek.
Is there a solution to this linear system? How does this question relate to your attempt to draw a line through the three points above?
Since this system is inconsistent, we know that \(\yvec\) is not in the column space \(\col(X)\text{.}\) Find an orthogonal basis for \(\col(X)\) and use it to find the orthogonal projection \(\yhat = \proj{\yvec}{\col(X)}\text{.}\)
Since \(\yhat\) is in \(\col(X)\text{,}\) the equation \(X\betavec = \yhat\) is consistent. Find its solution, which we will denote \(\betahat = \ctwovec{\hat\beta_0}{\hat\beta_1}\text{,}\) and sketch the line \(y=\hat{\beta}_0 + \hat{\beta}_1 x\) in Figure 6.5.2. This line is called the least squares regression line. That "hat" on \(\betahat\) indicates that these coefficients (most likely) do not fit the data exactly, but come as close as we can to doing so (in the sense of minimizing the distance between \(\yhat = X\betahat\) and \(\yvec\)).
We have \(X = \begin{bmatrix}
1 \amp 1 \\
1 \amp 2 \\
1 \amp 3 \\
\end{bmatrix}\) and \(\yvec=\threevec113\text{.}\)
Finding the reduced row echelon form of the associated augmented matrix tells us this is an inconsistent system. Since a solution would describe a line passing through the three points, we should expect this.
Applying Gram-Schmidt gives us the orthogonal basis \(\wvec_1=\threevec111\) and \(\wvec_2 =
\threevec{-1}01\text{.}\) Projecting \(\yvec\) onto \(\col(X)\) gives \(\yhat=\threevec{2/3}{5/3}{8/3}\text{.}\)
Solving the equation \(X\betavec = \yhat\) gives \(\betahat=\ctwovec{-1/3}1\text{,}\) which describes a line with vertical intercept \(\hat\beta_0=-1/3\) and slope \(\hat\beta_1=1\text{.}\) This line is shown in Figure 6.5.4.
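If you would like to check this computation numerically, here is a minimal sketch using NumPy directly (the array names are ours, not part of the activity's cells):

```python
import numpy as np

# Data from the activity: the points (1,1), (2,1), (3,3)
X = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
y = np.array([1., 1., 3.])

# Orthogonal basis for col(X) produced by Gram-Schmidt
w1 = np.array([1., 1., 1.])
w2 = np.array([-1., 0., 1.])

# Orthogonal projection of y onto col(X)
yhat = (y @ w1) / (w1 @ w1) * w1 + (y @ w2) / (w2 @ w2) * w2
print(yhat)                           # [0.667 1.667 2.667], i.e. (2/3, 5/3, 8/3)

# X betahat = yhat is consistent, so two of its equations determine betahat
betahat = np.linalg.solve(X[:2], yhat[:2])
print(betahat)                        # [-0.333  1.], the line y = -1/3 + x
```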
This activity illustrates the idea behind a technique known as least squares regression, which we have been working toward throughout this chapter. If the data points are denoted as \((x_i, y_i)\text{,}\) we construct the matrix \(X\) and vector \(\yvec\) as
\begin{equation*}
X = \begin{bmatrix}
1 \amp x_1 \\
1 \amp x_2 \\
1 \amp x_3 \\
\end{bmatrix}, \hspace{24pt}
\yvec = \threevec{y_1}{y_2}{y_3}\text{.}
\end{equation*}
With the vector \(\betavec=\ctwovec{\beta_0}{\beta_1}\) representing the line with equation \(y = \beta_0 + \beta_1 x\text{,}\) we see that the equation \(\yvec = X\betavec\) describes a line passing through all the data points. In our activity, it is visually apparent that there is no such line, which agrees with the fact that the equation \(\yvec = X\betavec\) is inconsistent.
Remember that \(\yhat = \proj{\yvec}{\col(X)}\) is the closest vector in \(\col(X)\) to \(\yvec\text{.}\) Therefore, when we solve the equation \(X\betavec=\yhat\text{,}\) we are finding the vector \(\betahat\) so that \(\yhat = X\betahat = \threevec{\hat{\beta}_0+ \hat{\beta_1} x_1}{\hat{\beta}_0+ \hat{\beta_1} x_2}{\hat{\beta}_0+ \hat{\beta_1} x_3}\) is as close to \(\yvec=\threevec{y_1}{y_2}{y_3}\) as possible. Let’s think about what this means within the context of this problem.
The square of the distance between \(X\betavec\) and \(\yvec\) is
\begin{equation*}
\len{\yvec - X\betavec}^2 = \left(y_1 - (\beta_0+\beta_1 x_1)\right)^2 + \left(y_2 - (\beta_0+\beta_1 x_2)\right)^2 + \left(y_3 - (\beta_0+\beta_1 x_3)\right)^2\text{.}
\end{equation*}
Our approach finds the values of \(\beta_0\) and \(\beta_1\) that make this sum of squares as small as possible, which is why we call this a least squares problem.
Usually the least squares regression line does not pass through all of the data points. Statisticians call \(y_i - \hat{y}_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)\) the residual for observation \(i\text{.}\) As shown in Figure 6.5.5, residuals measure the vertical distance between the observed response value \(y_i\) and the predicted response value \(\hat y_i\text{:}\)
Seen in this way, the square of the distance \(\len{\yvec-\yhat}^2\) is a measure of how much the line defined by the vector \(\betahat\) misses the data points. The solution to the least squares problem is the line that misses the data points by the smallest amount possible (when measured in this way).
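For instance, for the three data points in the activity above, the residual vector and its squared length are
\begin{equation*}
\yvec - \yhat = \threevec{1}{1}{3} - \threevec{2/3}{5/3}{8/3} = \threevec{1/3}{-2/3}{1/3}, \hspace{24pt}
\len{\yvec - \yhat}^2 = \frac19 + \frac49 + \frac19 = \frac23\text{,}
\end{equation*}
and no other choice of \(\beta_0\) and \(\beta_1\) produces a smaller sum of squares.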
Subsection 6.5.2 The linear model framework
The previous example illustrates simple linear regression, in which there is a single quantitative predictor and the model proposes a linear relationship between the explanatory variable and the response. Least squares regression can be used with multiple explanatory variables just as easily as with one -- at least if we are willing to let the computer take care of the tedious arithmetic involved. In this section we describe a general framework called linear models. Linear models and their generalizations are arguably the most important and most commonly used methods of data analysis.
Suppose we are looking for a relationship of the form
\begin{equation*}
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\text{.}
\end{equation*}
We mean for this to hold for every subject in the data, each of whom has (likely different) values of \(y\) and the \(x_i\)’s. But we want the values of the \(\beta\)’s to be the same for everyone. Let’s rewrite our equation to emphasize that we are really dealing with vectors here:
\begin{equation*}
\yvec = \beta_0 \onevec + \beta_1 \xvec_1 + \beta_2 \xvec_2 + \cdots + \beta_p \xvec_p\text{.}
\end{equation*}
Notice that we snuck in a vector of 1’s to make the "constant" term look like all the others.
Now let’s go one more step and express this model using matrices. The vector of ones and the \(\xvec_i\) vectors form the columns of the \(n \by (p+1)\) matrix \(X\text{.}\) This matrix is usually referred to as the model matrix. The coefficients \(\beta_i\) are arranged into a column vector \(\betavec\text{.}\) Then the entire relationship is expressed as a simple equation of the form \(A\xvec = \bvec\text{,}\) but with different letters. And typically statisticians prefer to swap the left and right sides of the equation as well. So the model is expressed like this:
\begin{equation*}
\yvec = X \betavec\text{.}
\end{equation*}
In this context, \(\col(X)\) is called the model space. Since \(\yhat\) is the orthogonal projection of \(\yvec\) onto the model space, it will be important below to note that
\begin{equation*}
\yvec - \yhat \mbox{ is orthogonal to (every vector in) the model space}\text{.}
\end{equation*}
Whether we express our equation as \(A \xvec = \bvec\text{,}\) as we have mostly done to this point, or as \(\yvec = X \betavec\text{,}\) as we will typically do in statistical applications, or using some other letters, the linear algebra is the same (and familiar): orthogonal projection, solving systems of linear equations, etc.
Example 6.5.6.
Add example of setting up a multiple regression problem. Include at least a binary categorical predictor.
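Until that example is written out, here is a minimal sketch (with entirely hypothetical values and variable names) of how a model matrix containing one quantitative predictor and one binary categorical predictor might be assembled in NumPy; the categorical variable is encoded as 0/1 so that its coefficient shifts the baseline for one group.

```python
import numpy as np

# Hypothetical data for four subjects: a quantitative predictor (age, in years)
# and a binary categorical predictor (smoker), encoded as 0 = no, 1 = yes.
age    = np.array([34., 51., 29., 62.])
smoker = np.array([0., 1., 0., 1.])

# Model matrix for y = beta_0 + beta_1*age + beta_2*smoker:
# a column of 1's for the intercept, then one column per predictor.
X = np.column_stack([np.ones(len(age)), age, smoker])
print(X)
# [[ 1. 34.  0.]
#  [ 1. 51.  1.]
#  [ 1. 29.  0.]
#  [ 1. 62.  1.]]
```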
We conclude this section with a note about the intercept term.
Note 6.5.7. Models without an intercept.
It is possible to fit models without an intercept term. In this case the column of 1’s is omitted from the model matrix \(X\text{.}\) Algebraically, a few things, like the definition and interpretation of \(R^2\) below, do not work out as well in that case. And statistically, omitting the intercept makes a strong assumption about the nature of the relationship. In most statistical software, the default is to always include an intercept, but there are options to fit models without the intercept term if so desired.
Subsection 6.5.3 Solving least squares problems
Now that we’ve discussed least squares approximate solutions to \(A \xvec = \bvec\) and seen an important application of this method in linear models, usually expressed as \(\yvec = X \betavec\text{,}\) it is time to turn our attention to some of the details involved in solving least squares problems. We’ll continue with the statistical notation for this.
We already know one way to solve a least squares problem \(\yvec = X \betavec\text{,}\) namely
Project \(\yvec\) onto the model space; that is, compute \(\yhat = \proj{\yvec}{\col(X)}\text{.}\)
Solve the new equation \(\yhat = X \betavec\) for \(\betavec\).
Because \(\yhat \in \col(X)\text{,}\) we know this equation is consistent. If the columns of \(X\) are linearly independent (as will usually be the case in linear model applications), then there is exactly one solution. We denote the solution to this as \(\betahat\text{.}\) The entries in \(\betahat\) are called the (estiamted) coeffients of the model.
Because we can measure how close \(\yhat\) is to \(\yvec\) using an expression that involves a sum of squares, and \(\betahat\) makes this expression as small as possible, \(\betahat\) is called a least squares approximate solution to the original equation \(\yvec = X\betavec\text{.}\)
That is the method we have outlined above. But there are other methods that are usually used in practice, because they are more efficient and more stable numerically.
We begin by describing a method for finding \(\betahat\) that does not involve first finding the orthogonal projection \(\yhat\text{.}\) Remember that \(\yhat = \proj{\yvec}{\col(X)}\text{,}\)
so \(\yvec - \yhat\) is orthogonal to \(\col(X)\text{.}\) In other words, \(\yvec - \yhat\) is in the orthogonal complement \(\col(X)^\perp\text{,}\) which Proposition 6.2.10 tells us is the same as \(\nul(X^{\transpose})\text{.}\) Since \(\yvec - \yhat\) is in \(\nul(X^{\transpose})\text{,}\) it follows that
\begin{equation*}
X^{\transpose}(\yvec - \yhat) = \zerovec\text{.}
\end{equation*}
This is just another way of writing down that \(\yvec - \yhat\) is orthogonal to each column of \(X\text{.}\)
Because the least squares approximate solution is the vector \(\betahat\) such that \(X\betahat = \yhat\text{,}\) we can substitute and rearrange this equation to see that
\begin{equation*}
X^{\transpose}X\betahat = X^{\transpose}\yvec\text{.}
\end{equation*}
This equation is called the normal equation, and we have the following proposition.
Proposition 6.5.8.
If the columns of \(X\) are linearly independent, then there is a unique least squares approximate solution \(\betahat\) to the equation \(X\betavec=\yvec\text{,}\) given by the normal equation
\begin{equation*}
X^{\transpose}X\betahat = X^{\transpose}\yvec\text{.}
\end{equation*}
Example 6.5.9.
Since this equation is inconsistent, we will find the least squares approximate solution \(\betahat\) by solving the normal equation \(X^{\transpose}X\betahat = X^{\transpose}\yvec\text{,}\) which has the form
Solving this yields \(\betahat=\twovec34\text{.}\)
You may wonder why the approach in Example 6.5.9 is better than the original approach. Here’s one reason why. Suppose we have a larger example in which \(X\) is \(n\by p\) with \(n\) much larger than \(p\text{,}\) as is often the case. Then \(X^{\transpose} X\) is \(p\by p\text{,}\) which is small. By contrast, computing \(\yhat = \proj{\yvec}{\col(X)}\) directly requires an orthogonal basis for \(\col(X)\) and, in matrix form, an \(n \by n\) projection matrix, which is large -- much larger than \(X\text{.}\) Working with \(X^{\transpose}X\) and \(X^{\transpose} \yvec\) is comparatively much less computationally intensive.
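To make this concrete, here is a minimal NumPy sketch that solves the normal equation for the three data points used earlier in this section; it recovers the same \(\betahat\) as the projection approach.

```python
import numpy as np

# The three-point example: model matrix and observed responses
X = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
y = np.array([1., 1., 3.])

# Form and solve the normal equation X^T X betahat = X^T y
XtX = X.T @ X            # 2x2 regardless of how many data points we have
Xty = X.T @ y
betahat = np.linalg.solve(XtX, Xty)
print(betahat)           # [-0.333  1.], matching the earlier computation
```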
Activity 6.5.3.
The rate at which a cricket chirps is related to the outdoor temperature, as reflected in some experimental data that we’ll study in this activity. The chirp rate \(C\) is expressed in chirps per second while the temperature \(T\) is in degrees Fahrenheit. Evaluate the following cell to load the data:
We would like to represent this relationship by a linear function
\begin{equation*}
\beta_0 + \beta_1 C = T\text{.}
\end{equation*}
Use the first data point \((C_1,T_1)=(20.0,88.6)\) to write an equation involving \(\beta_0\) and \(\beta_1\text{.}\)
Suppose that we represent the coefficients using a vector \(\betavec = \twovec{\beta_0}{\beta_1}\text{.}\) Use the 15 data points to create the matrix \(X\) and vector \(\yvec\) so that the linear system \(\yvec = X \betavec\) describes the desired relationship.
Construct the normal equation describing the least squares approximate solution; that is, find the matrix \(X^{\transpose}X\) and the vector \(X^{\transpose}\yvec\text{.}\)
Solve the normal equations to find \(\betahat\text{,}\) the least squares approximate solution to the equation \(\yvec = X \betavec\text{.}\) Call your solution \(\betahat\text{.}\)
What are the values of \(\beta_0\) and \(\beta_1\) that you found?
If the chirp rate is 22 chirps per second, what is your prediction for the temperature?
Plot the data and your least squares regression line.
We have the equation \(\beta_0 + 20.0\beta_1 =
88.6\text{.}\)
\(X\) is the \(15\by2\) matrix whose first column consists only of 1’s and whose second column is the vector of chirp rates. The vector \(\yvec\) is the vector of temperatures.
The predicted temperature is \(\beta_0 + 22\beta_1 = 97.9\) degrees.
Once we have the linear function that best fits the data, we can make predictions about situations that we haven’t encountered in the data. If we’re going to use our function to make predictions, it’s natural to ask how much confidence we have in these predictions. This is a statistical question that leads to a rich and well-developed theory, which we won’t explore in much detail here. However, there is one simple measure of how well our linear function fits the data that is known as the coefficient of determination and denoted by \(R^2\text{.}\)
We have seen that the predicted values from our model are given by \(\yhat = X \betahat\) and that the square of the distance \(\len{\yvec-\yhat}^2\) measures the amount by which the line fails to pass through the data points. When the line is close to the data points, we expect this number to be small. However, the size of this measure depends on the scale of the data. For instance, the two lines shown in Figure 6.5.10 seem to fit the data equally well, but \(\len{\yvec-\yhat}^2\) is 100 times larger on the right.
We can create a measure of fit that is independent of scale if we consider the relationship among three important vectors:
\(\yvec - \ybar\) holds the differences between the observed response values and their mean value. The (square of the) length of this vector is a measure of the total variability in the response variable.
\(\yvec - \yhat\) holds the differences between the observed response values and the fitted values from the model. These differences are called residuals. This vector is orthogonal to the model space, \(\col(X)\text{.}\)
\(\yhat - \ybar\) holds the differences between the fitted values and the mean response. Importantly, this vector is in the model space. We can see this as follows. The vector \(\yhat\) is in the model space by definition. If our model includes an intercept (so the first column of \(X\) is \(\onevec\)), then \(\ybar\onevec\text{,}\) a scalar multiple of that column, is also in the model space, so the difference \(\yhat - \ybar\) (shorthand for \(\yhat - \ybar\onevec\)) is in the model space as well.
Because \(\yhat - \ybar\) lies in the model space and \(\yvec - \yhat\) is orthogonal to it, the Pythagorean theorem tells us that
\begin{equation*}
\len{\yvec - \ybar}^2 = \len{\yhat - \ybar}^2 + \len{\yvec - \yhat}^2\text{,}
\end{equation*}
and we define
\begin{equation*}
R^2 = \frac{\len{\yhat - \ybar}^2}{\len{\yvec - \ybar}^2}\text{.}
\end{equation*}
So one way to interpret \(R^2\) is as the proportion of the variation in \(\yvec\) that is explained by the model (i.e., by \(\yhat\)). If \(\yhat = \yvec\text{,}\) the fit is perfect and \(R^2 = 1\text{.}\) At the other extreme, if our model predicts \(\yhat = \ybar\) (everyone is average), then \(R^2 = 0\text{.}\)
A more complete explanation of this definition relies on the concept of variance, which we explore in Exercise 6.5.8.12 and the next chapter. For the time being, it’s enough to know that \(0\leq R^2 \leq 1\) and that the closer \(R^2\) is to 1, the better the line fits the data. In our original example, illustrated in Figure 6.5.10, we find that \(R^2 = 0.75\text{,}\) and in our study of cricket chirp rates, we have \(R^2=0.69\text{.}\) However, assessing the confidence we have in predictions made by solving a least squares problem can require considerable thought, and it would be naive to rely only on the value of \(R^2\text{.}\)
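As a quick check, here is a short NumPy computation of \(R^2\) for the three-point example, using the definition above:

```python
import numpy as np

# Observed responses and fitted values from the three-point example
y    = np.array([1., 1., 3.])
yhat = np.array([2/3, 5/3, 8/3])
ybar = y.mean()

# R^2: the fraction of the variability in y captured by the fitted values
R2 = np.sum((yhat - ybar)**2) / np.sum((y - ybar)**2)
print(R2)                # 0.75, as reported above
```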
There is also a connection between the correlation coefficient and the coefficient of determination. For a simple linear model \(\yvec = \beta_0 + \beta_1 \xvec\text{,}\) it turns out that
\begin{equation*}
R^2 = r^2\text{,}
\end{equation*}
where \(r\) is the correlation coefficient between the predictor \(\xvec\) and the response \(\yvec\text{.}\)
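A quick numerical check of this fact on the three-point example, a sketch using NumPy's built-in correlation routine:

```python
import numpy as np

# The three-point example: predictor values and observed responses
x = np.array([1., 2., 3.])
y = np.array([1., 1., 3.])

r = np.corrcoef(x, y)[0, 1]    # correlation coefficient between x and y
print(r**2)                    # 0.75, equal to the R^2 found above
```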
As we’ve seen, the least squares approximate solution \(\xhat\) to \(A\xvec=\bvec\) may be found by solving the normal equation \(A^{\transpose}A\xhat = A^{\transpose}\bvec\text{,}\) and this can be a practical strategy for some problems. However, this approach can be problematic as small rounding errors can accumulate and lead to inaccurate final results.
As the next activity demonstrates, there is a third method for finding the least squares approximate solution \(\xhat\) using a \(QR\) factorization of the matrix \(A\text{,}\) and this method is preferable as it is both computationally efficient and numerically more reliable. This is the method implemented in most statistical software packages.
Activity 6.5.4.
Suppose we are interested in finding the least squares approximate solution to the equation \(A\xvec = \bvec\) and that we have the \(QR\) factorization \(A=QR\text{.}\) Explain why the least squares approximate solution is given by solving
\begin{equation*}
R\xhat = Q^{\transpose}\bvec\text{.}
\end{equation*}
Since \(R\) is upper triangular, this is a relatively simple equation to solve using back substitution, as we saw in Section 4.7. We will therefore write the least squares approximate solution as
\begin{equation*}
\xhat = R^{-1}Q^{\transpose}\bvec\text{.}
\end{equation*}
where \(\rho\) denotes a person’s body density in grams per cubic centimeter. Obtaining an accurate measure of \(\rho\) is difficult, however, because it requires submerging the person in water and measuring the volume of water displaced. Instead, we will gather several other body measurements, which are more easily obtained, and use them to predict \(BFI\text{.}\)
For instance, suppose we take 10 patients and measure their weight \(w\) in pounds, height \(h\) in inches, abdomen \(a\) in centimeters, wrist circumference \(r\) in centimeters, neck circumference \(n\) in centimeters, and \(BFI\text{.}\) Evaluating the following cell loads and displays the data.
In addition, that cell provides:
vectors weight, height, abdomen, wrist, neck, and BFI formed from the columns of the dataset.
the command onesvec(n), which returns an \(n\)-dimensional vector whose entries are all one.
the command QR(A) that returns the \(QR\) factorization of \(A\) as Q, R = QR(A).
the command demean(v), which returns the demeaned vector \(\widetilde{\vvec}\text{.}\)
Use the first data point to write an equation for the parameters \(\beta_0,\beta_1,\ldots,\beta_5\text{.}\)
Describe the linear system \(A\xvec = \bvec\) for these parameters. More specifically, describe how the matrix \(A\) and the vector \(\bvec\) are formed.
Construct the matrix \(A\) and find its \(QR\) factorization in the cell below.
Find the least squares approximate solution \(\xhat\) by solving the equation \(R\xhat =
Q^{\transpose}\bvec\text{.}\) You may want to use N(xhat) to display a decimal approximation of the vector. What are the parameters \(\beta_0,\beta_1,\ldots,\beta_5\) that best fit the data?
Find the coefficient of determination \(R^2\) for your parameters. What does this imply about the quality of the fit?
Suppose a person’s measurements are: weight 190, height 70, abdomen 90, wrist 18, and neck 35. Estimate this person’s \(BFI\text{.}\)
The columns of \(Q\) form an orthonormal basis for \(\col(A)\) so that \(\bhat=QQ^{\transpose}\bvec\text{.}\) The equation \(A\xhat=\bhat\) then becomes \(QR\xhat=QQ^{\transpose}\bvec\text{.}\)
Since \(Q^{\transpose}Q=I\text{,}\) we have \(Q^{\transpose}QR\xhat =
Q^{\transpose}QQ^{\transpose}\bvec\text{,}\) which gives \(R\xhat =
Q^{\transpose}\bvec\text{.}\)
\(A\) is the \(10\by6\) matrix whose columns are a vector of all 1’s followed by the vectors of weights, heights, abdominal, wrist, and neck measurements. The vector \(\bvec\) is the vector of BFI readings.
\(Q\) is a \(10\by6\) matrix and \(R\) is a \(6\by6\) upper triangular matrix.
If the columns of \(A\) are linearly independent and we have the \(QR\) factorization \(A=QR\text{,}\) then the least squares approximate solution \(\xhat\) to the equation \(A\xvec=\bvec\) is given by
\begin{equation*}
\xhat = R^{-1}Q^{\transpose}\bvec\text{.}
\end{equation*}
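In Python, this translates into a few lines; here is a minimal sketch using SciPy (a thin \(QR\) factorization followed by back substitution), checked on the three-point example from earlier in the section:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def lstsq_qr(A, b):
    """Least squares approximate solution to A x = b via a thin QR factorization.
    Assumes the columns of A are linearly independent."""
    Q, R = qr(A, mode='economic')         # A = QR with R square and upper triangular
    return solve_triangular(R, Q.T @ b)   # back substitution for R xhat = Q^T b

A = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
b = np.array([1., 1., 3.])
print(lstsq_qr(A, b))                     # [-0.333  1.]
```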
In the examples we’ve seen so far, we have fit a linear function to a dataset. Sometimes, however, a polynomial, such as a quadratic function, may be more appropriate. It turns out that the techniques we’ve developed in this section are still useful as the next activity demonstrates.
Activity 6.5.5.
Suppose that we have a small dataset containing the points \((0,2)\text{,}\) \((1,1)\text{,}\) \((2,3)\text{,}\) and \((3,3)\text{,}\) such as appear when the following cell is evaluated. We would like to fit a quadratic function \(y = \beta_0 + \beta_1 x + \beta_2 x^2\) to these points.
Write four equations, one for each data point, that describe the coefficients \(\beta_0\text{,}\)\(\beta_1\text{,}\) and \(\beta_2\text{.}\)
Express these four equations as a linear system \(\yvec = X \betavec \) where \(\betavec = \threevec{\beta_0}{\beta_1}{\beta_2}\text{.}\)
Find the \(QR\) factorization of \(X\) and use it to find the least squares approximate solution \(\betahat\text{.}\)
Use the parameters \(\beta_0\text{,}\) \(\beta_1\text{,}\) and \(\beta_2\) that you found to write the quadratic function that fits the data. Create a plot that includes the raw data and the quadratic fit.
What is your predicted \(\hat{y}\) value when \(x=1.5\text{?}\)
Find the coefficient of determination \(R^2\) for the quadratic function. What does this say about the quality of the fit? Then repeat these steps using a cubic function \(y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\) and find \(R^2\) for the cubic fit.
For the cubic fit, \(R^2=1\text{,}\) which means that we have a perfect fit.
The graph of the cubic function passes through each data point.
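For reference, here is a minimal sketch of the quadratic fit in this activity carried out with NumPy and SciPy rather than the provided cells; fitting the cubic works the same way after changing the degree of the model matrix.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

# Data from the activity
x = np.array([0., 1., 2., 3.])
y = np.array([2., 1., 3., 3.])

# Model matrix with columns 1, x, x^2
X = np.vander(x, 3, increasing=True)

# Least squares fit via QR: solve R betahat = Q^T y
Q, R = qr(X, mode='economic')
betahat = solve_triangular(R, Q.T @ y)
print(betahat)                                 # approximately [ 1.75 -0.25  0.25]

# Predicted value at x = 1.5
print(np.array([1., 1.5, 1.5**2]) @ betahat)   # approximately 1.94

# Coefficient of determination
yhat = X @ betahat
R2 = np.sum((yhat - y.mean())**2) / np.sum((y - y.mean())**2)
print(R2)           # roughly 0.55, so the quadratic is not a perfect fit
```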
The matrices \(X\) that you created in the last activity when fitting a quadratic and cubic function to a dataset have a special form. In particular, if the data points are labeled \((x_i, y_i)\) and we seek a degree \(k\) polynomial, then
\begin{equation*}
X = \begin{bmatrix}
1 \amp x_1 \amp x_1^2 \amp \cdots \amp x_1^k \\
1 \amp x_2 \amp x_2^2 \amp \cdots \amp x_2^k \\
\vdots \amp \vdots \amp \vdots \amp \ddots \amp \vdots \\
1 \amp x_n \amp x_n^2 \amp \cdots \amp x_n^k \\
\end{bmatrix}\text{.}
\end{equation*}
This is called a Vandermonde matrix of degree \(k\text{.}\) You can use numpy.polynomial.polynomial.polyvander() to create these matrices for a specified vector \(\xvec\) and degree.
Notice that \(0^0\) is treated as \(1\) for the purposes of this matrix.
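For example, here is a sketch of building the degree 2 Vandermonde matrix for the \(x\)-values of the previous activity:

```python
import numpy as np
from numpy.polynomial.polynomial import polyvander

x = np.array([0., 1., 2., 3.])
print(polyvander(x, 2))        # columns are x^0, x^1, x^2
# [[1. 0. 0.]
#  [1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```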
Activity 6.5.6.
This activity explores a dataset describing Arctic sea ice that comes from Sustainability Math.
Evaluating the cell below will plot the extent of Arctic sea ice, in millions of square kilometers, during the twelve months of 1980, 2012, and 2017. We will focus primarily on 2012.
Find the vector \(\betahat\text{,}\) the least squares approximate solution to the linear system that results from fitting a degree 5 polynomial to the data.
Plot the data along with the fitted polynomial model.
Find the coefficient of determination \(R^2\) for this polynomial fit.
Repeat these steps to fit a degree 8 polynomial to the data, plot the polynomial with the data, and find \(R^2\text{.}\)
Repeat one more time by fitting a degree 11 polynomial to the data, creating a plot, and finding \(R^2\text{.}\)
It’s certainly true that higher degree polynomials fit the data better, as seen by the increasing values of \(R^2\text{,}\) but that’s not always a good thing. For instance, when \(k=11\text{,}\) you may notice that the graph of the polynomial wiggles a little more than we would expect. In this case, the polynomial is trying too hard to fit the data, which usually contains some uncertainty, especially if it’s obtained from measurements. The error built in to the data is called noise, and its presence means that we shouldn’t expect our polynomial to fit the data perfectly. When we choose a polynomial whose degree is too high, we give the noise too much influence over the fit of the model, which leads to some undesirable behavior, like the wiggles in the graph.
Fitting the data with a function that is too flexible and fits the training data better than it can be expected to fit new data is called overfitting, a phenomenon that can appear in many machine learning applications. Generally speaking, we would like to choose \(k\) large enough to capture the essential features of the data but not so large that we overfit and build the noise into the model. What we really need is a method for selecting a good value for \(k\) and a better way to measure how well we should expect the model to fit new data, not the data used to train the model. That discussion would take us too far afield for the moment, but it is an important discussion.
Choosing a reasonable value of \(k\text{,}\) estimate the extent of Arctic sea ice at month 6.5, roughly at the Summer Solstice.
The fifth degree polynomial fits the data fairly well.
\(\displaystyle R^2=0.99\)
\(R^2=0.9997\text{.}\)
\(\displaystyle R^2=1\)
\(k=5\) seems like a good choice, and this gives the prediction of \(8.7\) million square kilometers of sea ice.
Subsection 6.5.6 Fitting linear models with standard tools
Coming soon.
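In the meantime, here is one minimal sketch of a standard tool: NumPy's built-in least squares routine applied to the three-point example from this section. (This subsection may ultimately present different tools, such as a dedicated statistics library.)

```python
import numpy as np

# The three-point example from earlier in the section
X = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])
y = np.array([1., 1., 3.])

# numpy.linalg.lstsq solves the least squares problem in one call
betahat, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(betahat)     # [-0.333  1.]
print(rss)         # [0.667], the minimized sum of squared residuals
```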
Subsection 6.5.7 Summary
This section introduced some types of least squares problems and a framework for working with them.
Given an inconsistent system \(A\xvec=\bvec\text{,}\) we find \(\xhat\text{,}\) the least squares approximate solution, by requiring that \(A\xhat\) be as close to \(\bvec\) as possible. In other words, we solve \(A\xhat = \bhat\) where \(\bhat = \proj{\bvec}{\col(A)}\text{.}\)
One important application of this is fitting linear models to data. In that context, we typically use different letters. Instead of \(A \xvec = \bvec\text{,}\) you are more likely to see \(\yvec = X \betavec\text{.}\) Here
\(\yvec\) represents the response variable.
the variable we are trying to predict or estimate from other available data.
\(X\) represents the data matrix.
Each row of \(X\) represents an observational unit. Each column represents a data variable. Often we include a column of 1’s in \(X\text{;}\) this allows us to model an intercept, which represents a baseline amount that is part of every prediction. \(X\) may include the results of applying a function to some of the "raw data"; after all, that’s just another variable.
\(\betavec = \begin{bmatrix}\beta_0\\\beta_1\\ \vdots \\ \beta_p \\\end{bmatrix}\) represents the coefficients of the model.
One way to find \(\xhat\) with \(A \xhat = \bhat\) is by solving the normal equations
\begin{equation*}
A^{\transpose}A\xhat = A^{\transpose}\bvec\text{.}
\end{equation*}
A second way to find \(\xhat\) with \(A \xhat = \bhat\) uses a \(QR\) factorization of \(A\text{.}\) If \(A=QR\text{,}\) then \(\xhat = R^{-1}Q^{\transpose}\bvec\) and finding \(R^{-1}\) is computationally feasible since \(R\) is upper triangular. Alternatively, we can use backsubstitution to solve \(R \xhat = Q^{\transpose} \bvec\text{.}\)
The statistical version of this is \(X = QR\) and \(R \betahat = Q^{\transpose} \yvec\text{.}\)
This technique may be applied widely and is useful for modeling data. We saw examples in this section where linear functions of several input variables and polynomials provided effective models for different datasets.
A simple measure of the quality of the fit is the coefficient of determination \(R^2\) though some care must be used in interpreting this number in context. In particular, as models become more complex, \(R^2\) generally increases because more flexible models can fit the data better. But they may be prone to overfitting. Our goal is generally not to fit the data at hand but to learn something of value about other data.
Exercises 6.5.8
Here are some useful Python commands you may like to use in these exercises.
Q, R = scipy.linalg.qr(A) returns the \(QR\) factorization of a matrix \(A\) as Q and R,
np.ones(n) returns the \(n\)-dimensional vector whose entries are all 1,
x - x.mean() demeans the vector x,
numpy.vander(x, N) returns the Vandermonde matrix of degree N - 1 (N columns) formed from the components of the vector x; by default its columns are in decreasing powers of x, so pass increasing=True to get the ordering 1, x, x^2, ... used in this section, and
pandas.read_csv() reads csv data into a pandas data frame.
x.to_numpy() will convert pandas data frames and columns of data frames into NumPy arrays.
numpy.set_printoptions(precision = k, suppress = True) sets the precision used for NumPy’s numerical output and suppresses the use of scientific notation for small values.
pprint.pprint() sometimes offers a nicer display of output.
The following packages and modules have been pre-loaded on this page.
Set up the matrix equation \(\yvec = X \betavec\) for a simple linear model (intercept and slope) for these data. That is, specify \(X\) and \(\yvec\text{.}\)
Write the normal equation that describes the least squares approximate solution to \(\yvec = X\betavec\text{.}\) Write the equation symbolically first, then plug in the matrices for this problem and simplify.
Find the least squares approximate solution \(\betahat\) and create a plot showing the original data and the least squares regression line.
What is your predicted \(y\)-value when \(x=3.5\text{?}\)
Find the coefficient of determination \(R^2\text{.}\)
Set up the matrix equation \(\yvec = X \betavec\) for a quadratic model (\(y = \beta_0 + \beta_1 x + \beta_2 x^2\)) for these data. That is, specify \(X\) and \(\yvec\text{.}\)
Use a \(QR\) factorization to find the least squares approximate solution \(\betahat\) and plot the data and the graph of the resulting quadratic function.
What is your predicted \(y\)-value when \(x=3.5\text{?}\)
Find the coefficient of determination \(R^2\text{.}\)
Find the least squares approximate solution \(\betahat\text{.}\)
What is your predicted \(y\)-value when \(x_1 =
2.4\) and \(x_2=2.9\text{?}\)
Find the coefficient of determination \(R^2\text{.}\)
Repeat for a quadratic model \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2\text{.}\)
5.
Determine whether the following statements are true or false and explain your thinking.
If \(A\xvec=\bvec\) is consistent, \(\bhat = \proj{\bvec}{\col(A)}\text{,}\) and \(A\xhat = \bhat\text{,}\) then \(\xhat\) is a solution to \(A\xvec=\bvec\text{.}\)
If \(R^2=1\text{,}\) then the least squares approximate solution \(\xhat\) is also a solution to the original equation \(A\xvec=\bvec\text{.}\)
Given the \(QR\) factorization \(A=QR\text{,}\) we have \(A\xhat=Q^{\transpose}Q\bvec\text{.}\)
A \(QR\) factorization provides a method for finding the approximate least squares solution to \(A\xvec=\bvec\) that is more reliable than solving the normal equations.
A solution to \(AA^{\transpose}\xvec = A\bvec\) is the least squares approximate solution to \(A\xvec = \bvec\text{.}\)
6.
Explain your response to the following questions.
In a least squares regression model, if \(\betahat=\zerovec\text{,}\) what does this say about the vector \(\yvec\text{?}\)
If the columns of \(X\) are orthonormal, how can you easily find the least squares approximate solution to \(\yvec = X\betavec\text{?}\)
7.
The following cell loads in some data showing the number of people in Bangladesh living without electricity over 27 years. It also defines vectors year, which records the years in the data set, and people, which records the number of people living without electricity.
Suppose we want to write
\begin{equation*}
N = \beta_0 + \beta_1 t
\end{equation*}
where \(t\) is the year and \(N\) is the number of people living without electricity. Construct the matrix \(X\) and vector \(\yvec\) so that the linear system \(X\betavec=\yvec\) describes the vector \(\betavec = \twovec{\beta_0}{\beta_1}\text{.}\)
Using a \(QR\) factorization of \(X\text{,}\) find the values of \(\beta_0\) and \(\beta_1\) in the least squares approximate solution \(\betahat\text{.}\)
What is the coefficient of determination \(R^2\) and what does this tell us about the quality of the approximation?
What is your prediction for the number of people living without electricity in 1985?
Estimate the year in which there will be no people living without electricity.
8.
This problem concerns a data set describing planets in our Solar system. For each planet, we have the length \(L\) of the semi-major axis, essentially the distance from the planet to the Sun in AU (astronomical units), and the period \(P\text{,}\) the length of time in years required to complete one orbit around the Sun.
We would like to model this data using the function \(P = CL^r\) where \(C\) and \(r\) are parameters we need to determine. Since this isn’t a linear function, we will transform this relationship by taking the natural logarithm of both sides to obtain
\begin{equation*}
\log(P) = \log(C) + r\log(L)\text{.}
\end{equation*}
Evaluating the following cell loads the data set and defines two vectors logaxis, whose components are \(\log(L)\text{,}\) and logperiod, whose components are \(\log(P)\text{.}\)
Construct the matrix \(X\) and vector \(\yvec\) for Model 1 so that the solution to \(X\betavec=\yvec\) is the vector \(\betahat = \ctwovec{\log(C)}r\text{.}\)
Find the least squares approximate solution \(\betahat\text{.}\) What does this give for the values of \(C\) and \(r\text{?}\)
Find the coefficient of determination \(R^2\text{.}\) What does this tell us about the quality of the approximation?
Repeat the process above for Model 2.
Suppose that the orbit of an asteroid has a semi-major axis whose length is \(L=4.0\) AU. Estimate the period \(P\) of the asteroid’s orbit. Which model should you use? Do you get the same result if you use the other model?
Halley’s Comet has a period of \(P=75\) years. Estimate the length of its semi-major axis. Which model should you use? Do you get the same result if you use the other model?
9.
Evaluating the following cell loads a data set describing the temperature in the Earth’s atmosphere at various altitudes. There are also two vectors altitude, expressed in kilometers, and temperature, in degrees Celsius.
Describe how to form the matrix \(X\) and vector \(\yvec\) so that the linear system \(X\betavec=\yvec\) describes a degree \(k\) polynomial fitting the data.
After choosing a value of \(k\text{,}\) construct the matrix \(X\) and vector \(\yvec\text{,}\) and find the least squares approximate solution \(\betahat\text{.}\)
Plot the polynomial by sampling some points on the curve and plotting them.
Now examine what happens as you vary the degree of the polynomial \(k\text{.}\) Choose an appropriate value of \(k\) that seems to capture the most important features of the data while avoiding overfitting, and explain your choice.
Use your value of \(k\) to estimate the temperature at an altitude of 55 kilometers.
10.
The following cell loads some data describing 1057 houses in a particular real estate market. For each house, we record the living area in square feet, the lot size in acres, the age in years, and the price in dollars.
We will use linear regression to predict the price of a house given its living area, lot size, and age:
\begin{equation*}
\text{price} = \beta_0 + \beta_1(\text{living area}) + \beta_2(\text{lot size}) + \beta_3(\text{age})\text{.}
\end{equation*}
Use a \(QR\) factorization to find the least squares approximate solution \(\betahat\text{.}\)
Discuss what the signs of \(\hat\beta_1\text{,}\) \(\hat\beta_2\text{,}\) and \(\hat\beta_3\) tell you about the prices of houses.
If two houses are identical except for differing in age by one year, how would you predict that their prices compare to one another?
Find the coefficient of determination \(R^2\text{.}\) What does this say about the quality of the fit?
Predict the price of a house whose living area is 2000 square feet, lot size is 1.5 acres, and age is 50 years.
11.
We observed that if the columns of \(A\) are linearly independent, then there is a unique least squares approximate solution to the equation \(A\xvec=\bvec\) because the equation \(A\xhat=\bhat\) has a unique solution. We also said that \(\xhat\) is the unique solution to the normal equation \(A^{\transpose}A\xhat = A^{\transpose}\bvec\) without explaining why this equation has a unique solution. This exercise offers an explanation.
Assuming that the columns of \(A\) are linearly independent, we would like to conclude that the equation \(A^{\transpose}A\xhat=A^{\transpose}\bvec\) has a unique solution.
Suppose that \(\xvec\) is a vector for which \(A^{\transpose}A\xvec = \zerovec\text{.}\) Explain why the following argument is valid and allows us to conclude that \(A\xvec = \zerovec\text{:}\)
\begin{equation*}
\len{A\xvec}^2 = (A\xvec)\cdot(A\xvec) = \xvec\cdot\left(A^{\transpose}A\xvec\right) = \xvec\cdot\zerovec = 0\text{.}
\end{equation*}
In other words, if \(A^{\transpose}A\xvec = \zerovec\text{,}\) we know that \(A\xvec = \zerovec\text{.}\)
If the columns of \(A\) are linearly independent and \(A\xvec = \zerovec\text{,}\) what do we know about the vector \(\xvec\text{?}\)
Explain why \(A^{\transpose}A\xvec = \zerovec\) can only happen when \(\xvec = \zerovec\text{.}\)
Assuming that the columns of \(A\) are linearly independent, explain why \(A^{\transpose}A\xhat=A^{\transpose}\bvec\) has a unique solution.
12.
This problem is about the meaning of the coefficient of determination \(R^2\) and its connection to variance, a topic that appears in the next section. Throughout this problem, we consider the linear system \(\yvec = X\betavec\) and the approximate least squares solution \(\betahat\text{,}\) where \(\yhat = X\betahat\text{.}\) We suppose that \(X\) is an \(m\by n\) matrix, and we will denote the \(m\)-dimensional vector \(\onevec = \fourvec11{\vdots}1\text{.}\)
Explain why \(\ybar\text{,}\) the mean of the components of \(\yvec\text{,}\) can be found as the dot product
\begin{equation*}
\ybar = \frac1m \onevec \cdot \yvec\text{,}
\end{equation*}
and hence why the mean of the components of \(\yvec^\perp = \yvec - \yhat\) is zero.
The variance of an \(m\)-dimensional vector \(\vvec\) is \(\var(\vvec) = \frac1m \len{\widetilde{\vvec}}^2\text{,}\) where \(\widetilde{\vvec}\) is the vector obtained by demeaning \(\vvec\text{.}\)
These expressions indicate why it is sometimes said that \(R^2\) measures the “fraction of the variance explained” by the model we are using to fit the data. As seen in the previous exercise, there may be other features that are not recorded in the dataset that influence the quantity we wish to predict.