Fitting a Line to Data

Consider the following small data set.

SomeData <- tibble(
  x = c(1,2,4,5,6),
  y = c(1,4,3,6,5)
)
SomeData
## # A tibble: 5 x 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
## 2     2     4
## 3     4     3
## 4     5     6
## 5     6     5
gf_point( y ~ x, data = SomeData) %>%
  gf_lims(y = c(0, 7), x = c(0, 7))

The points fall roughly along a line. What line is the best line to describe the relationship between x and y?

The Least Squares Method

We want to determine the line that best fits the data. The usual method is the method of least squares, which chooses the line with the smallest possible sum of squared residuals, where the residuals are defined by

\[ \mbox{residual} = \mbox{observed} - \mbox{predicted} \]

So, returning to our example, let's first see how well we can do by hand:

  1. Add a line to the plot that "fits the data well". Don't do any calculations, just add the line.
  2. Now estimate the residuals for each point relative to your line.
  3. Compute the sum of the squared residuals, \(SSE\).
  4. Estimate the slope and intercept of your line.

Example Suppose we select a line that passes through \((1,2)\) and \((6,6)\). The equation of this line is \(y = 1.2 + 0.8 x\), and it looks like a pretty good fit:

f <- makeFun( 1.2 + 0.8 * x ~ x)
gf_point(y ~ x, data = SomeData) %>% 
  gf_lims(x = c(0, 7), y = c(0, 7)) %>%
  gf_fun( f(x) ~ x, col = "gray50" )

The residuals for this function are

resids <- with(SomeData, y - f(x)) ; resids 
## [1] -1.0  1.2 -1.4  0.8 -1.0

and \(SSE\) is

sum(resids^2)
## [1] 6.04
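
If you want to experiment with other candidate lines, a small helper function (a sketch, not part of the original code; the name SSE_for_line is ours) computes \(SSE\) for any intercept and slope:

# Sketch: SSE for an arbitrary candidate line y = intercept + slope * x
SSE_for_line <- function(intercept, slope, data = SomeData) {
  with(data, sum((y - (intercept + slope * x))^2))
}
SSE_for_line(1.2, 0.8)   # reproduces the 6.04 computed above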

One way to visualize \(SSE\) is to imagine attaching a square to each data point whose side length is the size of the residual; \(SSE\) is the total area of those squares.

  • If your line is a good fit, then \(SSE\) will be small.

  • The best fitting line will have the smallest possible \(SSE\).

  • The lm() function will find this best fitting line for us.

model1 <- lm( y ~ x, data = SomeData ); model1
## 
## Call:
## lm(formula = y ~ x, data = SomeData)
## 
## Coefficients:
## (Intercept)            x  
##       1.163        0.733

This says that the equation of the best fit line is \[ \hat y = 1.163 + 0.733 x \]

gf_point(y ~ x, data = SomeData) %>%
  gf_lm()

We can compute \(SSE\) using the resid() function.

SSE <- sum (resid(model1)^2); SSE
## [1] 5.57

As we see, this is a better fit than our first attempt, at least according to the least squares criterion. It will be better than any other attempt as well; it is the least squares regression line.
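
Putting the two \(SSE\) values side by side (using the resids and SSE objects computed above) makes the comparison concrete:

# Compare the hand-picked line with the least squares line
c(hand.picked = sum(resids^2), least.squares = SSE)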

Properties of the Least Squares Regression Line

For a line with equation \(y = \hat\beta_0 + \hat\beta_1 x\), the residuals are \[ e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i) \] and the sum of the squares of the residuals is \[ SSE = \sum e_i^2 = \sum (y_i - (\hat\beta_0 + \hat\beta_1 x_i) )^2 \] Simple calculus (which we won't do here) allows us to compute the best \(\hat\beta_0\) and \(\hat\beta_1\) possible.
These best values define the least squares regression line. We always compute these values using software, but it is good to note that the least squares line satisfies two very nice properties.

  1. The point \((\overline x, \overline y)\) is on the line. This means that \(\overline y = \hat\beta_0 + \hat\beta_1 \overline x\), so \(\hat\beta_0 = \overline y - \hat\beta_1 \overline x\).

  2. The slope of the line is \(\displaystyle \hat\beta_1 = r \frac{s_y}{s_x}\), where \(r\) is the correlation coefficient: \[ r = \frac{1}{n-1} \sum \frac{ x_i - \overline x }{s_x} \cdot \frac{ y_i - \overline y }{s_y} \]

Since we have a point and the slope, it is easy to compute the equation for the line if we know \(\overline x\), \(s_x\), \(\overline y\), \(s_y\), and \(r\).
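
As a quick check of the first property (a sketch using model1 and SomeData from above; f1 is our own name for the fitted-line function), we can verify that \((\overline x, \overline y)\) lies on the fitted line:

# Check that the point (x-bar, y-bar) lies on the least squares line
x.bar <- mean(~ x, data = SomeData)
y.bar <- mean(~ y, data = SomeData)
f1 <- makeFun(model1)                        # the fitted line as a function
c(y.bar = y.bar, line.at.x.bar = f1(x.bar))  # these two values should agree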

Example In a study of eye strain caused by visual display terminals, researchers wanted to be able to estimate ocular surface area (OSA) from palpebral fissure (the horizontal width of the eye opening, in cm) because palpebral fissure is easier to measure than OSA.

Eyes <- 
  read.table(
    "https://rpruim.github.io/Engineering-Statistics/data/PalpebralFissure.txt", 
    header = TRUE)
head(Eyes, 3) 
##   palpebral  OSA
## 1      0.40 1.02
## 2      0.42 1.21
## 3      0.48 0.88
x.bar <- mean( ~ palpebral, data = Eyes)  
y.bar <- mean( ~ OSA, data = Eyes)
s_x <- sd( ~ palpebral, data = Eyes)
s_y <- sd( ~ OSA, data = Eyes)
r <- cor( palpebral ~ OSA, data = Eyes)
c( x.bar = x.bar, y.bar = y.bar, s_x = s_x, s_y = s_y, r = r )
## x.bar y.bar   s_x   s_y     r 
## 1.051 2.840 0.380 1.208 0.968
slope <- r * s_y / s_x
intercept <- y.bar - slope * x.bar
c(intercept = intercept, slope = slope)
## intercept     slope 
##    -0.398     3.080

Fortunately, statistical software packages do all this work for us, so the calculations of the preceding example don’t need to be done in practice.

Here is the same model fit using lm():

osa.model <- lm(OSA ~ palpebral, data = Eyes) 
osa.model
## 
## Call:
## lm(formula = OSA ~ palpebral, data = Eyes)
## 
## Coefficients:
## (Intercept)    palpebral  
##      -0.398        3.080

lm() stands for linear model. The default output includes the estimates of the coefficients (\(\hat\beta_0\) and \(\hat \beta_1\)) based on the data. If that is the only information we want, then we can use

coef(osa.model)
## (Intercept)   palpebral 
##      -0.398       3.080

This means that the equation of the least squares regression line is \[ \hat y = -0.398 + 3.08 x \]

We use \(\hat y\) to indicate that this is not an observed value of the response variable but an estimated value (based on the linear equation given).

R can add a regression line to our scatter plot if we ask it to.

gf_point( OSA ~ palpebral, data = Eyes) %>%
  gf_lm()

We see that the line does run roughly "through the middle" of the data but that there is some variability above and below the line.

Explanatory and Response Variables Matter

It is important that the explanatory variable be the x variable and the response variable be the y variable when doing regression. If we reverse the roles of OSA and palpebral we do not get the same model. This is because the residuals are measured vertically (in the y direction).
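
We can see this directly by fitting the reversed model (a quick sketch using the same data; the name reversed.model is ours, not from the text):

# Reversing the roles of explanatory and response gives a different line
reversed.model <- lm(palpebral ~ OSA, data = Eyes)
coef(reversed.model)
# Solving this line algebraically for OSA gives slope 1/b and intercept -a/b,
# which do NOT match coef(osa.model) unless the correlation is exactly +/- 1
c(intercept = -coef(reversed.model)[[1]] / coef(reversed.model)[[2]],
  slope = 1 / coef(reversed.model)[[2]])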

Remember: x for explanatory.


A more general framework

Linear models (aka linear regression) and their generalizations explore how to use data to determine the relationship among two or more variables when this relationship is not known in advance. The general framework we will use is

\[ Y = f(x_1, x_2, \dots, x_k) + \varepsilon \]

The Simple Linear Regression Model

\[ Y = \beta_0 + \beta_1 x + \varepsilon \qquad \mbox{where $\varepsilon \sim {\sf Norm}(0,\sigma)$.} \]

In other words:

  • The mean response for a given predictor value \(x\) is given by a linear formula \[ \mbox{mean response} = \beta_0 + \beta_1 x \] This can also be written as \[ \operatorname{E}(Y \mid X = x) = \beta_0 + \beta_1 x \]

Furthermore, in this model the values of \(\varepsilon\) are independent.
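
To make this model concrete, here is a small simulation sketch (the names SimData, beta_0, beta_1, and sigma, and their values, are made up for illustration): we generate data from the model and then recover the coefficients with lm().

# Sketch: simulate data from Y = beta_0 + beta_1 * x + epsilon, epsilon ~ Norm(0, sigma)
set.seed(123)
beta_0 <- 1; beta_1 <- 0.7; sigma <- 0.5
SimData <- tibble(
  x = runif(50, 0, 10),
  y = beta_0 + beta_1 * x + rnorm(50, mean = 0, sd = sigma)
)
lm(y ~ x, data = SimData)   # estimated coefficients should be near 1 and 0.7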

Estimating the Response

We can use our least squares regression line to estimate the value of the response variable from the value of the explanatory variable.

Example If the palpebral width is 1.2 cm, then we would estimate OSA to be

\[ \hat{\texttt{osa}} = -0.398 + 3.08 \cdot 1.2 = 3.298 \]

R can automate this for us too. The makeFun() function will create a function from our model. If we input a palpebral measurement into this function, the function will return the estimated OSA.

estimated.osa <- makeFun(osa.model)
estimated.osa(1.2)
##   1 
## 3.3

As it turns out, the 17th measurement in our data set had a palpebral measurement of 1.2 cm.

Eyes[17,]
##    palpebral  OSA
## 17       1.2 3.76

The corresponding OSA of 3.76 means that the residual for this observation is \[ \mbox{observed} - \mbox{predicted} = 3.76 - 3.298 = 0.462 \]
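
We can also extract this residual directly from the model (a sketch; the value will differ slightly from 0.462 because the hand computation above used rounded coefficients):

# Residual for observation 17, straight from the fitted model
resid(osa.model)[17]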

Cautionary Note: Don’t Extrapolate!

While it often makes sense to generate model predictions corresponding to x-values within the range of values measured in the dataset, it is dangerous to extrapolate and make predictions for values outside the range included in the dataset. To assume that the linear relationship observed in the dataset holds for explanatory variable values outside the observed range, we would need a convincing, valid justification, which is usually not available. If we extrapolate anyway, we risk generating erroneous or even nonsense predictions. The problem generally gets worse as we stray further from the observed range of explanatory-variable values.
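
One simple habit (a sketch, not from the text) is to check the observed range of the explanatory variable before making a prediction:

# Check the observed range of palpebral widths before predicting
with(Eyes, range(palpebral))
# estimated.osa() will happily return a number for a value far outside this
# range (10 is an arbitrary choice), but we have no reason to trust the
# linear relationship there
estimated.osa(10)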

Some Practice

1. A data set with 35 observations of two variables has a

What is the equation of the regression line?