Consider the following small data set.
library(mosaic)    # attaches ggformula (gf_point(), gf_lims(), ...) and provides makeFun()
library(tibble)    # tibble()

SomeData <- tibble(
  x = c(1, 2, 4, 5, 6),
  y = c(1, 4, 3, 6, 5)
)
SomeData
## # A tibble: 5 x 2
## x y
## <dbl> <dbl>
## 1 1 1
## 2 2 4
## 3 4 3
## 4 5 6
## 5 6 5
gf_point( y ~ x, data = SomeData) %>%
gf_lims(y = c(0, 7), x = c(0, 7))
The points fall roughly along a line. What line best describes the relationship between x and y?
We want to determine the best-fitting line for the data. The usual method is the method of least squares, which chooses the line that has the smallest possible sum of squared residuals, where residuals are defined by
\[ \mbox{residual} = \mbox{observed} - \mbox{predicted} \]
Before asking software for the least squares regression line, let's return to our small data set and see how well a line chosen by eye fits.
Example Suppose we select a line that passes through \((1,2)\) and \((6,6)\). The equation for this line is \(y = 1.2 + 0.8 x\), and it looks like a pretty good fit:
f <- makeFun( 1.2 + 0.8 * x ~ x)
gf_point(y ~ x, data = SomeData) %>%
gf_lims(x = c(0, 7), y = c(0, 7)) %>%
gf_fun( f(x) ~ x, col = "gray50" )
The residuals for this function are
resids <- with(SomeData, y - f(x)) ; resids
## [1] -1.0 1.2 -1.4 0.8 -1.0
and \(SSE\) is
sum(resids^2)
## [1] 6.04
The following plot provides a way to visualize the sum of the squared residuals (SSE).
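One way to draw such a plot (a sketch using ggformula's gf_segment(); it marks each residual as a vertical segment between the observed point and the line rather than drawing the squares themselves) is:
# Sketch: mark each residual as a vertical segment from the observed point to the line.
SomeData2 <- SomeData
SomeData2$predicted <- f(SomeData2$x)   # value on the line y = 1.2 + 0.8 x
SomeData2$xend <- SomeData2$x           # segments are vertical, so xend = x

gf_point(y ~ x, data = SomeData2) %>%
  gf_lims(x = c(0, 7), y = c(0, 7)) %>%
  gf_fun(f(x) ~ x, col = "gray50") %>%
  gf_segment(y + predicted ~ x + xend, data = SomeData2, color = "red")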
If your line is a good fit, then \(SSE\) will be small.
The best fitting line will have the smallest possible \(SSE\).
The lm() function will find this best fitting line for us.
model1 <- lm( y ~ x, data = SomeData ); model1
##
## Call:
## lm(formula = y ~ x, data = SomeData)
##
## Coefficients:
## (Intercept) x
## 1.163 0.733
This says that the equation of the best fit line is \[ \hat y = 1.163 + 0.733 x \]
gf_point(y ~ x, data = SomeData) %>%
gf_lm()
We can compute \(SSE\) using the resid() function.
SSE <- sum (resid(model1)^2); SSE
## [1] 5.57
As we see, this is a better fit than our first attempt – at least according to the least squares criterion. It will be better than any other attempt as well – it is the least squares regression line.
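As a quick numerical check of this claim (a sketch, not part of the original example), we can minimize \(SSE\) over intercept and slope with base R's optim() and confirm that the search lands on the same line:
# Minimize SSE over (intercept, slope) numerically and compare with lm().
SSE.of <- function(beta) {
  with(SomeData, sum((y - (beta[1] + beta[2] * x))^2))
}
best <- optim(c(0, 1), SSE.of)   # Nelder-Mead search from an arbitrary start
best$par                         # close to c(1.163, 0.733) -- the lm() coefficients
best$value                       # close to 5.57 -- the SSE computed above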
For a line with equation \(y = \hat\beta_0 + \hat\beta_1 x\), the residuals are \[
e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)
\] and the sum of the squares of the residuals is \[
SSE = \sum e_i^2 = \sum (y_i - (\hat\beta_0 + \hat\beta_1 x_i) )^2
\] Simple calculus (which we won’t do here) allows us to compute the best \(\hat\beta_0\) and \(\hat\beta_1\) possible.
These best values define the least squares regression line. We always compute these values using software, but it is good to note that the least squares line satisfies two very nice properties.
The point \((\overline x, \overline y)\) is on the line.
This means that \(\overline y = \hat\beta_0 + \hat\beta_1 \overline x\) (and so \(\hat\beta_0 = \overline y - \hat\beta_1 \overline x\)).
The slope of the line is \(\displaystyle \hat\beta_1 = r \frac{s_y}{s_x}\) where \(r\) is the correlation coefficient: \[ r = \frac{1}{n-1} \sum \frac{ x_i - \overline x }{s_x} \cdot \frac{ y_i - \overline y }{s_y} \]
Since we have a point and the slope, it is easy to compute the equation for the line if we know \(\overline x\), \(s_x\), \(\overline y\), \(s_y\), and \(r\).
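As a check (a sketch, not in the original example), both properties can be verified for model1 and the small data set above:
# Property 1: the fitted line passes through (x-bar, y-bar).
coef(model1)[1] + coef(model1)[2] * mean(~ x, data = SomeData)   # essentially 3.8 = mean(~ y, data = SomeData)

# Property 2: slope = r * s_y / s_x reproduces the lm() slope.
r  <- cor(y ~ x, data = SomeData)
b1 <- r * sd(~ y, data = SomeData) / sd(~ x, data = SomeData)
c(slope = b1, intercept = mean(~ y, data = SomeData) - b1 * mean(~ x, data = SomeData))
# approximately 0.733 and 1.163, matching coef(model1)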
Example In a study of eye strain caused by visual display terminals, researchers wanted to be able to estimate ocular surface area (OSA) from palpebral fissure (the horizontal width of the eye opening, in cm) because palpebral fissure is easier to measure than OSA.
Eyes <- read.table(
  "https://rpruim.github.io/Engineering-Statistics/data/PalpebralFissure.txt",
  header = TRUE)
head(Eyes, 3)
## palpebral OSA
## 1 0.40 1.02
## 2 0.42 1.21
## 3 0.48 0.88
x.bar <- mean( ~ palpebral, data = Eyes)
y.bar <- mean( ~ OSA, data = Eyes)
s_x <- sd( ~ palpebral, data = Eyes)
s_y <- sd( ~ OSA, data = Eyes)
r <- cor( palpebral ~ OSA, data = Eyes)
c( x.bar = x.bar, y.bar = y.bar, s_x = s_x, s_y = s_y, r = r )
## x.bar y.bar s_x s_y r
## 1.051 2.840 0.380 1.208 0.968
slope <- r * s_y / s_x
intercept <- y.bar - slope * x.bar
c(intercept = intercept, slope = slope)
## intercept slope
## -0.398 3.080
Fortunately, statistical software packages do all this work for us, so the calculations of the preceding example don’t need to be done in practice.
osa.model <- lm(OSA ~ palpebral, data = Eyes)
osa.model
##
## Call:
## lm(formula = OSA ~ palpebral, data = Eyes)
##
## Coefficients:
## (Intercept) palpebral
## -0.398 3.080
lm() stands for linear model. The default output includes the estimates of the coefficients (\(\hat\beta_0\) and \(\hat\beta_1\)) based on the data. If that is the only information we want, then we can use
coef(osa.model)
## (Intercept) palpebral
## -0.398 3.080
This means that the equation of the least squares regression line is \[ \hat y = -0.398 + 3.08 x \]
We use \(\hat y\) to indicate that this is not an observed value of the response variable but an estimated value (based on the linear equation given).
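As an aside (a sketch, not shown in the original), base R's fitted() returns \(\hat y\) for every observation in the data set:
# Fitted values for the first three eyes; for palpebral = 0.40 this is
# approximately -0.398 + 3.08 * 0.40 = 0.83.
head(fitted(osa.model), 3)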
R can add a regression line to our scatter plot if we ask it to.
gf_point( OSA ~ palpebral, data = Eyes) %>%
gf_lm()
We see that the line does run roughly "through the middle" of the data but that there is some variability above and below the line.
It is important that the explanatory variable be the x variable and the response variable be the y variable when doing regression. If we reverse the roles of OSA and palpebral, we do not get the same model. This is because the residuals are measured vertically (in the y direction).
Remember: x for explanatory.
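To see that the two fits really differ (a sketch, not included in the original), we can fit the reversed model and compare its slope with what we would get by algebraically inverting the equation above:
# Regressing palpebral on OSA is NOT the algebraic inverse of osa.model.
reversed.model <- lm(palpebral ~ OSA, data = Eyes)
coef(reversed.model)               # slope is r * s_x / s_y, roughly 0.968 * 0.380 / 1.208 = 0.30
1 / coef(osa.model)["palpebral"]   # inverting the original slope gives about 1 / 3.08 = 0.32 instead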
Linear models (aka linear regression) and their generalizations explore how to use data to determine the relationship among two or more variables when this relationship is not known in advance. The general framework we will use is
\[ Y = f(x_1, x_2, \dots, x_k) + \varepsilon \]
\(Y\) is the response variable that we are trying to estimate from \(k\) explanatory or predictor variables \(x_1, x_2, \dots, x_k\).
The relationship between the explanatory variables and the response variables is described by a function \(f\).
The relationship described by \(f\) need not be a perfect fit. The error term in the model, \(\varepsilon\), describes how individual responses differ from the value given by \(f\).
We will model \(\varepsilon\) with a distribution – typically a distribution with a mean of 0 – so another way to think about this model is that for given values of the predictors, the values of \(Y\) have a distribution. The mean of this distribution is specified by \(f\) and the shape by \(\varepsilon\).
The simple linear regression model uses a linear function of a single predictor for \(f\) and normally distributed errors: \[ Y = \beta_0 + \beta_1 x + \varepsilon \qquad \mbox{where $\varepsilon \sim {\sf Norm}(0,\sigma)$.} \]
In other words, the mean response for a given predictor value \(x\) is given by a linear formula \[ \mbox{mean response} = \beta_0 + \beta_1 x \] This can also be written as \[ \operatorname{E}(Y \mid X = x) = \beta_0 + \beta_1 x \]
Furthermore, in this model the values of \(\varepsilon\) are independent.
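A short simulation (a sketch; the parameter values and sample size here are arbitrary choices for illustration) makes the model concrete: pick \(\beta_0\), \(\beta_1\), and \(\sigma\), generate normally distributed errors, and check that lm() approximately recovers the coefficients:
set.seed(123)                          # arbitrary seed for reproducibility
beta0 <- 2; beta1 <- 0.5; sigma <- 1   # illustrative values only
Sim <- tibble(
  x = runif(100, 0, 10),
  y = beta0 + beta1 * x + rnorm(100, mean = 0, sd = sigma)   # Y = beta_0 + beta_1 x + eps
)
coef(lm(y ~ x, data = Sim))            # estimates should be near 2 and 0.5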
We can use our least squares regression line to estimate the value of the response variable from the value of the explanatory variable.
Example If the palpebral width is 1.2 cm, then we would estimate OSA to be
\[ \hat{\texttt{osa}} = -0.398 + 3.08 \cdot 1.2 = 3.298 \]
R can automate this for us too. The makeFun() function will create a function from our model. If we input a palpebral measurement into this function, the function will return the estimated OSA.
estimated.osa <- makeFun(osa.model)
estimated.osa(1.2)
## 1
## 3.3
As it turns out, the 17th measurement in our data set had a palpebral measurement of 1.2 cm.
Eyes[17,]
## palpebral OSA
## 17 1.2 3.76
The corresponding OSA of 3.76 means that the residual for this observation is \[ \mbox{observed} - \mbox{predicted} = 3.76 - 3.298 = 0.462 \]
While it often makes sense to generate model predictions corresponding to x-values within the range of values measured in the dataset, it is dangerous to extrapolate and make predictions for values outside the range included in the dataset. To assume that the linear relationship observed in the dataset holds for explanatory variable values outside the observed range, we would need a convincing, valid justification, which is usually not available. If we extrapolate anyway, we risk generating erroneous or even nonsense predictions. The problem generally gets worse as we stray further from the observed range of explanatory-variable values.
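For instance, the estimated.osa() function from above will happily produce "predictions" at palpebral widths that make no physical sense:
# Extrapolation gone wrong: a width of 0 yields a negative surface area,
# and a 5 cm eye opening is far outside any plausible (or observed) range.
estimated.osa(0)    # about -0.4
estimated.osa(5)    # about 15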
1. A data set with 35 observations of two variables has a
What is the equation of the regression line?