Quick Review

The Linear Model

A general framework

Linear models (aka linear regression) and their generalizations explore how to use data to determine the relationship among two or more variables when this relationship is not known in advance. The general framework we will use is

\[ Y = f(x_1, x_2, \dots, x_k) + \varepsilon \]

  • \(Y\) is the response variable that we are trying to estimate from \(k\) explanatory or predictor variables \(x_1, x_2, \dots, x_k\).

  • The relationship between the explanatory variables and the response variable is described by a function \(f\).

  • The relationship described by \(f\) need not be a perfect fit. The error term in the model, \(\varepsilon\), describes how individual responses differ from the value given by \(f\): \(\varepsilon_i = Y_i - f(X_{1i}, X_{2i}, \dots, X_{ki})\)

    We will model \(\varepsilon\) with a distribution – typically a distribution with a mean of 0 – so another way to think about this model is that, for given values of the predictors, the values of \(Y\) have a distribution. The mean of this distribution is specified by \(f\) and the shape by \(\varepsilon\).

Special Case: The Simple Linear Regression Model

\[ Y = \beta_0 + \beta_1 x + \varepsilon \qquad \mbox{where $\varepsilon \sim {\sf Norm}(0,\sigma)$.} \]

In other words:

  • The mean response for a given predictor value \(x\) is given by a linear formula \[ \mbox{mean response} = \beta_0 + \beta_1 x \] This can also be written as \[ \operatorname{E}(Y \mid X = x) = \beta_0 + \beta_1 x \]

  • The values of \(\varepsilon\) are independent.

  • The distribution of all responses for a given predictor value \(x\) is normal.

  • The standard deviation of the responses is the same (equal) for each predictor value.

The mnemonic LINE can help us remember these aspects of the model:

  • Linear relationship
  • Independent errors
  • Normal errors
  • Equal standard deviation
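
To make the model concrete, here is a minimal simulation sketch with made-up values (\(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 1\)): for each value of \(x\), the response is the point on the line plus a normal error.

set.seed(123)
x <- runif(100, 0, 10)                    # explanatory values
y <- 2 + 0.5 * x + rnorm(100, sd = 1)     # line plus Norm(0, 1) errors
gf_point(y ~ x)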

Checking if the model is appropriate

We should check that each of the elements in LINE seems appropriate for our data set. Usually this is done with diagnostic plots.

You will see that residuals show up in every one of these. That’s because the residuals tell us how the data differ from the line, and the model tells us how they should differ from the line. So looking at residuals can help us see if the four components of our model are reasonable.

We can get the residuals from our model using resid(), but we will also see that some of this is automated even further with special plotting functions.

Linear Relationship

The obvious plot is a scatter plot of the response vs the explanatory variable.

Eyes <- 
  read.table(
    "https://rpruim.github.io/Engineering-Statistics/data/PalpebralFissure.txt", 
    header = TRUE)
gf_point(OSA ~ palpebral, data = Eyes)

These plots tend to fall along a diagonal, which means there is a lot of wasted space – space we could use to magnify our view. Plotting residuals vs the explanatory variable or the fitted values does this. For a simple linear model, the two types of plots are nearly identical and provide the same information.

osa.model <- lm(OSA ~ palpebral, data = Eyes)
osa.hat <- makeFun(osa.model)  
predicted <- osa.hat(Eyes$palpebral)   # we could also use fitted(osa.model) here
gf_point( resid(osa.model) ~ palpebral, data = Eyes,
  title ="Residuals versus explanatory values") |
gf_point( resid(osa.model) ~ predicted, data = Eyes,
  title ="Residuals versus fitted values")

Independence of errors

An auto-correlation (acf) plot shows the lagged correlation coefficients. What does that mean? We compute the correlation coefficient between the residuals and a shifted (lagged) version of the residuals. That is, we compare the first to the second, the second to the third, etc. for a lag of 1; or the first to the third, the second to the fourth, etc. for a lag of 2. Ideally, all of these correlation coefficients should be pretty small. (Except for lag 0, which will always have a correlation coefficient of 1 because we are comparing the residuals to themselves.)

acf()1 provides some guidelines to help you judge whether things are small enough. If most or all of the correlations are within the guidelines (or nearly so), that’s good. If the correlations start large and gradually decrease, that’s a common bad situation suggesting that our measurements are correlated and that a different model should be used.

Things look good for this data set:

acf(resid(osa.model), ylab = "Residual ACF", main = "")

Here are the details for the first 2 lags.

tibble(
  resid = resid(osa.model) %>% round(3),
  lag1 = lag(resid, 1),
  lag2 = lag(resid, 2)
) %>% reactable::reactable()
gf_point(resid(osa.model) ~ lag(resid(osa.model))) |
gf_point(resid(osa.model) ~ lag(resid(osa.model), 2))
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).

cor(resid(osa.model) ~ lag(resid(osa.model)), use = "complete.obs")
## [1] -0.27
cor(resid(osa.model) ~ lag(resid(osa.model),2), use = "complete.obs")
## [1] 0.195

Normality of errors

A normal quantile plot is just the thing for this.

gf_qq( ~ resid(osa.model), data = Eyes, title = "Normal QQ plot") 
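
If your version of ggformula includes gf_qqline(), adding a reference line can make departures from normality easier to spot (a sketch; gf_qqline() is assumed to be available here).

gf_qq( ~ resid(osa.model), title = "Normal QQ plot with reference line") %>%
  gf_qqline( ~ resid(osa.model))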

Equal Variance

Our residuals vs fitted values or residuals vs. explanatory plots can also give us a sense for whether the standard deviation is consistent across the data.

  • We are hoping to see that the residuals have the same amount of scatter above and below 0 across the plot.

  • Note: If you have more data in some places than in other places, you may see more spread where there is more data, so keep that in mind.

  • Any pattern in these plots is something that should be investigated. The model says these should be consistent noise.
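
One simple way (a sketch) to look for changing spread is to plot the absolute residuals against the fitted values; a clear trend there suggests the standard deviation is not constant.

gf_point(abs(resid(osa.model)) ~ fitted(osa.model)) %>%
  gf_labs(x = "fitted values", y = "|residual|")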

Automatic plotting

These plots are made so often that there are functions to automate some of them. Of the several plots plot() and mplot() can make, the first two are most useful for us.

mplot(osa.model, which = 1:2)
## [[1]]
## `geom_smooth()` using formula 'y ~ x'

## 
## [[2]]

How good are our estimates?

If we pass the checks above, we can ask how good the estimates from our model are. Note, we are estimating several things here:

  1. Estimated slope

  2. Estimated intercept (usually the least interesting)

  3. Estimated fits/predictions

For each we would like to report a confidence interval or uncertainty.2

Coefficient estimates

The slope and intercept are sometimes called the coefficients of the linear regression model. We can get more information about these estimates using one of the functions below.

summary(osa.model)
## 
## Call:
## lm(formula = OSA ~ palpebral, data = Eyes)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.609 -0.199 -0.019  0.217  0.664 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.398      0.168   -2.37    0.025 *  
## palpebral      3.080      0.151   20.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.308 on 28 degrees of freedom
## Multiple R-squared:  0.937,  Adjusted R-squared:  0.935 
## F-statistic:  418 on 1 and 28 DF,  p-value: <2e-16
msummary(osa.model)        # slightly terser
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.398      0.168   -2.37    0.025 *  
## palpebral      3.080      0.151   20.45   <2e-16 ***
## 
## Residual standard error: 0.308 on 28 degrees of freedom
## Multiple R-squared:  0.937,  Adjusted R-squared:  0.935 
## F-statistic:  418 on 1 and 28 DF,  p-value: <2e-16
coef(summary(osa.model))   # just the coefficients part
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.398      0.168   -2.37 2.51e-02
## palpebral      3.080      0.151   20.45 2.25e-18

Notice the column labeled “Std. Error” – that’s our standard error (= standard uncertainty).

In the more verbose output, you will also see the degrees of freedom (\(n - 2\) in this situation), which we can use to create confidence intervals using our usual template:

\[ \mbox{estimate} \pm \mbox{critical value} \cdot SE \]

3.080 + c(-1, 1) * qt(0.975, df = 28) * 0.151
## [1] 2.77 3.39

R can automate all of this for us:

confint(osa.model)
##              2.5 %  97.5 %
## (Intercept) -0.742 -0.0536
## palpebral    2.772  3.3884

Estimated response

When estimating a response, we need to determine whether we are interested in estimating

  • the average response for a given explanatory value [confidence interval]
  • an individual response for a given explanatory value [prediction interval]

Also, the standard error formula depends on the particular explanatory value we are interested in. Fortunately, R can do all the required computations for either a confidence or a prediction interval.

estimated.osa <- makeFun(osa.model) 
estimated.osa(1.2)
##   1 
## 3.3
estimated.osa(1.2, interval = "confidence")
##   fit  lwr  upr
## 1 3.3 3.17 3.42
estimated.osa(1.2, interval = "prediction")
##   fit  lwr  upr
## 1 3.3 2.66 3.94

We can visualize these intervals using gf_lm().

gf_point(OSA ~ palpebral, data = Eyes) %>%
  gf_lm(interval = "confidence", fill = "red") %>%
  gf_lm(interval = "prediction", fill = "skyblue")

Notice that

  • Prediction intervals are much wider than confidence intervals since it is harder to predict an individual response than the average over many responses.
  • Both intervals are narrowest near the center of the data and get wider as we move toward the edges of the data (or beyond, but be careful about extrapolation).

An estimate for \(\sigma\)

The R output also shows the estimated value for \(\sigma\). It is labeled “Residual standard error”. It is computed using

\[\hat \sigma = \sqrt{ \frac{\sum{e_i^2}}{n-2} }\] where \(e_i\) is the \(i\)th residual and \(n-2\) is our degrees of freedom. The value of \(\hat \sigma\) is involved in all of the standard error formulas. It is also a measure of how much observations vary above and below the regression line.
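
We can recompute this value by hand from the residuals (a quick sketch), or extract it directly with sigma():

sqrt(sum(resid(osa.model)^2) / df.residual(osa.model))   # by hand
sigma(osa.model)                                         # extracted directly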

Cautionary Notes

Don’t fit a line if a line doesn’t fit!

R is happy to fit a linear model to any data set, even if the relationship is nothing like a line. The linear model finds the “best” line assuming that a line is a good description of the relationship. Always check for linearity before using a linear model.

Watch for outliers

One or a few values that are very different from the rest can have a big influence on a linear model, so always keep an eye out for them.

Don’t Extrapolate!

While it often makes sense to generate model predictions corresponding to x-values within the range of values measured in the dataset, it is dangerous to extrapolate and make predictions for values outside the range included in the dataset. To assume that the linear relationship observed in the dataset holds for explanatory variable values outside the observed range, we would need a convincing, valid justification, which is usually not available. If we extrapolate anyway, we risk generating erroneous or even nonsense predictions. The problem generally gets worse as we stray further from the observed range of explanatory-variable values.
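
For example (a sketch; the value 3 is simply assumed to be well outside the observed palpebral widths – check the range first), R will happily compute a prediction far beyond the data, but we shouldn’t trust it.

range(Eyes$palpebral)                        # observed range of the explanatory variable
estimated.osa(3, interval = "prediction")    # computed, but not to be trusted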

Some Practice

1. Using the KidsFeet data set, fit a linear model for predicting foot width from foot length.

  1. Run our diagnostics to see if the model is appropriate.
  2. Give a 95% confidence interval for the slope. Compute this two times, once using the summary output and once using confint().
  3. Interpret the slope. What does the slope tell you about kids’ feet?
  4. Why isn’t the intercept particularly interesting for this example? [What does the intercept tell you?]
  5. Give a 95% confidence interval for the mean width of a foot among kids who have a foot length of 25 cm.
  6. Trey has a foot length of 25cm. Give a 95% prediction interval for the width of his foot.

2. The anscombe data set contains four pairs of explanatory (x1, x2, x3, and x4) and response (y1, y2, y3, and y4) variables. These data were constructed by Anscombe.

  1. For each of the four pairs, use lm() to fit a linear model and compare the results. Use, for example,
model1 <- lm(y1 ~ x1, data = anscombe) 
summary(model1)
  2. Briefly describe what you notice looking at the summary output for the four models.

  3. For each model, create a scatterplot that includes the regression line.

  4. Comment on the summaries and the plots. Why do you think Anscombe invented these data?


  1. Note: acf() uses a different plotting scheme from ggformula and sometimes you will need to adjust the size of your figure to make it look good.↩︎

  2. Note that these are essentially the same thing, just scaled – at least when the distributions are normal or t. In that case, standard uncertainty is basically the margin of error for a 68% confidence interval.↩︎