Fit lines to data using least squares. (The best line has the smallest SSE.)
The regression line has some nice properties: slope = \(r \frac{s_y}{s_x}\), and \((\overline x, \overline y)\) is on the line (see the sketch below).
lm() will do all the work of estimating the slope and intercept for us (and more).
It is important to use the correct variable as explanatory and response.
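To see these properties in action, here is a quick sketch (using the built-in mtcars data, chosen purely for illustration) verifying that the least-squares slope equals \(r \frac{s_y}{s_x}\) and that \((\overline x, \overline y)\) lies on the line.

model <- lm(mpg ~ wt, data = mtcars)
coef(model)[2]                                   # slope reported by lm()
with(mtcars, cor(mpg, wt) * sd(mpg) / sd(wt))    # r * s_y / s_x -- same value
# the point (mean(x), mean(y)) is on the line:
predict(model, newdata = data.frame(wt = mean(mtcars$wt)))
mean(mtcars$mpg)                                 # matches the prediction above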
Linear models (aka linear regression) and their generalizations explore how to use data to determine the relationship among two or more variables when this relationship is not known in advance. The general framework we will use is
\[ Y = f(x_1, x_2, \dots, x_k) + \varepsilon \]
\(Y\) is the response variable that we are trying to estimate from \(k\) explanatory or predictor variables \(x_1, x_2, \dots, x_k\).
The relationship between the explanatory variables and the response variables is described by a function \(f\).
The relationship described by \(f\) need not be a perfect fit. The error term in the model, \(\varepsilon\), describes how individual responses differ from the value given by \(f\): \(\varepsilon_i = Y_i - f(X_{1i}, X_{2i}, \dots, X_{ki})\)
We will model \(\varepsilon\) with a distribution – typically a distribution with a mean of 0 – so another way to think about this model is that for given values of the predictors, the values of \(Y\) have a distribution. The mean of this distribution is specified by \(f\) and the shape by \(\varepsilon\). The simplest version of this framework is the simple linear regression model:
\[ Y = \beta_0 + \beta_1 x + \varepsilon \qquad \mbox{where $\varepsilon \sim {\sf Norm}(0,\sigma)$.} \]
In other words:
The mean response for a given predictor value \(x\) is given by a linear formula \[ \mbox{mean response} = \beta_0 + \beta_1 x \] This can also be written as \[ \operatorname{E}(Y \mid X = x) = \beta_0 + \beta_1 x \]
The values of \(\varepsilon\) are independent.
The distribution of all responses for a given predictor value \(x\) is normal.
The standard deviation of the responses is the same (equal) for each predictor value.
The mnemonic LINE (Linear, Independent, Normal, Equal variance) can help us remember these aspects of the model.
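To make this concrete, here is a minimal simulation sketch (all parameter values chosen arbitrarily) that generates data satisfying LINE: each response is a linear mean plus independent normal noise with constant standard deviation.

set.seed(123)                              # arbitrary seed, for reproducibility
beta_0 <- 2; beta_1 <- 0.5; sigma <- 1     # hypothetical parameter values
x <- runif(100, 0, 10)
y <- beta_0 + beta_1 * x + rnorm(100, mean = 0, sd = sigma)
gf_point(y ~ x)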
We should check that each of the elements in LINE seems appropriate for our data set. Usually this is done with diagnostic plots.
You will see that residuals show up in every one of these. That’s because the residuals tell us how the data differ from the line, and the model tells us how they should differ from the line. So looking at residuals can help us see if the four components of our model are reasonable.
We can get the residuals from our model using resid(), but we will also see that some of this is automated even further with special plotting functions.
The obvious plot is a scatter plot of response vs explanatory.
Eyes <-
  read.table(
    "https://rpruim.github.io/Engineering-Statistics/data/PalpebralFissure.txt",
    header = TRUE)

gf_point(OSA ~ palpebral, data = Eyes)
These plots tend to fall along a diagonal, which means there is a lot of wasted space – space we could use to magnify our view. Plotting residuals vs the explanatory variable or the fitted values does this. For a simple linear model, the two types of plots are nearly identical and provide the same information.
osa.model <- lm(OSA ~ palpebral, data = Eyes)
osa.hat <- makeFun(osa.model)
predicted <- osa.hat(Eyes$palpebral)  # we could also use fitted(osa.model) here

gf_point(resid(osa.model) ~ palpebral, data = Eyes,
         title = "Residuals versus explanatory values") |
  gf_point(resid(osa.model) ~ predicted, data = Eyes,
           title = "Residuals versus fitted values")
An auto-correlation (acf) plot shows the lagged correlation coefficients. What does that mean? We compute the correlation coefficient when we compare the residuals to a shifted version of the residuals. That is, we compare the first to the second, the second to the third, etc. for a lag of 1; or the first to the third, the second to the fourth, etc. for a lag of 2. Ideally, all of these correlation coefficients should be pretty small. (Except for lag 0, which will always have a correlation coefficient of 1 because we are comparing the residuals to themselves.)
acf()1 provides some guidelines to help you judge whether things are small enough. If most or all of the correlations are within the guidelines (or nearly so), that's good. If the correlations start large and gradually decrease, that's a common bad situation suggesting that our measurements are correlated and that a different model should be used.
Things look good for this data set:
acf(resid(osa.model), ylab = "Residual ACF", main = "")
Here are the details for the first 2 lags.
tibble(
  resid = resid(osa.model) %>% round(3),
  lag1 = lag(resid, 1),
  lag2 = lag(resid, 2)
) %>%
  reactable::reactable()
gf_point(resid(osa.model) ~ lag(resid(osa.model))) |
gf_point(resid(osa.model) ~ lag(resid(osa.model), 2))
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
cor(resid(osa.model) ~ lag(resid(osa.model)), use = "complete.obs")
## [1] -0.27
cor(resid(osa.model) ~ lag(resid(osa.model),2), use = "complete.obs")
## [1] 0.195
A normal quantile plot is just the thing for checking the normality of the residuals.
gf_qq( ~ resid(osa.model), data = Eyes, title = "Normal QQ plot")
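If a reference line helps your eye, ggformula's gf_qqline() can be chained onto the plot:

gf_qq( ~ resid(osa.model), title = "Normal QQ plot") %>%
  gf_qqline()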
Our residuals vs fitted values or residuals vs. explanatory plots can also give us a sense of whether the standard deviation is consistent across the data.
We are hoping to see that the residuals have the same amount of scatter above and below 0 all the way across the plot.
Note: If you have more data in some places than in others, you may see more spread where there is more data, so keep that in mind.
Any pattern in these plots is something that should be investigated. The model says these should be consistent noise.
These plots are made so often that there are functions to automate some of them. Of the several plots that plot() and mplot() can make, the first two are the most useful for us.
mplot(osa.model, which = 1:2)
## [[1]]
## `geom_smooth()` using formula 'y ~ x'
##
## [[2]]
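For comparison, the base R plot() method for lm objects produces similar diagnostics (which = 1 is residuals vs fitted; which = 2 is a normal QQ plot):

plot(osa.model, which = 1:2)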
If we pass the checks above, we can ask how good the estimates from our model are. Note that we are estimating several things here:
Estimated slope
Estimated intercept (usually the least interesting)
Estimated fits/predictions
For each we would like to report a confidence interval or uncertainty.2
The slope and intercept are sometimes called the coefficients of the linear regression model. We can get more information about these estimates using one of the functions below.
summary(osa.model)
##
## Call:
## lm(formula = OSA ~ palpebral, data = Eyes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.609 -0.199 -0.019 0.217 0.664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.398 0.168 -2.37 0.025 *
## palpebral 3.080 0.151 20.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.308 on 28 degrees of freedom
## Multiple R-squared: 0.937, Adjusted R-squared: 0.935
## F-statistic: 418 on 1 and 28 DF, p-value: <2e-16
msummary(osa.model) # slightly terser
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.398 0.168 -2.37 0.025 *
## palpebral 3.080 0.151 20.45 <2e-16 ***
##
## Residual standard error: 0.308 on 28 degrees of freedom
## Multiple R-squared: 0.937, Adjusted R-squared: 0.935
## F-statistic: 418 on 1 and 28 DF, p-value: <2e-16
coef(summary(osa.model)) # just the coefficients part
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.398 0.168 -2.37 2.51e-02
## palpebral 3.080 0.151 20.45 2.25e-18
Notice the column labeled “Std. Error” – that’s our standard error (= standard uncertainty).
In the more verbose output, you will also see the degrees of freedom (\(n - 2\) in this situation), which we can use to create confidence intervals using our usual template:
\[ \mbox{estimate} \pm \mbox{critical value} \cdot SE \]
3.080 + c(-1, 1) * qt(0.975, df = 28) * 0.151
## [1] 2.77 3.39
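Rather than retyping rounded numbers, we can pull the estimate and standard error straight out of the coefficient table; this sketch reproduces the interval above:

est <- coef(summary(osa.model))["palpebral", "Estimate"]
se  <- coef(summary(osa.model))["palpebral", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(osa.model)) * se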
R can automate all of this for us:
confint(osa.model)
## 2.5 % 97.5 %
## (Intercept) -0.742 -0.0536
## palpebral 2.772 3.3884
When estimating a response, we need to determine whether we are interested in estimating the mean response for a given explanatory value (a confidence interval) or an individual response (a prediction interval).
Also, the standard error formula depends on the particular explanatory value we are interested in. Fortunately, R can do all the required computations for either a confidence or a prediction interval.
estimated.osa <- makeFun(osa.model)
estimated.osa(1.2)
## 1
## 3.3
estimated.osa(1.2, interval = "confidence")
## fit lwr upr
## 1 3.3 3.17 3.42
estimated.osa(1.2, interval = "prediction")
## fit lwr upr
## 1 3.3 2.66 3.94
We can visualize these intervals using gf_lm():

gf_point(OSA ~ palpebral, data = Eyes) %>%
  gf_lm(interval = "confidence", fill = "red") %>%
  gf_lm(interval = "prediction", fill = "skyblue")
Notice that the prediction band is much wider than the confidence band, and that both bands are narrowest near the middle of the data.
The R output also shows the estimated value for \(\sigma\). It is labeled “Residual standard error”. It is computed using
\[\hat \sigma = \sqrt{ \frac{\sum e_i^2}{n-2} }\] where \(e_i\) is the \(i\)th residual and \(n-2\) is our degrees of freedom. The value of \(\hat \sigma\) is involved in all of the standard error formulas. It is also a measure of how much observations vary above and below the regression line.
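As a quick check, we can compute \(\hat \sigma\) directly from the residuals and compare it with R's built-in extractor:

sqrt(sum(resid(osa.model)^2) / df.residual(osa.model))   # by hand
sigma(osa.model)                                         # same value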
R is happy to fit a linear model to any data set, even if the relationship is nothing like a line. The linear model finds the “best” line assuming that a line is a good description of the relationship. Always check for linearity before using a linear model.
One or a few values that are very different from the rest can have a big effect on a linear model, so always keep an eye out for them.
While it often makes sense to generate model predictions corresponding to x-values within the range of values measured in the dataset, it is dangerous to extrapolate and make predictions for values outside the range included in the dataset. To assume that the linear relationship observed in the dataset holds for explanatory variable values outside the observed range, we would need a convincing, valid justification, which is usually not available. If we extrapolate anyway, we risk generating erroneous or even nonsense predictions. The problem generally gets worse as we stray further from the observed range of explanatory-variable values.
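For instance, nothing stops us from asking the model about a palpebral width far outside the observed range (the value below is hypothetical); R will happily answer, but the answer deserves no trust:

estimated.osa(3, interval = "prediction")   # extrapolation -- interpret with caution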
1. Using the KidsFeet data set, fit a linear model for predicting foot width from foot length. Use confint() to obtain confidence intervals for the slope and intercept.
2. The anscombe data set contains four pairs of explanatory (x1, x2, x3, and x4) and response (y1, y2, y3, and y4) variables. These data were constructed by Anscombe.
model1 <- lm(y1 ~ x1, data = anscombe)
summary(model1)
# fit model2, model3, and model4 similarly for the other three pairs
Briefly describe what you notice looking at the summary output for the four models.
For each model, create a scatterplot that includes the regression line.
Comment on the summaries and the plots. Why do you think Anscombe invented these data?
Note: acf() uses a different plotting scheme from ggformula and sometimes you will need to adjust the size of your figure to make it look good.↩︎
Note that these are essentially the same thing, just scaled – at least when the distributions are normal or t. In that case, standard uncertainty is basically the margin of error for a 68% confidence interval.↩︎