Let's use the KidsFeet
data set to predict foot width from foot length.
Create a scatter plot of the two variables. Which should go on the y-axis? Why?
The explanatory variable (length) goes on the x-axis and the response (width) goes on the y-axis, since we are using length to predict width.
Use %>% gf_lm()
to add a regression line to your scatter plot.
gf_point(width ~ length, data = KidsFeet) %>% gf_lm()
Using the lm()
function, fit a linear model to predict foot width from foot length.
What are the slope and intercept of the regression line?
Write down an equation for predicting width from length by filling in the blanks
\[ \hat{\mbox{width}} = \ \rule{0.5in}{0.3pt} \ + \ \rule{0.5in}{0.3pt}\ \cdot \mbox{length} \]
What does the hat on top of width mean in this equation?
Use your equation to predict the width of a foot that is 24 cm long.
Use your equation to predict the width of a foot that is 26 cm long.
What is the difference in these two predictions? Why?
Now compute the residuals for Julie and Caroline. (Remember: residual = observed - predicted.)
(You might find the following useful: KidsFeet %>% View()
. You can get the same thing by clicking the little icon that looks like a spreadsheet in the Environment pane.)
lm(width ~ length, data = KidsFeet)
##
## Call:
## lm(formula = width ~ length, data = KidsFeet)
##
## Coefficients:
## (Intercept) length
## 2.862 0.248
So the equation is \(\hat{\mathrm{width}} = 2.86 + 0.248 \cdot \mathrm{length}\).
If foot length is 24cm: \(\hat{\mathrm{width}} = 2.86 + 0.248 \cdot 24 = 8.812\).
If foot length is 26cm: \(\hat{\mathrm{width}} = 2.86 + 0.248 \cdot 26 = 9.308\).
The difference is \(2 \cdot 0.248\), twice the slope. If you change the predictor by 2, the predicted response changes by 2 times the slope.
Julie and Caroline are observations 15 and 17. You can just scan for them in the data. Here is a way to get just those two rows. (pander()
does fancier printing.)
KidsFeet %>%
filter(name %in% c('Julie', 'Caroline')) %>%
pander()
| name     | birthmonth | birthyear | length | width | sex | biggerfoot | domhand |
|----------|------------|-----------|--------|-------|-----|------------|---------|
| Julie    | 11         | 87        | 26     | 9.3   | G   | L          | R       |
| Caroline | 12         | 87        | 24     | 8.7   | G   | R          | L       |
So the residuals are \(9.3 - 9.308 = -0.008\) for Julie and \(8.7 - 8.812 = -0.112\) for Caroline.
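If you would rather let R do this arithmetic, here is a sketch using the rounded coefficients from the output above (the residuals computed later from the full-precision coefficients differ slightly in the third decimal place):

predicted <- 2.862 + 0.248 * c(26, 24)   # predicted widths for Julie (26 cm) and Caroline (24 cm)
c(9.3, 8.7) - predicted                  # residuals = observed - predicted, about -0.01 and -0.11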
If we save the result of lm()
, we can do some extra things.
Save your model as KF_model
KF_model <- lm(width ~ length, data = KidsFeet)
Compute the predicted values for every kid in the dataset using fitted(KF_model)
. What order are the results in? Find the fitted values for Julie and Caroline.
Compute the residuals for every kid in the dataset using resid(KF_model)
. Find the residuals for Julie and Caroline.
Compute the mean of the residuals using mean(resid(KF_model))
. What interesting value do you get?1
mean(resid(KF_model))
## [1] -1.388e-17
Residual plots are scatter plots with the residuals on one axis. Create a scatter plot of residuals vs length (the predictor) using the formula resid(KF_model) ~ length
.
Create a scatter plot of residuals vs fits using the formula resid(KF_model) ~ fitted(KF_model)
. How does this scatter plot compare to the one you just made?
The intercept and slope of a regression line are called the coefficients. Give coef(KF_model)
a try and see what you get.
What foot width does the linear model predict for a foot that is 22 cm long? (Note: none of the kids has a foot that long, so you won’t be able to use fitted()
.)
Here is a way to get R to compute the previous value.
predicted_width <- makeFun(KF_model)
predicted_width(length = 22)
## 1
## 8.317
What foot width does this linear model predict if the length is 15 cm? Why is this prediction less reliable than the previous prediction?
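Along the same lines, here is a sketch of the 15 cm prediction using the predicted_width function defined above. Since 15 cm is well below the shortest foot length in the data, this is an extrapolation.

predicted_width(length = 15)   # roughly 2.862 + 0.248 * 15 = 6.58 cm, well below any observed width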
KF_model <-
lm(width ~ length, data = KidsFeet)
Fits and residuals are in the same order as the original data2, so we want items 15 and 17:
fitted(KF_model)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 8.912 9.160 8.937 9.111 9.086 9.235 9.334 8.565 8.714 8.540 9.681 9.011 9.334
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 9.557 9.309 8.739 8.813 8.987 9.482 9.185 8.813 8.912 8.813 8.937 8.863 9.582
## 27 28 29 30 31 32 33 34 35 36 37 38 39
## 9.334 9.185 8.863 8.788 8.813 8.441 8.937 8.714 8.987 8.540 9.309 8.218 8.962
resid(KF_model)
## 1 2 3 4 5 6 7 8
## -0.512201 -0.360149 0.763004 0.689440 -0.185765 0.465466 0.266287 0.234925
## 9 10 11 12 13 14 15 16
## 0.586157 0.259720 0.119160 -0.111381 -0.233713 0.243134 -0.008918 -0.838638
## 17 18 19 20 21 22 23 24
## -0.113022 -0.186586 -0.482481 0.315056 0.386978 -0.312201 -0.513022 0.063004
## 25 26 27 28 29 30 31 32
## -0.762612 -0.181660 0.166287 0.315056 0.037388 0.511772 0.486978 0.158899
## 33 34 35 36 37 38 39
## -0.336996 0.286157 -0.386586 -0.040280 -0.308918 -0.317948 -0.161791
fitted(KF_model)[15]
## 15
## 9.309
fitted(KF_model)[17]
## 17
## 8.813
resid(KF_model)[15]
## 15
## -0.008918
resid(KF_model)[17]
## 17
## -0.113
These two plots look basically the same except for the axis labeling:
gf_point(resid(KF_model) ~ length, data = KidsFeet) %>%
gf_labs(title = "residuals vs explanatory variable")
gf_point(resid(KF_model) ~ fitted(KF_model), data = KidsFeet) %>%
gf_labs(title = "residuals vs fitted values")
Coefficients:
coef(KF_model) # coefficients (intercept and slope)
## (Intercept) length
## 2.8623 0.2479
Below are four scatter plots, four residuals vs predictor plots, and four residuals vs fits plots for the same four data sets. Identify which are which and match them up.
Here is the matching
Why might residual plots be better for showing how well a line fits the data than the original scatter plot?
They use the space better, essentially magnifying things so we can see them better.
Compare plots A–D with plots W–Z. How do those two types of residual plots compare?3
Notice that A and Y, and also C and Z, are mirror images of each other. That happens when the slope is negative. Otherwise the plots are identical (in pairs) except for the axis labeling.
Often you will see regression lines summarized with a table like the one produced by summary(KF_model)
. Give it a try. There will be more information there than we have learned about, but you should be able to locate the intercept and slope in such a summary.4
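As a pointer for reading that output, here is the call again (output not reproduced here); the comments note where to look:

summary(KF_model)
# the Estimate column of the Coefficients table holds the intercept and slope;
# "Multiple R-squared" is reported near the bottom of the output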
One of the things listed in that summary is labeled R-squared
. That’s the square of the correlation coefficient \(R\). Use that information to compute \(R\) for our model.
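A sketch of one way to back \(R\) out of \(R^2\), using the rsquared() helper from the mosaic package that appears again below:

sqrt(rsquared(KF_model))                # R = sqrt(R^2), about 0.64 for this model
cor(width ~ length, data = KidsFeet)    # the correlation coefficient directly; same value since the slope is positive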
Compute the variance (standard deviation squared) for the response variable width
. (You can use sd()
and square it, or you can use var()
to get the variance directly.)
var( ~width, data = KidsFeet)
## [1] 0.2597
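The sd()-and-square route mentioned in the question gives the same number:

sd( ~width, data = KidsFeet)^2   # squaring the standard deviation reproduces the variance, 0.2597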
Now compute the variance of the fitted values and the variance of the residuals. What relationship do you observe between the three variances? This relationship holds for all linear models.
var( ~fitted(KF_model))
## [1] 0.1067
var( ~resid(KF_model))
## [1] 0.1529
var( ~fitted(KF_model)) + var( ~resid(KF_model))
## [1] 0.2597
var( ~width, data = KidsFeet)
## [1] 0.2597
Now compute the ratio of the variance of the fitted values to the variance of the response (width
)….
var( ~fitted(KF_model)) / var( ~width, data = KidsFeet)
## [1] 0.411
If you just want the value of \(R^2\), you can get it by (a) squaring the correlation coefficient (use cor()
) or by using the rsquared()
function from the mosaic package. Try it both ways.
cor(width ~ length, data = KidsFeet)^2
## [1] 0.411
rsquared(KF_model)
## [1] 0.411
You should see that the value is essentially 0 (up to round-off error). This is true for every linear model: the average residual is always 0.↩︎
If there is missing data, those observations won’t have fits or residuals and get skipped. This is a pain. Often it is easiest to remove all the unneeded observations from the dataset before fitting the model.↩︎
If you are wondering why we have both types, the main reason is that the two types are more different (and reveal different information about the model) in models with multiple predictors.↩︎
If you want a slightly more minimal summary, you can try msummary()
instead of summary()
.↩︎