Each time you start a section (including the first one) on this sheet, send a Teams message to Professor Pruim.
Let's use the KidsFeet data set to predict foot width from foot length.
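If you need to set things up first, here is a minimal sketch; KidsFeet ships with the mosaicData package, which is loaded along with mosaic:
library(mosaic)  # also provides gf_point(), gf_lm(), makeFun(), etc.
head(KidsFeet)   # a quick peek at the variables, including length and width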
Create a scatter plot of the two variables. Which should go on the y-axis? Why?
Use %>% gf_lm()
to add a regression line to your scatter plot.
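For example, a minimal sketch (assuming mosaic and its ggformula plotting functions are loaded):
gf_point(width ~ length, data = KidsFeet) %>%
  gf_lm()  # overlay the least squares line on the scatter plot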
Using the lm()
function, fit a linear model to predict foot width from foot length.
What are the slope and intercept of the regression line?
Write down an equation for predicting width from length by filling in the blanks
\[ \hat{\mbox{width}} = \ \rule{0.5in}{0.3pt} \ + \ \rule{0.5in}{0.3pt}\ \cdot \mbox{length} \]
What does the hat on top of width mean in this equation?
Use your equation to predict the width of a foot that is 24 cm long.
Use your equation to predict the width of a foot that is 26 cm long.
What is the difference in these two predictions? Why?
Now compute the residuals for Julie and Caroline. (Remember: residual = observed - predicted.)
(You might find the following useful: KidsFeet %>% View(). You can get the same thing by clicking the little icon that looks like a spreadsheet in the Environment pane.)
If we save the result of lm()
, we can do some extra things.
Save your model as KF_model
KF_model <- lm(width ~ length, data = KidsFeet)
Compute the predicted values for every kid in the dataset using fitted(KF_model)
. What order are the results in? Find the fitted values for Julie and Caroline.
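One way to pull out just those two fitted values, assuming (as the solutions below confirm) that Julie and Caroline are observations 15 and 17:
fitted(KF_model)[c(15, 17)]  # fitted widths for Julie (row 15) and Caroline (row 17)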
Compute the residuals for every kid in the dataset using resid(KF_model)
. Find the residuals for Julie and Caroline.
Compute the mean of the residuals using mean(resid(KF_model))
. What interesting value do you get?1
Residual plots are scatter plots with the residuals on one axis. Create a scatter plot of residuals vs length (the predictor) using the formula resid(KF_model) ~ length
.
Create a scatter plot of residuals vs fits using the formula resid(KF_model) ~ fitted(KF_model). How does this scatter plot compare to the one you just made?
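A minimal sketch of both plots, using gf_point() with the formulas above:
gf_point(resid(KF_model) ~ length, data = KidsFeet)           # residuals vs predictor
gf_point(resid(KF_model) ~ fitted(KF_model), data = KidsFeet) # residuals vs fitted values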
The intercept and slope of a regression line are called the coefficients. Give coef(KF_model)
a try and see what you get.
What foot width does the linear model predict for a foot that is 22 cm long? (Note: none of the kids has a foot that long, so you won’t be able to use fitted()
.)
Here is a way to get R to compute the previous value.
predicted_width <- makeFun(KF_model)
predicted_width(length = 22)
## 1
## 8.317127
What foot width does this linear model predict if the length is 15 cm? Why is this prediction less reliable than the previous prediction?
Below are four scatter plots, four residuals vs predictor plots, and four residuals vs fits plots for the same four data sets. Identify which are which and match them up.
Why might residual plots be better for showing how well a line fits the data than the original scatter plot?
Compare plots A–D with plots W–Z. How do those two types of residual plots compare?2
Often you will see regression lines summarized with a table like the one produced by summary(KF_model). Give it a try. There will be more information there than we have learned about, but you should be able to locate the intercept and slope in such a summary.3
One of the things listed in that summary is labeled R-squared. That's the square of the correlation coefficient \(R\). Use that information to compute \(R\) for our model.
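Here is one way to do that computation in R; this sketch assumes \(R\) is the positive square root of \(R^2\), which holds here because the slope is positive:
sqrt(summary(KF_model)$r.squared)  # R; its sign matches the sign of the slope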
Compute the variance (standard deviation squared) for the response variable width. (You can use sd() and square it, or you can use var() to get the variance directly.)
Now compute the variance of the fitted values and the variance of the residuals. What relationship do you observe between the three variances? This relationship holds for all linear models.
Now compute the ratio of the variance of the fitted values to the variance of the response (width
).
You should see that this is exactly \(R^2\). That is,
\[ R^2 = \frac{s^2_{\mathrm{fitted}}}{s^2_{y}} = 1- \frac{s^2_{\mathrm{resid}}}{s^2_{y}} \]
We can interpret \(R^2\) as follows: it is the fraction of the variation in the response that is explained by or accounted for by the linear model. The rest, \(1 - R^2\), is not explained by or accounted for by the model. If \(R^2\) is 1, then all of the variability is accounted for by the model and the points fall exactly on the regression line.
If you just want the value of \(R^2\), you can get it by (a) squaring the correlation coefficient (use cor()) or (b) using the rsquared() function from the mosaic package. Try it both ways.
- lm() – fit a linear model using least squares regression
- fitted() – compute the predicted value for each value of the predictor variable
- resid() – compute the residuals for each value of the predictor variable
- makeFun() – create a function that can make predictions for any predictor value
- cor() – correlation coefficient (\(R\))
- rsquared() – square of the correlation coefficient (sometimes called the coefficient of determination)
- summary() – summary information about a linear model
- msummary() – slightly more minimal summary information about a linear model

If you finish the things above, here is a bonus problem for you.
Suppose that out of a cohort of 120 patients with stage 1 lung cancer at the Dana-Farber Cancer Institute (DFCI) treated with a new surgical approach, 80 of the patients survive at least 5 years. Suppose also that National Cancer Institute statistics indicate that the 5-year survival probability for stage 1 lung cancer patients nationally is 0.60. Do the data collected from the 120 patients support the claim that the DFCI population treated with this new form of surgery has a different 5-year survival probability than the national population?
Why is this situation more like the Lady Tasting Tea than the malaria vaccine trial?
There is one difference between this and the Lady Tasting Tea situation, however. What is it?
Express the null and alternative hypotheses for this situation, both in words and in mathematical notation.
See if you can figure out how to generate a null distribution for this situation.
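Here is one possible sketch using rflip() and do() from the mosaic package; the object name Null_dist and the choice of 1000 repetitions are our own, not part of the problem:
set.seed(123)                                    # make the simulation reproducible
Null_dist <- do(1000) * rflip(120, prob = 0.60)  # 120 patients, null survival probability 0.60
gf_histogram(~ heads, data = Null_dist)          # null distribution of 5-year survivor counts
# two-sided p-value: how often is the count at least as far from the
# expected 72 (= 120 * 0.60) as the observed 80 is?
prop(~ abs(heads - 72) >= 8, data = Null_dist)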
Let's use the KidsFeet data set to predict foot width from foot length.
Create a scatter plot of the two variables. Which should go on the y-axis? Why?
Explanatory on the x-axis, response on the y-axis.
Use %>% gf_lm()
to add a regression line to your scatter plot.
Using the lm()
function, fit a linear model to predict foot width from foot length.
What are the slope and intercept of the regression line?
Write down an equation for predicting width from length by filling in the blanks
\[ \hat{\mbox{width}} = \ \rule{0.5in}{0.3pt} \ + \ \rule{0.5in}{0.3pt}\ \cdot \mbox{length} \]
What does the hat on top of width mean in this equation?
Use your equation to predict the width of a foot that is 24 cm long.
Use your equation to predict the width of a foot that is 26 cm long.
What is the difference in these two predictions? Why?
Now compute the residuals for Julie and Caroline. (Remember: residual = observed - predicted.)
(You might find the following useful: KidsFeet %>% View(). You can get the same thing by clicking the little icon that looks like a spreadsheet in the Environment pane.)
So the equation is \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot \mbox{length}\).
If foot length is 24 cm: \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot 24 = 8.788\).
If foot length is 26 cm: \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot 26 = 9.282\).
The difference is \(2 \cdot 0.247\), twice the slope. If you change the predictor by 2, the predicted response changes by 2 times the slope.
Julie and Caroline are observations 15 and 17. You can just scan for them in the data. Here is a way to get just those two rows. (pander() does fancier printing.)
KidsFeet %>% filter(name %in% c('Julie', 'Caroline')) %>% pander()
| name | birthmonth | birthyear | length | width | sex | biggerfoot | domhand |
|------|------------|-----------|--------|-------|-----|------------|---------|
| Julie | 11 | 87 | 26 | 9.3 | G | L | R |
| Caroline | 12 | 87 | 24 | 8.7 | G | R | L |
So, using the rounded coefficients, Julie's residual is \(9.3 - (2.86 + 0.247 \cdot 26) = 0.018\) and Caroline's is \(8.7 - (2.86 + 0.247 \cdot 24) = -0.088\). (The full-precision coefficients give \(-0.009\) and \(-0.113\), matching the output of resid(KF_model) below; the small differences come from rounding the coefficients.)
If we save the result of lm()
, we can do some extra things.
Save your model as KF_model
KF_model <- lm(width ~ length, data = KidsFeet)
Compute the predicted values for every kid in the dataset using fitted(KF_model)
. What order are the results in? Find the fitted values for Julie and Caroline.
Compute the residuals for every kid in the dataset using resid(KF_model)
. Find the residuals for Julie and Caroline.
Compute the mean of the residuals using mean(resid(KF_model))
. What interesting value do you get?4
Residual plots are scatter plots with the residuals on one axis. Create a scatter plot of residuals vs length (the predictor) using the formula resid(KF_model) ~ length
.
Create a scatter plot of residuals vs fits using the formula resid(KF_model) ~ fitted(KF_model). How does this scatter plot compare to the one you just made?
The intercept and slope of a regression line are called the coefficients. Give coef(KF_model)
a try and see what you get.
What foot width does the linear model predict for a foot that is 22 cm long? (Note: none of the kids has a foot that long, so you won’t be able to use fitted()
.)
Here is a way to get R to compute the previous value.
predicted_width <- makeFun(KF_model)
predicted_width(length = 22)
## 1
## 8.317127
What foot width does this linear model predict if the length is 15 cm? Why is this prediction less reliable than the previous prediction?
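The numeric answer for this one isn't shown below, but the predicted_width() function created earlier with makeFun() handles it; a minimal sketch (the value follows from the coefficients shown later):
predicted_width(length = 15)  # 2.8622761 + 0.2479478 * 15, about 6.58 cm
This is an extrapolation: 15 cm is shorter than every foot length in the data, so we have no evidence that the straight-line pattern holds for feet that small.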
KF_model <- lm(width ~ length, data = KidsFeet)
Fits and residuals:
fitted(KF_model)
## 1 2 3 4 5 6 7 8
## 8.912201 9.160149 8.936996 9.110560 9.085765 9.234534 9.333713 8.565075
## 9 10 11 12 13 14 15 16
## 8.713843 8.540280 9.680840 9.011381 9.333713 9.556866 9.308918 8.738638
## 17 18 19 20 21 22 23 24
## 8.813022 8.986586 9.482481 9.184944 8.813022 8.912201 8.813022 8.936996
## 25 26 27 28 29 30 31 32
## 8.862612 9.581660 9.333713 9.184944 8.862612 8.788228 8.813022 8.441101
## 33 34 35 36 37 38 39
## 8.936996 8.713843 8.986586 8.540280 9.308918 8.217948 8.961791
resid(KF_model)
## 1 2 3 4 5 6
## -0.51220149 -0.36014925 0.76300373 0.68944030 -0.18576493 0.46546642
## 7 8 9 10 11 12
## 0.26628731 0.23492537 0.58615672 0.25972015 0.11916045 -0.11138060
## 13 14 15 16 17 18
## -0.23371269 0.24313433 -0.00891791 -0.83863806 -0.11302239 -0.18658582
## 19 20 21 22 23 24
## -0.48248134 0.31505597 0.38697761 -0.31220149 -0.51302239 0.06300373
## 25 26 27 28 29 30
## -0.76261194 -0.18166045 0.16628731 0.31505597 0.03738806 0.51177239
## 31 32 33 34 35 36
## 0.48697761 0.15889925 -0.33699627 0.28615672 -0.38658582 -0.04027985
## 37 38 39
## -0.30891791 -0.31794776 -0.16179104
These two plots look basically the same except for the axis labeling:
gf_point(resid(KF_model) ~ length, data = KidsFeet) %>%
gf_labs(title = "residuals vs explanatory variable")
gf_point(resid(KF_model) ~ fitted(KF_model), data = KidsFeet) %>%
gf_labs(title = "residuals vs fitted values")
Coefficients:
coef(KF_model) # coefficients (intercept and slope)
## (Intercept) length
## 2.8622761 0.2479478
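As a cross-check, the coefficients let you reproduce the earlier 22 cm prediction by hand:
coef(KF_model)[1] + coef(KF_model)[2] * 22  # 2.8622761 + 0.2479478 * 22 = 8.317...
which matches the output of predicted_width(length = 22) above.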
Below are four scatter plots, four residuals vs predictor plots, and four residuals vs fits plots for the same four data sets. Identify which are which and match them up.
Why might residual plots be better for showing how well a line fits the data than the original scatter plot?
Compare plots A–D with plots W–Z. How do those two types of residual plots compare?5
Often you will see regression lines summarized with a table like the one produced by summary(KF_model)
. Give it a try. There will be more information there than we have learned about, but you should be able to locate the intercept and slope in such a summary.6
One of the things listed in that summary is labeled R-squared. That's the square of the correlation coefficient \(R\). Use that information to compute \(R\) for our model.
Compute the variance (standard deviation squared) for the response variable width. (You can use sd() and square it, or you can use var() to get the variance directly.)
var( ~width, data = KidsFeet)
## [1] 0.2596761
Now compute the variance of the fitted values and the variance of the residuals. What relationship do you observe between the three variances? This relationship holds for all linear models.
var( ~fitted(KF_model))
## [1] 0.106728
var( ~resid(KF_model))
## [1] 0.1529482
var( ~fitted(KF_model)) + var( ~resid(KF_model))
## [1] 0.2596761
var( ~width, data = KidsFeet)
## [1] 0.2596761
Now compute the ratio of the variance of the fitted values to the variance of the response (width
).
You should see that this is exactly \(R^2\). That is,
\[ R^2 = \frac{s^2_{\mathrm{fitted}}}{s^2_{y}} = 1- \frac{s^2_{\mathrm{resid}}}{s^2_{y}} \]
We can interpret \(R^2\) as follows: it is the fraction of the variation in the response that is explained by or accounted for by the linear model. The rest, \(1 - R^2\), is not explained by or accounted for by the model. If \(R^2\) is 1, then all of the variability is accounted for by the model and the points fall exactly on the regression line.
var( ~fitted(KF_model)) / var( ~width, data = KidsFeet)
## [1] 0.4110041
cor(width ~ length, data = KidsFeet)^2
## [1] 0.4110041
rsquared(KF_model)
## [1] 0.4110041
If you just want the value of \(R^2\), you can get it by (a) squaring the correlation coefficient (use cor()) or (b) using the rsquared() function from the mosaic package. Try it both ways.
You should see that the value is essentially 0. This is true for every linear model: the average residual is always 0.↩︎
If you are wondering why we have both types, the main reason is that the two types are more different (and reveal different information about the model) in models with multiple predictors.↩︎
If you want a slightly more minimal summary, you can try msummary() instead of summary().↩︎
You should see that the value is essentially 0. This is true for every linear model: the average residual is always 0.↩︎
If you are wondering why we have both types, the main reason is that the two types are more different (and reveal different information about the model) in models with multiple predictors.↩︎
If you want a slightly more minimal summary, you can try msummary() instead of summary().↩︎