Each time you start a section (including the first one) on this sheet, send a Teams message to Professor Pruim.
Let's use the KidsFeet data set to predict foot width from foot length.
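If you need to set things up first, here is a minimal sketch; KidsFeet ships with the mosaicData package, which is loaded along with mosaic:
library(mosaic)  # also provides gf_point(), gf_lm(), makeFun(), etc.
head(KidsFeet)   # a quick peek at the variables, including length and width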
Create a scatter plot of the two variables. Which should go on the y-axis? Why?
Use %>% gf_lm()
to add a regression line to your scatter plot.
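For example, a minimal sketch (assuming mosaic and its ggformula plotting functions are loaded):
gf_point(width ~ length, data = KidsFeet) %>%
  gf_lm()  # overlay the least squares line on the scatter plot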
Using the lm()
function, fit a linear model to predict foot width from foot length.
What are the slope and intercept of the regression line?
Write down an equation for predicting width from length by filling in the blanks
\[ \hat{\mbox{width}} = \ \rule{0.5in}{0.3pt} \ + \ \rule{0.5in}{0.3pt}\ \cdot \mbox{length} \]
What does the hat on top of width mean in this equation?
Use your equation to predict the width of a foot that is 24 cm long.
Use your equation to predict the width of a foot that is 26 cm long.
What is the difference in these two predictions? Why?
Now compute the residuals for Julie and Caroline. (Remember: residual = observed - predicted.)
(You might find the following useful: KidsFeet %>% View(). You can get the same thing by clicking the little icon that looks like a spreadsheet in the Environment pane.)
If we save the result of lm()
, we can do some extra things.
Save your model as KF_model
KF_model <- lm(width ~ length, data = KidsFeet)
Compute the predicted values for every kid in the dataset using fitted(KF_model)
. What order are the results in? Find the fitted values for Julie and Caroline.
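One way to pull out just those two fitted values, assuming (as the solutions below confirm) that Julie and Caroline are observations 15 and 17:
fitted(KF_model)[c(15, 17)]  # fitted widths for Julie (row 15) and Caroline (row 17)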
Compute the residuals for every kid in the dataset using resid(KF_model)
. Find the residuals for Julie and Caroline.
Compute the mean of the residuals using mean(resid(KF_model))
. What interesting value do you get?1
Residual plots are scatter plots with the residuals on one axis. Create a scatter plot of residuals vs length (the predictor) using the formula resid(KF_model) ~ length
.
Create a scatter plot of residuals vs fits using the formula resid(KF_model) ~ fitted(KF_model). How does this scatter plot compare to the one you just made?
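A minimal sketch of both plots, using gf_point() with the formulas above:
gf_point(resid(KF_model) ~ length, data = KidsFeet)           # residuals vs predictor
gf_point(resid(KF_model) ~ fitted(KF_model), data = KidsFeet) # residuals vs fitted values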
The intercept and slope of a regression line are called the coefficients. Give coef(KF_model)
a try and see what you get.
What foot width does the linear model predict for a foot that is 22 cm long? (Note: none of the kids has a foot that long, so you won’t be able to use fitted()
.)
Here is a way to get R to compute the previous value.
predicted_width <- makeFun(KF_model)
predicted_width(length = 22)
## 1
## 8.317127
What foot width does this linear model predict if the length is 15 cm? Why is this prediction less reliable than the previous prediction?
Below are four scatter plots, four residuals vs predictor plots, and four residuals vs fits plots for the same four data sets. Identify which are which and match them up.
Why might residual plots be better for showing how well a line fits the data than the original scatter plot?
Compare plots A–D with plots W–Z. How do those two types of residual plots compare?2
Often you will see regression lines summarized with a table like the one produced by summary(KF_model). Give it a try. There will be more information there than we have learned about, but you should be able to locate the intercept and slope in such a summary.3
One of the things listed in that summary is labeled R-squared. That's the square of the correlation coefficient \(R\). Use that information to compute \(R\) for our model.
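Here is one way to do that computation in R; this sketch assumes \(R\) is the positive square root of \(R^2\), which holds here because the slope is positive:
sqrt(summary(KF_model)$r.squared)  # R; its sign matches the sign of the slope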
Compute the variance (standard deviation squared) for the response variable width. (You can use sd() and square it, or you can use var() to get the variance directly.)
Now compute the variance of the fitted values and the variance of the residuals. What relationship do you observe between the three variances? This relationship holds for all linear models.
Now compute the ratio of the variance of the fitted values to the variance of the response (width
).
You should see that this is exactly \(R^2\). That is,
\[ R^2 = \frac{s^2_{\mathrm{fitted}}}{s^2_{y}} = 1- \frac{s^2_{\mathrm{resid}}}{s^2_{y}} \]
We can interpret \(R^2\) as follows: it is the fraction of the variation in the response that is explained by or accounted for by the linear model. The rest, \(1 - R^2\), is not explained by or accounted for by the model. If \(R^2\) is 1, then all of the variability is accounted for by the model and the points fall exactly on the regression line.
If you just want the value of \(R^2\), you can get it by (a) squaring the correlation coefficient (use cor()) or (b) using the rsquared() function from the mosaic package. Try it both ways.
- lm() – fit a linear model using least squares regression
- fitted() – compute the predicted value for each value of the predictor variable
- resid() – compute the residuals for each value of the predictor variable
- makeFun() – create a function that can make predictions for any predictor value
- cor() – correlation coefficient (\(R\))
- rsquared() – square of the correlation coefficient (sometimes called the coefficient of determination)
- summary() – summary information about a linear model
- msummary() – slightly more minimal summary information about a linear model

If you finish the things above, here is a bonus problem for you.
Suppose that out of a cohort of 120 patients with stage 1 lung cancer at the Dana-Farber Cancer Institute (DFCI) treated with a new surgical approach, 80 of the patients survive at least 5 years. Suppose also that National Cancer Institute statistics indicate that the 5-year survival probability for stage 1 lung cancer patients nationally is 0.60. Do the data collected from the 120 patients support the claim that the DFCI population treated with this new form of surgery has a different 5-year survival probability than the national population?
Why is this situation more like the Lady Tasting Tea than the malaria vaccine trial?
There is one difference between this and the Lady Tasting Tea situation, however. What is it?
Express the null and alternative hypotheses for this situation, both in words and in mathematical notation.
See if you can figure out how to generate a null distribution for this situation.
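Here is one possible sketch using rflip() and do() from the mosaic package; the object name Null_dist and the choice of 1000 repetitions are our own, not part of the problem:
set.seed(123)                                    # make the simulation reproducible
Null_dist <- do(1000) * rflip(120, prob = 0.60)  # 120 patients, null survival probability 0.60
gf_histogram(~ heads, data = Null_dist)          # null distribution of 5-year survivor counts
# two-sided p-value: how often is the count at least as far from the
# expected 72 (= 120 * 0.60) as the observed 80 is?
prop(~ abs(heads - 72) >= 8, data = Null_dist)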
Let's use the KidsFeet data set to predict foot width from foot length.
Create a scatter plot of the two variables. Which should go on the y-axis? Why?
Explanatory on the x-axis, response on the y-axis.
Use %>% gf_lm()
to add a regression line to your scatter plot.
Using the lm()
function, fit a linear model to predict foot width from foot length.
What are the slope and intercept of the regression line?
Write down an equation for predicting width from length by filling in the blanks
\[ \hat{\mbox{width}} = \ \rule{0.5in}{0.3pt} \ + \ \rule{0.5in}{0.3pt}\ \cdot \mbox{length} \]
What does the hat on top of width mean in this equation?
Use your equation to predict the width of a foot that is 24 cm long.
Use your equation to predict the width of a foot that is 26 cm long.
What is the difference in these two predictions? Why?
Now compute the residuals for Julie and Caroline. (Remember: residual = observed - predicted.)
(You might find the following useful: KidsFeet %>% View(). You can get the same thing by clicking the little icon that looks like a spreadsheet in the Environment pane.)
So the equation is \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot \mbox{length}\).
If foot length is 24 cm: \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot 24 = 8.788\).
If foot length is 26 cm: \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot 26 = 9.282\).
The difference is \(2 \cdot 0.247\), twice the slope. If you change the predictor by 2, the predicted response changes by 2 times the slope.
Julie and Caroline are observations 15 and 17. You can just scan for them in the data. Here is a way to get just those two rows. (pander() does fancier printing.)
KidsFeet %>% filter(name %in% c('Julie', 'Caroline')) %>% pander()
| name | birthmonth | birthyear | length | width | sex | biggerfoot | domhand |
|------|------------|-----------|--------|-------|-----|------------|---------|
| Julie | 11 | 87 | 26 | 9.3 | G | L | R |
| Caroline | 12 | 87 | 24 | 8.7 | G | R | L |
So, using the rounded coefficients, Julie's residual is \(9.3 - (2.86 + 0.247 \cdot 26) = 0.018\) and Caroline's is \(8.7 - (2.86 + 0.247 \cdot 24) = -0.088\). (The full-precision coefficients give \(-0.009\) and \(-0.113\), matching the output of resid(KF_model) below; the small differences come from rounding the coefficients.)
If we save the result of lm()
, we can do some extra things.
Save your model as KF_model
KF_model <- lm(width ~ length, data = KidsFeet)
Compute the predicted values for every kid in the dataset using fitted(KF_model)
. What order are the results in? Find the fitted values for Julie and Caroline.
Compute the residuals for every kid in the dataset using resid(KF_model)
. Find the residuals for Julie and Caroline.
Compute the mean of the residuals using mean(resid(KF_model))
. What interesting value do you get?4
Residual plots are scatter plots with the residuals on one axis. Create a scatter plot of residuals vs length (the predictor) using the formula resid(KF_model) ~ length
.
Create a scatter plot of residuals vs fits using the formula resid(KF_model) ~ fitted(KF_model). How does this scatter plot compare to the one you just made?
The intercept and slope of a regression line are called the coefficients. Give coef(KF_model)
a try and see what you get.
What foot width does the linear model predict for a foot that is 22 cm long? (Note: none of the kids has a foot that long, so you won’t be able to use fitted()
.)
Here is a way to get R to compute the previous value.
predicted_width <- makeFun(KF_model)
predicted_width(length = 22)
## 1
## 8.317127
What foot width does this linear model predict if the length is 15 cm? Why is this prediction less reliable than the previous prediction?
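The numeric answer for this one isn't shown below, but the predicted_width() function created earlier with makeFun() handles it; a minimal sketch (the value follows from the coefficients shown later):
predicted_width(length = 15)  # 2.8622761 + 0.2479478 * 15, about 6.58 cm
This is an extrapolation: 15 cm is shorter than every foot length in the data, so we have no evidence that the straight-line pattern holds for feet that small.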
KF_model <- lm(width ~ length, data = KidsFeet)
Fits and residuals:
fitted(KF_model)
## 1 2 3 4 5 6 7 8
## 8.912201 9.160149 8.936996 9.110560 9.085765 9.234534 9.333713 8.565075
## 9 10 11 12 13 14 15 16
## 8.713843 8.540280 9.680840 9.011381 9.333713 9.556866 9.308918 8.738638
## 17 18 19 20 21 22 23 24
## 8.813022 8.986586 9.482481 9.184944 8.813022 8.912201 8.813022 8.936996
## 25 26 27 28 29 30 31 32
## 8.862612 9.581660 9.333713 9.184944 8.862612 8.788228 8.813022 8.441101
## 33 34 35 36 37 38 39
## 8.936996 8.713843 8.986586 8.540280 9.308918 8.217948 8.961791
resid(KF_model)
## 1 2 3 4 5 6
## -0.51220149 -0.36014925 0.76300373 0.68944030 -0.18576493 0.46546642
## 7 8 9 10 11 12
## 0.26628731 0.23492537 0.58615672 0.25972015 0.11916045 -0.11138060
## 13 14 15 16 17 18
## -0.23371269 0.24313433 -0.00891791 -0.83863806 -0.11302239 -0.18658582
## 19 20 21 22 23 24
## -0.48248134 0.31505597 0.38697761 -0.31220149 -0.51302239 0.06300373
## 25 26 27 28 29 30
## -0.76261194 -0.18166045 0.16628731 0.31505597 0.03738806 0.51177239
## 31 32 33 34 35 36
## 0.48697761 0.15889925 -0.33699627 0.28615672 -0.38658582 -0.04027985
## 37 38 39
## -0.30891791 -0.31794776 -0.16179104
These two plots look basically the same except for the axis labeling:
gf_point(resid(KF_model) ~ length, data = KidsFeet) %>%
gf_labs(title = "residuals vs explanatory variable")
gf_point(resid(KF_model) ~ fitted(KF_model), data = KidsFeet) %>%
gf_labs(title = "residuals vs fitted values")
Coefficients:
coef(KF_model) # coefficients (intercept and slope)
## (Intercept) length
## 2.8622761 0.2479478
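As a cross-check, the coefficients let you reproduce the earlier 22 cm prediction by hand:
coef(KF_model)[1] + coef(KF_model)[2] * 22  # 2.8622761 + 0.2479478 * 22 = 8.317...
which matches the output of predicted_width(length = 22) above.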
Below are four scatter plots, four residuals vs predictor plots, and four residuals vs fits plots for the same four data sets. Identify which are which and match them up.
Why might residual plots be better for showing how well a line fits the data than the original scatter plot?
Compare plots A–D with plots W–Z. How do those two types of residual plots compare?5
Often you will see regression lines summarized with a table like the one produced by summary(KF_model)
. Give it a try. There will be more information there than we have learned about, but you should be able to locate the intercept and slope in such a summary.6
One of the things listed in that summary is labeled R-squared. That's the square of the correlation coefficient \(R\). Use that information to compute \(R\) for our model.
Compute the variance (standard deviation squared) for the response variable width. (You can use sd() and square it, or you can use var() to get the variance directly.)
var( ~width, data = KidsFeet)
## [1] 0.2596761
Now compute the variance of the fitted values and the variance of the residuals. What relationship do you observe between the three variances? This relationship holds for all linear models.
var( ~fitted(KF_model))
## [1] 0.106728
var( ~resid(KF_model))
## [1] 0.1529482
var( ~fitted(KF_model)) + var( ~resid(KF_model))
## [1] 0.2596761
var( ~width, data = KidsFeet)
## [1] 0.2596761
Now compute the ratio of the variance of the fitted values to the variance of the response (width
).
You should see that this is exactly \(R^2\). That is,
\[ R^2 = \frac{s^2_{\mathrm{fitted}}}{s^2_{y}} = 1- \frac{s^2_{\mathrm{resid}}}{s^2_{y}} \]
We can interpret \(R^2\) as follows: it is the fraction of the variation in the response that is explained by or accounted for by the linear model. The rest, \(1 - R^2\), is not explained by or accounted for by the model. If \(R^2\) is 1, then all of the variability is accounted for by the model and the points fall exactly on the regression line.
var( ~fitted(KF_model)) / var( ~width, data = KidsFeet)
## [1] 0.4110041
cor(width ~ length, data = KidsFeet)^2
## [1] 0.4110041
rsquared(KF_model)
## [1] 0.4110041
If you just want the value of \(R^2\), you can get it by (a) squaring the correlation coefficient (use cor()) or (b) using the rsquared() function from the mosaic package. Try it both ways.
You should see that the value is essentially 0. This is true for every linear model: the average residual is always 0.↩︎
If you are wondering why we have both types, the main reason is that the two types are more different (and reveal different information about the model) in models with multiple predictors.↩︎
If you want a slightly more minimal summary, you can try msummary() instead of summary().↩︎
You should see that the value is essentially 0. This is true for every linear model: the average residual is always 0.↩︎
If you are wondering why we have both types, the main reason is that the two types are more different (and reveal different information about the model) in models with multiple predictors.↩︎
If you want a slightly more minimal summary, you can try msummary() instead of summary().↩︎