Each time you start a section (including the first one) on this sheet, send a Teams message to Professor Pruim.

Predicting Foot Width from Foot Length

Let's use the KidsFeet data set to predict foot width from foot length.

  1. Create a scatter plot of the two variables. Which should go on the y-axis? Why?

  2. Use %>% gf_lm() to add a regression line to your scatter plot.

  3. Using the lm() function, fit a linear model to predict foot width from foot length.

    1. What are the slope and intercept of the regression line?

    2. Write down an equation for predicting width from length by filling in the blanks

      \[ \hat{\mbox{width}} = \ \rule{0.5in}{0.3pt} \ + \ \rule{0.5in}{0.3pt}\ \cdot \mbox{length} \]

    3. What does the hat on top of width mean in this equation?

    4. Use your equation to predict the width of a foot that is 24 cm long.

    5. Use your equation to predict the width of a foot that is 26 cm long.

    6. What is the difference in these two predictions? Why?

    7. Now compute the residuals for Julie and Caroline. (Remember: residual = observed - predicted.)

      (You might find the following useful: KidsFeet %>% View(). You can get the same thing by clicking the little icon that looks like a spreadsheet in the Environment pane.)

  4. If we save the result of lm(), we can do some extra things.

    1. Save your model as KF_model

      KF_model <- lm(width ~ length, data = KidsFeet)
    2. Compute the predicted values for every kid in the dataset using fitted(KF_model). What order are the results in? Find the fitted values for Julie and Caroline.

    3. Compute the residuals for every kid in the dataset using resid(KF_model). Find the residuals for Julie and Caroline.

    4. Compute the mean of the residuals using mean(resid(KF_model)). What interesting value do you get?1

    5. Residual plots are scatter plots with the residuals on one axis. Create a scatter plot of residuals vs length (the predictor) using the formula resid(KF_model) ~ length.

    6. Create a scatter plot of residuals vs fits using the formula resid(KF_model) ~ fitted(KF_model). How does this scatter plot compare to the one you just made?

    7. The intercept and slope of a regression line are called the coefficients. Give coef(KF_model) a try and see what you get.

    8. What foot width does the linear model predict for a foot that is 22 cm long? (Note: none of the kids has a foot that long, so you won’t be able to use fitted().)

    9. Here is a way to get R to compute the previous value.

      predicted_width <- makeFun(KF_model)
      predicted_width(length = 22)
      ##        1 
      ## 8.317127
    10. What foot width does this linear model predict if the length is 15 cm? Why is this prediction less reliable than the previous prediction?

More about Residuals and Residual Plots

  1. Below are four scatter plots, four residuals vs predictor plots, and four residuals vs fits plots for the same four data sets. Identify which are which and match them up.

  2. Why might residual plots be better for showing how well a line fits the data than the original scatter plot?

  3. Compare plots A–D with plots W–Z. How do those two types of residual plots compare?2

R-squared

  1. Often you will see regression lines summarized with a table like the one produced by summary(KF_model). Give it a try. There will be more information there than we have learned about, but you should be able to locate the intercept and slope in such a summary.3

  2. One of the things listed in that summary is labeled R-squared. That’s the square of the correlation coefficient \(R\). Use that information to compute \(R\) for our model.

  3. Compute the variance (standard deviation squared) for the response variable width. (You can use sd() and square it, or you can use var() to get the variance directly.)

  4. Now compute the variance of the fitted values and the variance of the residuals. What relationship do you observe between the three variances? This relationship holds for all linear models.

  5. Now compute the ratio of the variance of the fitted values to the variance of the response (width).

    You should see that this is exactly \(R^2\). That is,

    \[ R^2 = \frac{s^2_{\mathrm{fitted}}}{s^2_{y}} = 1- \frac{s^2_{\mathrm{resid}}}{s^2_{y}} \]

    We can interpret \(R^2\) as follows: it is the fraction of the variation in the response that is explained by (or accounted for by) the linear model. The rest, \(1- R^2\), is not explained by the model. If \(R^2\) is 1, then all of the variability is accounted for by the model and the points fall exactly on the regression line.

  6. If you just want the value of \(R^2\), you can get it by squaring the correlation coefficient (use cor()) or by using the rsquared() function from the mosaic package. Try it both ways.
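As a quick check on the variance identity above, here is a minimal sketch in base R. It uses the built-in cars data set (stopping distance vs speed) as a stand-in, since KidsFeet lives in the mosaicData package; the idea carries over to KF_model unchanged.

```r
# Variance decomposition for a linear model with an intercept:
# var(response) = var(fitted values) + var(residuals)
fit <- lm(dist ~ speed, data = cars)

var(cars$dist)                      # variance of the response
var(fitted(fit)) + var(resid(fit))  # same value

# R-squared two ways: the variance ratio and the squared correlation
var(fitted(fit)) / var(cars$dist)
cor(cars$dist, cars$speed)^2
```

The identity holds because the residuals of a least-squares fit average to zero and are uncorrelated with the fitted values, so the two variances add.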

Important R functions for linear models
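As a sketch of that toolkit, here are the model functions used on this sheet, demonstrated with base R's built-in cars data set (a stand-in for KidsFeet, which requires the mosaicData package):

```r
# Fit a linear model: response ~ predictor
cars_model <- lm(dist ~ speed, data = cars)

coef(cars_model)    # intercept and slope
fitted(cars_model)  # predicted response for each observation
resid(cars_model)   # observed - predicted for each observation

# Predict at a new predictor value; makeFun() from the mosaic package
# wraps this up as a function you can call like predicted_width()
predict(cars_model, newdata = data.frame(speed = 15))

# summary() (or mosaic's msummary()) shows the coefficient table;
# mosaic's rsquared() returns just the R-squared value
summary(cars_model)$r.squared
```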

Bonus Section

If you finish the things above, here is a bonus problem for you.

Suppose that, out of a cohort of 120 patients with stage 1 lung cancer treated with a new surgical approach at the Dana-Farber Cancer Institute (DFCI), 80 survive at least 5 years. Suppose also that National Cancer Institute statistics indicate that the 5-year survival probability for stage 1 lung cancer patients nationally is 0.60. Do the data from these 120 patients support the claim that the DFCI population treated with this new form of surgery has a different 5-year survival probability than the national population?

  1. Why is this situation more like the Lady Tasting Tea than the malaria vaccine trial?

  2. There is one difference between this and the Lady Tasting Tea situation, however. What is it?

  3. Express the null and alternative hypotheses for this situation. Do it in words and in mathematical notation.

  4. See if you can figure out how to generate a null distribution for this situation.

    1. Describe how you might do it with a physical simulation (cards, coins, dice, etc.)
    2. How can you get R to do this lots of times for you?
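One possible sketch for the R part (the object names and the number of repetitions here are choices, not requirements): under the null hypothesis each patient independently survives 5 years with probability 0.60, so rbinom() can simulate many cohorts of 120 at once.

```r
# Simulate the null distribution: counts of 5-year survivors out of
# 120 patients when the true survival probability is 0.60
set.seed(1234)  # arbitrary seed, just for reproducibility
null_counts <- rbinom(10000, size = 120, prob = 0.60)

# Where does the observed count of 80 fall in this distribution?
mean(null_counts >= 80)  # proportion of simulations at least this large (one side)
```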

Some Solutions

Predicting Foot Width from Foot Length

Let's use the KidsFeet data set to predict foot width from foot length.

  1. Create a scatter plot of the two variables. Which should go on the y-axis? Why?

    Explanatory on the x-axis, response on the y-axis.

  2. Use %>% gf_lm() to add a regression line to your scatter plot.

  3. Using the lm() function, fit a linear model to predict foot width from foot length.

    1. What are the slope and intercept of the regression line?

    2. Write down an equation for predicting width from length by filling in the blanks

      \[ \hat{\mbox{width}} = \ \rule{0.5in}{0.3pt} \ + \ \rule{0.5in}{0.3pt}\ \cdot \mbox{length} \]

    3. What does the hat on top of width mean in this equation?

    4. Use your equation to predict the width of a foot that is 24 cm long.

    5. Use your equation to predict the width of a foot that is 26 cm long.

    6. What is the difference in these two predictions? Why?

    7. Now compute the residuals for Julie and Caroline. (Remember: residual = observed - predicted.)

      (You might find the following useful: KidsFeet %>% View(). You can get the same thing by clicking the little icon that looks like a spreadsheet in the Environment pane.)

    So the equation is \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot \mbox{length}\).

    • If foot length is 24 cm: \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot 24 = 8.788\).

    • If foot length is 26 cm: \(\hat{\mbox{width}} = 2.86 + 0.247 \cdot 26 = 9.282\).

    • The difference is \(2 \cdot 0.247\), twice the slope. If you change the predictor by 2, the predicted response changes by 2 times the slope.

    Julie and Caroline are observations 15 and 17. You can just scan for them in the data. Here is a way to get just those two rows. (pander() does fancier printing.)

    KidsFeet %>% filter(name %in% c('Julie', 'Caroline')) %>% pander()
    name       birthmonth   birthyear   length   width   sex   biggerfoot   domhand
    Julie              11          87       26     9.3   G     L            R
    Caroline           12          87       24     8.7   G     R            L

    So the residuals are \(9.3 - (2.86 + 0.247 \cdot 26) = 0.018\) for Julie and \(8.7 - (2.86 + 0.247 \cdot 24) = -0.088\) for Caroline.

  4. If we save the result of lm(), we can do some extra things.

    1. Save your model as KF_model

      KF_model <- lm(width ~ length, data = KidsFeet)
    2. Compute the predicted values for every kid in the dataset using fitted(KF_model). What order are the results in? Find the fitted values for Julie and Caroline.

    3. Compute the residuals for every kid in the dataset using resid(KF_model). Find the residuals for Julie and Caroline.

    4. Compute the mean of the residuals using mean(resid(KF_model)). What interesting value do you get?4

    5. Residual plots are scatter plots with the residuals on one axis. Create a scatter plot of residuals vs length (the predictor) using the formula resid(KF_model) ~ length.

    6. Create a scatter plot of residuals vs fits using the formula resid(KF_model) ~ fitted(KF_model). How does this scatter plot compare to the one you just made?

    7. The intercept and slope of a regression line are called the coefficients. Give coef(KF_model) a try and see what you get.

    8. What foot width does the linear model predict for a foot that is 22 cm long? (Note: none of the kids has a foot that long, so you won’t be able to use fitted().)

    9. Here is a way to get R to compute the previous value.

      predicted_width <- makeFun(KF_model)
      predicted_width(length = 22)
      ##        1 
      ## 8.317127
    10. What foot width does this linear model predict if the length is 15 cm? Why is this prediction less reliable than the previous prediction?

    KF_model <- 
      lm(width ~ length, data = KidsFeet) 

    Fits and residuals:

    fitted(KF_model)
    ##        1        2        3        4        5        6        7        8 
    ## 8.912201 9.160149 8.936996 9.110560 9.085765 9.234534 9.333713 8.565075 
    ##        9       10       11       12       13       14       15       16 
    ## 8.713843 8.540280 9.680840 9.011381 9.333713 9.556866 9.308918 8.738638 
    ##       17       18       19       20       21       22       23       24 
    ## 8.813022 8.986586 9.482481 9.184944 8.813022 8.912201 8.813022 8.936996 
    ##       25       26       27       28       29       30       31       32 
    ## 8.862612 9.581660 9.333713 9.184944 8.862612 8.788228 8.813022 8.441101 
    ##       33       34       35       36       37       38       39 
    ## 8.936996 8.713843 8.986586 8.540280 9.308918 8.217948 8.961791
    resid(KF_model)
    ##           1           2           3           4           5           6 
    ## -0.51220149 -0.36014925  0.76300373  0.68944030 -0.18576493  0.46546642 
    ##           7           8           9          10          11          12 
    ##  0.26628731  0.23492537  0.58615672  0.25972015  0.11916045 -0.11138060 
    ##          13          14          15          16          17          18 
    ## -0.23371269  0.24313433 -0.00891791 -0.83863806 -0.11302239 -0.18658582 
    ##          19          20          21          22          23          24 
    ## -0.48248134  0.31505597  0.38697761 -0.31220149 -0.51302239  0.06300373 
    ##          25          26          27          28          29          30 
    ## -0.76261194 -0.18166045  0.16628731  0.31505597  0.03738806  0.51177239 
    ##          31          32          33          34          35          36 
    ##  0.48697761  0.15889925 -0.33699627  0.28615672 -0.38658582 -0.04027985 
    ##          37          38          39 
    ## -0.30891791 -0.31794776 -0.16179104

    These two plots look basically the same except for the axis labeling:

    gf_point(resid(KF_model) ~ length, data = KidsFeet) %>%
      gf_labs(title = "residuals vs explanatory variable")

    gf_point(resid(KF_model) ~ fitted(KF_model), data = KidsFeet) %>%
      gf_labs(title = "residuals vs fitted values")

    Coefficients:

    coef(KF_model)  # coefficients (intercept and slope)
    ## (Intercept)      length 
    ##   2.8622761   0.2479478

More about Residuals and Residual Plots

  1. Below are four scatter plots, four residuals vs predictor plots, and four residuals vs fits plots for the same four data sets. Identify which are which and match them up.

  2. Why might residual plots be better for showing how well a line fits the data than the original scatter plot?

  3. Compare plots A–D with plots W–Z. How do those two types of residual plots compare?5

R-squared

  1. Often you will see regression lines summarized with a table like the one produced by summary(KF_model). Give it a try. There will be more information there than we have learned about, but you should be able to locate the intercept and slope in such a summary.6

  2. One of the things listed in that summary is labeled R-squared. That’s the square of the correlation coefficient \(R\). Use that information to compute \(R\) for our model.

  3. Compute the variance (standard deviation squared) for the response variable width. (You can use sd() and square it, or you can use var() to get the variance directly.)

    var( ~width, data = KidsFeet)
    ## [1] 0.2596761
  4. Now compute the variance of the fitted values and the variance of the residuals. What relationship do you observe between the three variances? This relationship holds for all linear models.

    var( ~fitted(KF_model))
    ## [1] 0.106728
    var( ~resid(KF_model))
    ## [1] 0.1529482
    var( ~fitted(KF_model)) + var( ~resid(KF_model))
    ## [1] 0.2596761
    var( ~width, data = KidsFeet)
    ## [1] 0.2596761
  5. Now compute the ratio of the variance of the fitted values to the variance of the response (width).

    You should see that this is exactly \(R^2\). That is,

    \[ R^2 = \frac{s^2_{\mathrm{fitted}}}{s^2_{y}} = 1- \frac{s^2_{\mathrm{resid}}}{s^2_{y}} \]

    We can interpret \(R^2\) as follows: it is the fraction of the variation in the response that is explained by (or accounted for by) the linear model. The rest, \(1- R^2\), is not explained by the model. If \(R^2\) is 1, then all of the variability is accounted for by the model and the points fall exactly on the regression line.

    var( ~fitted(KF_model)) / var( ~width, data = KidsFeet)
    ## [1] 0.4110041
    cor(width ~ length, data = KidsFeet)^2
    ## [1] 0.4110041
    rsquared(KF_model)
    ## [1] 0.4110041
  6. If you just want the value of \(R^2\), you can get it by squaring the correlation coefficient (use cor()) or by using the rsquared() function from the mosaic package. Try it both ways.


  1. You should see that the value is essentially 0. This is true for every linear model: the average residual is always 0.↩︎

  2. If you are wondering why we have both types, the main reason is that the two types are more different (and reveal different information about the model) in models with multiple predictors.↩︎

  3. If you want a slightly more minimal summary, you can try msummary() instead of summary().↩︎

  4. You should see that the value is essentially 0. This is true for every linear model: the average residual is always 0.↩︎

  5. If you are wondering why we have both types, the main reason is that the two types are more different (and reveal different information about the model) in models with multiple predictors.↩︎

  6. If you want a slightly more minimal summary, you can try msummary() instead of summary().↩︎