Some solutions and comments

Predicting Foot Width from Foot Length

Let's use the KidsFeet data set to predict foot width from foot length.

  1. Create a scatter plot of the two variables. Which should go on the y-axis? Why?

    The explanatory variable (length) goes on the x-axis; the response (width) goes on the y-axis.

  2. Use %>% gf_lm() to add a regression line to your scatter plot.

    gf_point(width ~ length, data = KidsFeet) %>% gf_lm()

  3. Using the lm() function, fit a linear model to predict foot width from foot length.

    1. What are the slope and intercept of the regression line?

    2. Write down an equation for predicting width from length by filling in the blanks

      \[ \hat{\mbox{width}} = \ \rule{0.5in}{0.3pt} \ + \ \rule{0.5in}{0.3pt}\ \cdot \mbox{length} \]

    3. What does the hat on top of width mean in this equation?

    4. Use your equation to predict the width of a foot that is 24 cm long.

    5. Use your equation to predict the width of a foot that is 26 cm long.

    6. What is the difference in these two predictions? Why?

    7. Now compute the residuals for Julie and Caroline. (Remember: residual = observed - predicted.)

      (You might find KidsFeet %>% View() useful. You can get the same thing by clicking the little icon that looks like a spreadsheet in the Environment pane.)

    lm(width ~ length, data = KidsFeet) 
    ## 
    ## Call:
    ## lm(formula = width ~ length, data = KidsFeet)
    ## 
    ## Coefficients:
    ## (Intercept)       length  
    ##       2.862        0.248

    So the equation is \(\hat{\mathrm{width}} = 2.86 + 0.248 \cdot \mathrm{length}\).

    • If foot length is 24 cm: \(\hat{\mathrm{width}} = 2.86 + 0.248 \cdot 24 = 8.812\).

    • If foot length is 26 cm: \(\hat{\mathrm{width}} = 2.86 + 0.248 \cdot 26 = 9.308\).

    • The difference is \(2 \cdot 0.248\), twice the slope. If you change the predictor by 2, the predicted response changes by 2 times the slope.
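
    As a quick check, in R:

    (2.86 + 0.248 * 26) - (2.86 + 0.248 * 24)  # difference between the two predictions
    ## [1] 0.496
    2 * 0.248                                  # twice the slope
    ## [1] 0.496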

    Julie and Caroline are observations 15 and 17. You can just scan for them in the data. Here is a way to get just those two rows. (pander() does fancier printing.)

    KidsFeet %>% 
      filter(name %in% c('Julie', 'Caroline')) %>% 
      pander()
    name     | birthmonth | birthyear | length | width | sex | biggerfoot | domhand
    ---------|------------|-----------|--------|-------|-----|------------|--------
    Julie    | 11         | 87        | 26     | 9.3   | G   | L          | R
    Caroline | 12         | 87        | 24     | 8.7   | G   | R          | L

    So the residuals are

    • Julie: \(9.3 - (2.86 + 0.248 \cdot 26) = -0.008\), and
    • Caroline: \(8.7 - (2.86 + 0.248 \cdot 24) = -0.112\).

    (These agree with the values R computes below, up to rounding of the coefficients.)
  4. If we save the result of lm(), we can do some extra things.

    1. Save your model as KF_model

      KF_model <- lm(width ~ length, data = KidsFeet)
    2. Compute the predicted values for every kid in the dataset using fitted(KF_model). What order are the results in? Find the fitted values for Julie and Caroline.

    3. Compute the residuals for every kid in the dataset using resid(KF_model). Find the residuals for Julie and Caroline.

    4. Compute the mean of the residuals using mean(resid(KF_model)). What interesting value do you get?1

      mean(resid(KF_model))
      ## [1] -1.388e-17
    5. Residual plots are scatter plots with the residuals on one axis. Create a scatter plot of residuals vs length (the predictor) using the formula resid(KF_model) ~ length.

    6. Create a scatter plot of residuals vs fits using the formula resid(KF_model) ~ fitted(KF_model). How does this scatter plot compare to the one you just made?

    7. The intercept and slope of a regression line are called the coefficients. Give coef(KF_model) a try and see what you get.

    8. What foot width does the linear model predict for a foot that is 22 cm long? (Note: none of the kids has a foot that long, so you won’t be able to use fitted().)
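
      One way to compute this by hand is from the coefficients (a quick sketch; the next item shows a slicker way):

      b <- coef(KF_model)       # intercept and slope
      unname(b[1] + b[2] * 22)  # about 8.32, matching the makeFun() result below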

    9. Here is a way to get R to compute the previous value.

      predicted_width <- makeFun(KF_model)
      predicted_width(length = 22)
      ##     1 
      ## 8.317
    10. What foot width does this linear model predict if the length is 15 cm? Why is this prediction less reliable than the previous prediction?

    KF_model <- 
      lm(width ~ length, data = KidsFeet) 

    Fits and residuals are in the same order as the original data2, so we want items 15 and 17:

    fitted(KF_model)
    ##     1     2     3     4     5     6     7     8     9    10    11    12    13 
    ## 8.912 9.160 8.937 9.111 9.086 9.235 9.334 8.565 8.714 8.540 9.681 9.011 9.334 
    ##    14    15    16    17    18    19    20    21    22    23    24    25    26 
    ## 9.557 9.309 8.739 8.813 8.987 9.482 9.185 8.813 8.912 8.813 8.937 8.863 9.582 
    ##    27    28    29    30    31    32    33    34    35    36    37    38    39 
    ## 9.334 9.185 8.863 8.788 8.813 8.441 8.937 8.714 8.987 8.540 9.309 8.218 8.962
    resid(KF_model)
    ##         1         2         3         4         5         6         7         8 
    ## -0.512201 -0.360149  0.763004  0.689440 -0.185765  0.465466  0.266287  0.234925 
    ##         9        10        11        12        13        14        15        16 
    ##  0.586157  0.259720  0.119160 -0.111381 -0.233713  0.243134 -0.008918 -0.838638 
    ##        17        18        19        20        21        22        23        24 
    ## -0.113022 -0.186586 -0.482481  0.315056  0.386978 -0.312201 -0.513022  0.063004 
    ##        25        26        27        28        29        30        31        32 
    ## -0.762612 -0.181660  0.166287  0.315056  0.037388  0.511772  0.486978  0.158899 
    ##        33        34        35        36        37        38        39 
    ## -0.336996  0.286157 -0.386586 -0.040280 -0.308918 -0.317948 -0.161791
    fitted(KF_model)[15]
    ##    15 
    ## 9.309
    fitted(KF_model)[17]
    ##    17 
    ## 8.813
    resid(KF_model)[15]
    ##        15 
    ## -0.008918
    resid(KF_model)[17]
    ##     17 
    ## -0.113

    These two plots look basically the same except for the axis labeling:

    gf_point(resid(KF_model) ~ length, data = KidsFeet) %>%
      gf_labs(title = "residuals vs explanatory variable")

    gf_point(resid(KF_model) ~ fitted(KF_model), data = KidsFeet) %>%
      gf_labs(title = "residuals vs fitted values")

    Coefficients:

    coef(KF_model)  # coefficients (intercept and slope)
    ## (Intercept)      length 
    ##      2.8623      0.2479
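
    Finally, for item 10: all of the observed foot lengths are well above 15 cm, so this prediction is an extrapolation, which makes it less reliable.

    predicted_width(length = 15)  # about 6.58 cm, an extrapolation well outside the data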

More about Residuals and Residual Plots

  1. Below are four scatter plots, four residuals vs predictor plots, and four residuals vs fits plots for the same four data sets. Identify which are which and match them up.

    Here is the matching

    • 1 – D – X
    • 2 – B – W
    • 3 – C – Z
    • 4 – A – Y
  2. Why might residual plots be better for showing how well a line fits the data than the original scatter plot?

    They use the space better, essentially magnifying things so we can see them better.

  3. Compare plots A–D with plots W–Z. How do those two types of residual plots compare?3

    Notice that A and Y, and C and Z, are mirror images of each other. That happens when the slope is negative: the fitted values are a linear function of the predictor, so putting fits instead of the predictor on the x-axis just relabels the axis, and a negative slope reverses its direction. Otherwise the plots are identical in pairs except for the axis labeling.

R-squared

  1. Often you will see regression lines summarized with a table like the one produced by summary(KF_model). Give it a try. There will be more information there than we have learned about, but you should be able to locate the intercept and slope in such a summary.4
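
    For reference, the call is simply the one below; the intercept and slope appear in the Estimate column of the coefficient table.

    summary(KF_model)    # msummary(KF_model) gives a slightly more compact version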

  2. One of the things listed in that summary is labeled R-squared. That's the square of the correlation coefficient \(R\). Use that information to compute \(R\) for our model.
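
    Since the slope is positive, \(R\) is the positive square root of \(R^2\):

    sqrt(rsquared(KF_model))              # R = sqrt(0.411), about 0.641
    cor(width ~ length, data = KidsFeet)  # the same value, computed directly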

  3. Compute the variance (standard deviation squared) for the response variable width. (You can use sd() and square it, or you can use var() to get the variance directly.)

    var( ~width, data = KidsFeet)
    ## [1] 0.2597
  4. Now compute the variance of the fitted values and the variance of the residuals. What relationship do you observe between the three variances? This relationship holds for all linear models.

    var( ~fitted(KF_model))
    ## [1] 0.1067
    var( ~resid(KF_model))
    ## [1] 0.1529
    var( ~fitted(KF_model)) + var( ~resid(KF_model))
    ## [1] 0.2597
    var( ~width, data = KidsFeet)
    ## [1] 0.2597
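
    In symbols, the variance of the response splits into two pieces:

    \[ \mathrm{Var}(\mbox{width}) = \mathrm{Var}(\mbox{fitted values}) + \mathrm{Var}(\mbox{residuals}), \]

    and indeed \(0.1067 + 0.1529 = 0.2597\) (up to rounding).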
  5. Now compute the ratio of the variance of the fitted values to the variance of the response (width)….

    var( ~fitted(KF_model)) / var( ~width, data = KidsFeet)
    ## [1] 0.411
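
    This ratio is the fraction of the variance in width accounted for by the model; it is exactly the \(R^2\) value computed in the next item.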
  6. If you just want the value of \(R^2\), you can get it by (a) squaring the correlation coefficient (use cor()) or (b) using the rsquared() function from the mosaic package. Try it both ways.

    cor(width ~ length, data = KidsFeet)^2
    ## [1] 0.411
    rsquared(KF_model)
    ## [1] 0.411

  1. You should see that the value is essentially 0 (the tiny nonzero bit is floating-point round-off). This is true for every linear model: the average residual is always 0.↩︎

  2. If there is missing data, those observations won’t have fits or residuals and get skipped. This is a pain. Often it is easiest to remove all the unneeded observations from the dataset before fitting the model.↩︎

  3. If you are wondering why we have both types, the main reason is that the two types are more different (and reveal different information about the model) in models with multiple predictors.↩︎

  4. If you want a slightly more minimal summary, you can try msummary() instead of summary().↩︎