Ulysses’ Compass (Chapter 7 in Rethinking)

Chapter 7 (which we skipped over) contends with two contrasting kinds of statistical error: overfitting (learning too much from the sample, so predictions for new data suffer) and underfitting (learning too little, so the model misses regular features of the sample).

Our job is to carefully navigate among these monsters. There are two common families of approaches. The first approach is to use a regularizing prior to tell the model not to get too excited by the data. This is the same device that non-Bayesian methods refer to as “penalized likelihood.” The second approach is to use some scoring device, like information criteria or cross-validation, to model the prediction task and estimate predictive accuracy. Both families of approaches are routinely used in the natural and social sciences. Furthermore, they can be–maybe should be–used in combination. So it’s worth understanding both, as you’re going to need both at some point. (p. 192, emphasis in the original)

Goals for Today

  1. Introduce \(R^2\) and what it does and doesn’t tell us about our model.

  2. See how making a model too complex can lead to overfitting.

  3. Start thinking about how we might tell how well our model can be expected to fit new data.

    If we could measure this, it would help us avoid underfitting or overfitting.

\(R^2\)

\(R^2\) is a popular way to measure how well a model fits the data used to train the model. This is sometimes called retrodiction (in comparison to prediction, which looks forward to new data).

It traditionally follows the form

\[R^2 = \frac{\text{var(outcome)} - \text{var(residuals)}}{\text{var(outcome)}} = 1 - \frac{\text{var(residuals)}}{\text{var(outcome)}}.\]

By \(\operatorname{var}()\), of course, we mean variance (i.e., what you get from the var() function in R). McElreath is not a fan of \(R^2\). But you will see it a lot in some fields, and it's good to know what it does and doesn't do.
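
As a quick sanity check on the formula (a small aside using R's built-in cars data and an ordinary lm() fit, not one of our Bayesian models), one minus the ratio of variances reproduces the \(R^2\) that lm() reports:

fit_ols <- lm(dist ~ speed, data = cars)    # ordinary least-squares fit
1 - var(resid(fit_ols)) / var(cars$dist)    # the formula above
summary(fit_ols)$r.squared                  # same value (about 0.65)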

Some Brains Data

We’ll start off by making the data with brain size and body size for seven species.

Brains <- 
  tibble(species = c("afarensis", "africanus", "habilis", "boisei", 
                     "rudolfensis", "ergaster", "sapiens"), 
         brain   = c(438, 452, 612, 521, 752, 871, 1350), 
         mass    = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5))
Brains %>% pander()
species       brain   mass
-----------   -----   ----
afarensis       438   37
africanus       452   35.5
habilis         612   34.5
boisei          521   41.5
rudolfensis     752   55.5
ergaster        871   61
sapiens        1350   53.5
library(ggrepel)
library(ggformulaExtra)

Brains %>%
  gf_point(brain ~ mass) %>%
  gf_text_repel(label = ~species) %>%
  gf_labs(
    subtitle = "Average brain volume by body\nmass for six hominin species",
    x = "body mass (kg)",
    y = "brain volume (cc)") %>%
  gf_lims(x = c(30, 65))

Before fitting our models,

we want to standardize body mass–give it mean zero and standard deviation one–and rescale the outcome, brain volume, so that the largest observed value is 1. Why not standardize brain volume as well? Because we want to preserve zero as a reference point: No brain at all. You can’t have negative brain. I don’t think. (p. 195)

Brains <-
  Brains %>% 
  mutate(
    mass_std  = (mass - mean(mass)) / sd(mass),
    brain_std = brain / max(brain))

A simple model

Our first statistical model will follow the form

\[\begin{align*} \text{brain_std}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha + \beta \text{mass_std}_i \\ \alpha & \sim \operatorname{Normal}(0.5, 1) \\ \beta & \sim \operatorname{Normal}(0, 10) \\ \sigma & \sim \operatorname{Log-Normal}(0, 1). \end{align*}\]

This simply says that the average brain volume \(b_i\) of species \(i\) is a linear function of its body mass \(m_i\). Now consider what the priors imply. The prior for \(\alpha\) is just centered on the mean brain volume (rescaled) in the data. So it says that the average species with an average body mass has a brain volume with an 89% credible interval from about −1 to 2. That is ridiculously wide and includes impossible (negative) values. The prior for \(\beta\) is very flat and centered on zero. It allows for absurdly large positive and negative relationships. These priors allow for absurd inferences, especially as the model gets more complex. And that’s part of the lesson. (p. 196)
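
We can check the claim about the \(\alpha\) prior directly. This is just the Normal(0.5, 1) prior and the 89% tail boundaries, nothing model-specific:

qnorm(c(0.055, 0.945), mean = 0.5, sd = 1)   # about -1.1 and 2.1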

library(rethinking)

m7.1 <- quap( 
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)), 
    mu <- a + b * mass_std,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), 
  data = Brains)

Check the summary.

precis(m7.1)
##                 mean         sd        5.5%      94.5%
## a          0.5285412 0.06841558  0.41919992  0.6378826
## b          0.1671113 0.07406875  0.04873515  0.2854875
## log_sigma -1.7068544 0.29374043 -2.17630834 -1.2374005
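
Because the model is parameterized in terms of log_sigma, exponentiate to get \(\sigma\) itself on the brain_std scale. Here we lean on coef(), which should return the posterior (MAP) means from a quap() fit:

exp(coef(m7.1)["log_sigma"])   # sigma is roughly 0.18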

Computing \(R^2\)

Although rethinking does not contain a convenience function for computing the \(R^2\) directly, McElreath showed us how to make a function for that purpose ourselves. Let’s make two modifications to his R2_is_bad() function:

  • Use link() instead of sim(). Both should have the same average for these models, but link() will estimate it more accurately.
  • Add a seed argument for reproducibility.
R2_is_bad <- function(quap_fit, seed = 7, ...) {
  set.seed(seed)
  l <- link(quap_fit, refresh = 0, ...)
  r <- Brains$brain_std - apply(l, 2, mean) 
  1 - var(r) / var(Brains$brain_std)
}

Here’s the estimate for our \(R^2\).

R2_is_bad(m7.1)
## [1] 0.490125

Do note that,

in principle, the Bayesian approach mandates that we do this for each sample from the posterior. But \(R^2\) is traditionally computed only at the mean prediction. So we’ll do that as well here. Later in the chapter you’ll learn a properly Bayesian score that uses the entire posterior distribution. (p. 197)
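
As a preview of what “using the entire posterior” might look like, here’s a minimal sketch (R2_by_draw() is our own name, not a function from rethinking or the text) that computes an \(R^2\) for every posterior draw rather than only at the mean prediction:

R2_by_draw <- function(quap_fit, seed = 7, ...) {
  set.seed(seed)
  l <- link(quap_fit, refresh = 0, ...)    # one row per posterior draw
  apply(l, 1, function(mu) {
    r <- Brains$brain_std - mu             # residuals for this draw
    1 - var(r) / var(Brains$brain_std)
  })
}

# a whole distribution of R^2 values rather than a single number
R2_by_draw(m7.1) %>% quantile(probs = c(0.055, 0.5, 0.945))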

More parameters (almost) always improve fit

Now fit the quadratic through sixth-order polynomial models.

# quadratic
m7.2 <- quap( 
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)), 
    mu <- a + b[1] * mass_std + b[2] * mass_std^2,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), 
  data = Brains, start = list(b = rep(0, 2)))

# cubic
m7.3 <- quap( 
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)), 
    mu <- a + b[1] * mass_std + b[2] * mass_std^2 + b[3] * mass_std^3,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), 
  data = Brains, start = list(b = rep(0, 3)))

# fourth-order
m7.4 <- quap( 
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)), 
    mu <- a + b[1] * mass_std + b[2] * mass_std^2 + b[3] * mass_std^3 + b[4] * mass_std^4,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), 
  data = Brains, start = list(b = rep(0, 4)))

# fifth-order
m7.5 <- quap( 
  alist(
    brain_std ~ dnorm(mu, exp(log_sigma)), 
    mu <- a + b[1] * mass_std + b[2] * mass_std^2 + b[3] * mass_std^3 + b[4] * mass_std^4 + b[5] * mass_std^5,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10),
    log_sigma ~ dnorm(0, 1)
  ), 
  data = Brains, start = list(b = rep(0, 5)))

# sixth-order
# note: as in the text, sigma is replaced here by the constant 0.001;
# the model will not fit otherwise
m7.6 <- quap( 
  alist(
    brain_std ~ dnorm(mu, 0.001), 
    mu <- a + b[1] * mass_std + b[2] * mass_std^2 + b[3] * mass_std^3 + b[4] * mass_std^4 + b[5] * mass_std^5 + b[6] * mass_std^6,
    a ~ dnorm(0.5, 1),
    b ~ dnorm(0, 10)
  ), 
  data = Brains, start = list(b = rep(0, 6)))
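
Before plotting, we can already see the section title in action. Reusing our R2_is_bad() function on each fit (the exact numbers depend on the seed baked into that function), \(R^2\) climbs toward 1 as the polynomial degree grows:

tibble(model  = paste0("m7.", 1:6),
       degree = 1:6,
       R2     = c(R2_is_bad(m7.1), R2_is_bad(m7.2), R2_is_bad(m7.3),
                  R2_is_bad(m7.4), R2_is_bad(m7.5), R2_is_bad(m7.6)))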

Lots of Plots

Here’s how we might plot the result for the first model, m7.1.

library(tidybayes)

mass_seq <- seq(from = -2, to = 2, length.out = 100) 

L <-
  link(m7.1, data = list(mass_std = mass_seq)) %>% 
  apply(2, mean_qi, .width = 0.9) %>%
  bind_rows() %>%
  mutate(mass_std = mass_seq)

gf_ribbon(ymin + ymax ~ mass_std, data = L) %>%
  gf_line(y ~ mass_std, data = L) %>%
  gf_point(brain_std ~ mass_std, data = Brains, inherit = FALSE) %>%
  gf_labs(
    subtitle = bquote(italic(R)^2==.(round(R2_is_bad(m7.1), digits = 2))),
    x = "body mass (standardized)",
    y = "brain volume (standardized)") 

To slightly repurpose a quote from McElreath:

We’ll want to do this for the next several models, so let’s write a function to make it repeatable. If you find yourself writing code more than once, it is usually saner to write a function and call the function more than once instead. (p. 197)

Our make_figure7.3() function will wrap the simulation, data wrangling, and plotting code all in one. It takes two arguments, the first of which defines which fit object we’d like to plot. If you look closely at Figure 7.3 in the text, you’ll notice that the range of the \(y\)-axis changes in the last three plots. Our second argument, ylim, will allow us to vary those parameters across subplots.

make_figure7.3 <- function(quap_fit, ylim = c(0, 1)) {  # default could also be range(Brains$brain_std)
  
  title <- deparse(substitute(quap_fit))
  # compute the R2
  r2 <- R2_is_bad(quap_fit)
  
  # define the mass_seq 
  mass_seq <- seq(from = -2, to = 2, length.out = 100) 
  
  L <-
    link(quap_fit, data = list(mass_std = mass_seq)) %>% 
    apply(2, mean_qi, .width = 0.9) %>%
    bind_rows() %>%
    mutate(mass_std = mass_seq) 
  
  gf_ribbon(ymin + ymax ~ mass_std, data = L) %>%
    gf_line(y ~ mass_std, data = L) %>%
    gf_point(brain_std ~ mass_std, data = Brains, inherit = FALSE) %>%
    gf_labs(
      title = title,
      subtitle = bquote(italic(R)^2==.(round(r2, digits = 2))),
      x = "body mass (standardized)",
      y = "brain volume (standardized)") %>%
    gf_refine(
      scale_y_continuous(limits = ylim),
      scale_x_continuous(limits = c(-1.2, 1.5)) 
    )
}

Here we make and save the six subplots in bulk.

p1 <- make_figure7.3(m7.1, ylim = c(-0.5, 1.5))
p2 <- make_figure7.3(m7.2, ylim = c(-0.5, 1.5))
p3 <- make_figure7.3(m7.3, ylim = c(-0.5, 1.5))
p4 <- make_figure7.3(m7.4, ylim = c(-0.5, 1.5))
p5 <- make_figure7.3(m7.5, ylim = c(-0.5, 1.5))
p6 <- make_figure7.3(m7.6, ylim = c(-0.5, 1.5)) %>%
  gf_hline(yintercept = ~ 0, color = "red", linetype = 2)

Now use patchwork syntax to bundle them all together.

library(patchwork)

((p1 | p2) / (p3 | p4) / (p5 | p6)) +
  plot_annotation(title = "Figure 7.3. Polynomial linear models of increasing\ndegree for the hominin data.")

“This is a general phenomenon: If you adopt a model family with enough parameters, you can fit the data exactly. But such a model will make rather absurd predictions for yet-to-be-observed cases” (pp. 199–201).

Note: Not only do the overfit models start doing crazy things, they also become very confident in their predictions – what the model thinks and reality do not always match.

Too few parameters hurts, too.

The overfit polynomial models fit the data extremely well, but they suffer for this within-sample accuracy by making nonsensical out-of-sample predictions. In contrast, underfitting produces models that are inaccurate both within and out of sample. They learn too little, failing to recover regular features of the sample. (p. 201, emphasis in the original)

To explore the distinctions between overfitting and underfitting, we’ll need to refit m7.1 and m7.4 several times after serially dropping one of the rows in the data. You can filter() by row_number() to drop rows in a tidyverse kind of way. For example, we can drop the second row of Brains like this.

 Brains %>%
  mutate(row = 1:n()) %>% 
  filter(row_number() != 2)
## # A tibble: 6 x 6
##   species     brain  mass mass_std brain_std   row
##   <chr>       <dbl> <dbl>    <dbl>     <dbl> <int>
## 1 afarensis     438  37     -0.779     0.324     1
## 2 habilis       612  34.5   -1.01      0.453     3
## 3 boisei        521  41.5   -0.367     0.386     4
## 4 rudolfensis   752  55.5    0.917     0.557     5
## 5 ergaster      871  61      1.42      0.645     6
## 6 sapiens      1350  53.5    0.734     1         7

In his Overthinking: Dropping rows box (p. 202), McElreath encouraged us to take a look at the brain_loo_plot() function to get a sense of how he made his Figure 7.4. Here it is.

brain_loo_plot
function (fit, atx = c(35, 47, 60), aty = c(450, 900, 1300), 
    xlim, ylim, npts = 100) 
{
    post <- extract.samples(fit)
    n <- dim(post$b)[2]
    if (is.null(n)) 
        n <- 1
    if (missing(xlim)) 
        xlim <- range(d$mass_std)
    else xlim <- (xlim - mean(d$mass))/sd(d$mass)
    if (missing(ylim)) 
        ylim <- range(d$brain_std)
    else ylim <- ylim/max(d$brain)
    plot(d$brain_std ~ d$mass_std, xaxt = "n", yaxt = "n", xlab = "body mass (kg)", 
        ylab = "brain volume (cc)", col = rangi2, pch = 16, xlim = xlim, 
        ylim = ylim)
    axis_unscale(1, atx, d$mass)
    axis_unscale(2, at = aty, factor = max(d$brain))
    d <- as.data.frame(fit@data)
    for (i in 1:nrow(d)) {
        di <- d[-i, ]
        m_temp <- quap(fit@formula, data = di, start = list(b = rep(0, 
            n)))
        xseq <- seq(from = xlim[1] - 0.2, to = xlim[2] + 0.2, 
            length.out = npts)
        l <- link(m_temp, data = list(mass_std = xseq), refresh = 0)
        mu <- apply(l, 2, mean)
        lines(xseq, mu, lwd = 2, col = col.alpha("black", 0.3))
    }
    model_name <- deparse(match.call()[[2]])
    mtext(model_name, adj = 0)
}
<bytecode: 0x1110d81a0>
<environment: namespace:rethinking>

Though we’ll be taking a slightly different route than the one outlined in McElreath’s brain_loo_plot() function, we can glean some great insights. For example, here’s a tricky way to update a quap() model.

m7.1.1 <- 
  quap(m7.1@formula,
       data = filter(Brains, row_number() != 1))

precis(m7.1.1)
##                 mean         sd        5.5%      94.5%
## a          0.5423832 0.08033909  0.41398587  0.6707806
## b          0.1544341 0.08501069  0.01857057  0.2902976
## log_sigma -1.6318618 0.32048259 -2.14405492 -1.1196688

You can see by the data statement that m7.1.1 is fit on the Brains data after dropping the first row. Here’s how we might amend our plotting strategy from before to visualize the posterior mean for the model-implied trajectory.

link(m7.1.1, data = list(mass_std = mass_seq)) %>% 
  apply(2, mean_hdi) %>%
  bind_rows() %>%
  mutate(mass_std = mass_seq, row = 1) %>%
  gf_line(y ~ mass_std, size = 1/2, alpha = 1/2) %>%
  gf_point(brain_std ~ mass_std, data = Brains) %>%
  gf_labs(subtitle = "m7.1.1",
       x = "body mass (std)",
       y = "brain volume (std)") %>%
    gf_refine(
      coord_cartesian(xlim = range(Brains$mass_std),
                      ylim = range(Brains$brain_std))
    )

Now we can build off of those sensibilities to make an alternative to McElreath’s brain_loo_plot. Our brain_loo_lines() function will refit a model after dropping the row we define in the row argument. After fitting the model, it then extracts the fitted line over the previously-defined values of mass_seq.

brain_loo_lines <- function(quap_fit, row) {
 
  # determine the order of the polynomial
  n <- extract.samples(quap_fit)$b %>% ncol()
  if (is.null(n)) 
    n <- 1
  
  # refit the model
  fit <- quap(quap_fit@formula,
              data = filter(Brains, row_number() != row),
              start = list(b = rep(0, times = n)))
  
  # pull the lines values
  link(fit, data = list(mass_std = mass_seq)) %>% 
    apply(2, mean) %>%
    tibble(mean_brain_std = .) %>%
    mutate(mass_std = mass_seq, row = row) 
}

Here’s how brain_loo_lines() works.

brain_loo_lines(m7.1, row = 1) %>% 
  glimpse()
## Rows: 100
## Columns: 3
## $ mean_brain_std <dbl> 0.2476051, 0.2536389, 0.2596727, 0.2657065, 0.2717403, 0.2777741, 0.2838079…
## $ mass_std       <dbl> -2.0000000, -1.9595960, -1.9191919, -1.8787879, -1.8383838, -1.7979798, -1.…
## $ row            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Working within the tidyverse paradigm, we’ll use purrr::map() to plug the row values 1 through 7 into brain_loo_lines(), which returns the desired posterior mean values for each corresponding value of mass_std. Then we pipe those values into our ggformula plotting code and customize the appearance of the plots from there.

fig7.4 <- function(quap_fit) {
  title <- deparse(substitute(quap_fit))
  
  purrr::map(1:7, ~brain_loo_lines(quap_fit = quap_fit, row = .)) %>% 
  bind_rows() %>% 
    gf_line(mean_brain_std ~ mass_std, color = ~row, group = ~ row,
            size = 1/2, alpha = 1/2) %>%
    gf_point(brain_std ~ mass_std, data = Brains, color = ~(1:nrow(Brains)), 
             inherit = FALSE) %>%
    gf_labs(subtitle = title,
       x = "body mass (standardized)",
       y = "brain volume (standardized)") %>%
    gf_refine(
      coord_cartesian(xlim = range(Brains$mass_std),
                      ylim = range(Brains$brain_std))
    )
}

p1 <- fig7.4(quap_fit = m7.1)
p2 <- fig7.4(quap_fit = m7.4)
p1 + p2

Notice that the straight lines hardly vary, while the curves fly about wildly. This is a general contrast between underfit and overfit models: sensitivity to the exact composition of the sample used to fit the model (p. 201).

Where do we go from here?

We would like a way to estimate how well a model might do on new data (different from the data used to train the model, but drawn from the same population or process).
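
As a rough preview of where this is headed (just a sketch of the leave-one-out idea, not McElreath’s lppd/WAIC machinery from later in the chapter), we could refit m7.1 seven times, each time holding one species out, and ask how far each held-out point falls from the refit model’s mean prediction:

loo_residual <- purrr::map_dbl(1:nrow(Brains), function(i) {
  fit_i <- quap(m7.1@formula, data = filter(Brains, row_number() != i))
  mu_i  <- mean(link(fit_i, data = Brains[i, ], refresh = 0))
  Brains$brain_std[i] - mu_i    # out-of-sample residual for species i
})

sd(loo_residual)   # typically larger than the within-sample residual spread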