Code from Statistical Rethinking, modified by R Pruim, is shown below. The most visible difference from the original is that plots are drawn with the ggformula package.

Overfitting

Brains Data

This small data set gives the brain volume (cc) and body mass (kg) for several species. It is used to illustrate a very bad idea – improving the “fit” by increasing the degree of the polynomial used to model the relationship between brain size and body mass.

R code 6.1

Brains <- 
  data.frame(
    species =  c("afarensis", "africanus", "habilis", "boisei",
                 "rudolfensis", "ergaster", "sapiens"),
    brain_size = c(438, 452, 612, 521, 752, 871, 1350),
    body_mass =  c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5)
  )
gf_point(brain_size ~ body_mass, data = Brains, 
         size = 2, color = "red", alpha = 0.6, verbose = TRUE)  %>%
  gf_text(brain_size ~ body_mass + label:species, alpha = 0.8, 
          color = "navy", size = 3, angle = 30)
## ggplot(data = Brains) + 
##   geom_point(aes(y = brain_size, x = body_mass), size = 2, color = "red", alpha = 0.6)

R code 6.2

See Chapter 5 for more about lm() (and Bayesian versions of lm()). The model being fit below is

\[\begin{align*} \mbox{brain\_size} & \sim \mathrm{Norm}(\mu, \sigma) \\ \mu & = a + b \cdot \mbox{body\_mass} \end{align*}\]

There are no priors because lm() isn’t using a Bayesian approach.
We’re using lm() here because it is faster to fit and the syntax is terser, but the same principle would be illustrated if we used a Bayesian linear model instead.

m6.1 <- lm(brain_size ~ body_mass, data = Brains)

(Note: lm() fits the parameters using “maximum likelihood”. You can think of this as using uniform priors on the coefficients, which means that the posterior is proportional to the likelihood, and maximum likelihood estimates are the same as the MAP estimates. The estimate for \(\sigma\) that lm() uses is modified to make it an unbiased estimator.)
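We can check that last point numerically (a quick sketch; sigma_mle and sigma_unbiased are names introduced here, not from the original):

n <- nrow(Brains)
# the maximum likelihood estimate divides the residual sum of squares by n
sigma_mle      <- sqrt(sum(resid(m6.1)^2) / n)
# lm() divides by the residual degrees of freedom (n - 2 here) instead
sigma_unbiased <- sqrt(sum(resid(m6.1)^2) / (n - 2))
c(mle = sigma_mle, lm = sigma_unbiased, summary = summary(m6.1)$sigma)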

Measuring fit with \(r^2\)

R code 6.3

\(r^2\) can be defined in several equivalent ways. Here we use the “proportion of variation that is ‘explained’ by the model”: the residual variance is the unexplained part, and the variance of the response variable is the denominator.

1 - var(resid(m6.1)) / var(Brains$brain_size)
## [1] 0.490158
rsquared(m6.1)
## [1] 0.490158

R code 6.4

This model uses a quadratic relationship.

m6.2 <- lm(brain_size ~ poly(body_mass,2), data = Brains)
rsquared(m6.2)
## [1] 0.5359967

R code 6.5

We can use any degree polynomial in the same way.

m6.1 <- lm(brain_size ~ poly(body_mass, 1), data = Brains)
m6.2 <- lm(brain_size ~ poly(body_mass, 2), data = Brains)
m6.3 <- lm(brain_size ~ poly(body_mass, 3), data = Brains)
m6.4 <- lm(brain_size ~ poly(body_mass, 4), data = Brains)
m6.5 <- lm(brain_size ~ poly(body_mass, 5), data = Brains)
m6.6 <- lm(brain_size ~ poly(body_mass, 6), data = Brains)

poly(body_mass, k) creates a degree \(k\) polynomial in body_mass (but parameterized in a special way that makes some kinds of statistical analysis easier – we are not concerned with the particular parameterization here, just the overall model fit).
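To see that the parameterization doesn’t affect the fit, we can compare the orthogonal version to a raw one (a quick sketch using poly()’s raw argument; m_orth and m_raw are names introduced here):

m_orth <- lm(brain_size ~ poly(body_mass, 2), data = Brains)
m_raw  <- lm(brain_size ~ poly(body_mass, 2, raw = TRUE), data = Brains)
# the coefficients differ, but the fitted values (and hence r-squared) agree
all.equal(fitted(m_orth), fitted(m_raw))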

R code 6.6

And finally, here is a degree 0 polynomial (a constant).

m6.7 <- lm(brain_size ~ 1, data = Brains)
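An intercept-only model predicts the mean response everywhere, so it explains none of the variance (a quick check):

coef(m6.7)       # the lone coefficient is just the mean brain size
rsquared(m6.7)   # 0: no variation explained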

Leave One Out Analysis

R code 6.7

Here’s how you remove one row from a data set.

Brains.new <- Brains[-2, ]

One simple version of cross-validation is to fit the model several times, but each time leaving out one observation (hence the name “leave one out”). We can compare these models to each other to see how stable/volatile the model fits are and to see how well the “odd one out” is predicted from the remaining observations.
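Here is a minimal numeric sketch of this idea (loo_sse is a helper introduced here, not part of the original code): fit on all but one point, predict the held-out point, and total the squared prediction errors.

# leave-one-out sum of squared prediction errors for a degree d polynomial
loo_sse <- function(d) {
  errs <- sapply(1:nrow(Brains), function(i) {
    fit <- lm(brain_size ~ poly(body_mass, d), data = Brains[-i, ])
    Brains$brain_size[i] - predict(fit, newdata = Brains[i, ])
  })
  sum(errs^2)
}
sapply(1:4, loo_sse)   # compare degrees; higher degrees tend to do worse out of sample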

R code 6.8

leave_one_out <-
  function(index = 1, degree = 1, ylim = c(0, NA)) {
    for(i in index)
      for(d in degree) {
        gf_point(brain_size ~ body_mass + color:remove, 
                 data = Brains %>% mutate(remove = 1:nrow(Brains) %in% i)) %>%
          gf_smooth(brain_size ~ body_mass, formula = y ~ poly(x, d), 
                    data = Brains[-i, ], method = "lm") %>% 
          gf_labs(title = paste("removed:", i, " ;  degree =", d)) %>%
          gf_lims(y = ylim) %>%
          print()
      }
  }

The simple linear model changes only slightly when we remove each data point (although the model’s uncertainty decreases quite a bit when we remove the data point that is least like the others).

leave_one_out(1:nrow(Brains), degree = 1, ylim = c(-2200, 4000))

Cubic models and their uncertainties change more – they are more sensitive to the data.

leave_one_out(1:nrow(Brains), degree = 3, ylim = c(-2200, 4000))

With a 5th degree polynomial (6 coefficients), the fit to the six data points is “perfect”, but highly volatile. The model has no uncertainty, but it is overfitting and overconfident.
The fit to the omitted point might not be very reliable.

leave_one_out(1:nrow(Brains), degree = 5, ylim = c(-2200, 4000))
## Warning in qt((1 - level)/2, df): NaNs produced

(This warning appears once for each of the seven plots: with 6 coefficients and 6 data points there are no residual degrees of freedom, so qt() cannot compute a confidence band.)

Measuring fit

\(R^2\)

Using \(R^2\) alone as a measure of fit has the problem that \(R^2\) increases as we add complexity to the model, which pushes us toward overfitting.

Brains.R2 <-
  data_frame(
    degree = 0:6,
    R2 = sapply( list(m6.7, m6.1, m6.2, m6.3, m6.4, m6.5, m6.6), rsquared)
  )
Brains.R2
degree        R2
     0 0.0000000
     1 0.4901580
     2 0.5359967
     3 0.6797736
     4 0.8144339
     5 0.9888540
     6 1.0000000
gf_point(R2 ~ degree, data = Brains.R2) %>%
  gf_line(R2 ~ degree, data = Brains.R2, alpha = 0.5) %>%
  gf_labs(x = "degree of polynomial", y = expression(R^2))

There are ways to adjust \(R^2\) to reduce this problem, but we are going to introduce other methods of measuring fit.
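As a quick aside, lm() already reports one such adjustment: adjusted \(R^2\), which penalizes additional coefficients and so can decrease as terms are added.

# adjusted R^2 for the degree 1-5 fits
sapply(list(m6.1, m6.2, m6.3, m6.4, m6.5),
       function(m) summary(m)$adj.r.squared)
# (degree 6 is omitted: with zero residual degrees of freedom its adjustment is undefined)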

gf_point(R2 ~ degree + color::factor(degree), data = Brains.R2) %>%
  gf_line(R2 ~ degree, data = Brains.R2, alpha = 0.5) %>%
  gf_segment(R2 + 1 ~ degree + 7 + color::factor(degree), 
             data = Brains.R2, alpha = 0.3) %>%
  gf_labs(x = "degree of polynomial", y = expression(R^2))

Weather Prediction Accuracy

Consider the predictions of two weather people over the same set of 10 days. Which one did a better job of predicting? How should we measure this?

  • First Weather Person: predicts rain with probability 1 on each of the 3 days it rains and probability 0.6 on each of the 7 days that turn out sunny.

  • Second Weather Person: predicts rain with probability 0 on all 10 days.

Last time we discussed some ways to compare which weather person makes the best predictions. Here is one more: Given each weather person’s “model” as a means of generating data, which one makes the observed weather most likely? Now weather person 1 wins handily:

# WP #1
1^3 * 0.4^7
## [1] 0.0016384
# WP #2 -- no chance!
0^3 * 1^7
## [1] 0

This has two advantages for us:

  1. This is just the likelihood, an important part of our Bayesian modeling system.

  2. It is based on joint probability rather than average probability. Weather person 2 is taking unfair advantage of average probability by making predictions we know are “impossible”.
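In practice we compare such joint probabilities on the log scale, where the products become sums (a quick sketch of the same computation):

# WP #1: log-likelihood of 3 (certain) rainy days and 7 sunny days
3 * log(1) + 7 * log(0.4)
# WP #2: a single "impossible" day makes the log-likelihood -Inf
3 * log(0) + 7 * log(1)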

Comparing Milk Models

R code 6.21

data(milk)
MilkCC <- milk %>% filter(complete.cases(.)) %>%
  mutate(
    neocortex = neocortex.perc / 100
  )
dim(MilkCC)
## [1] 17  9

R code 6.22

a.start <- mean(MilkCC$kcal.per.g)
sigma.start <- log(sd(MilkCC$kcal.per.g))
m6.11 <- map(
  alist(kcal.per.g ~ dnorm(a, exp(log.sigma))),
  data = MilkCC,
  start = list(a = a.start, log.sigma = sigma.start)
)
m6.12 <- map(
  alist(kcal.per.g ~ dnorm(mu, exp(log.sigma)),
        mu <- a + bn * neocortex),
  data = MilkCC,
  start = list(a = a.start, bn = 0, log.sigma = sigma.start)
)
m6.13 <- map(
  alist(kcal.per.g ~ dnorm(mu, exp(log.sigma)),
        mu <- a + bm * log(mass)),
  data = MilkCC,
  start = list(a = a.start, bm = 0, log.sigma = sigma.start)
)
m6.14 <- map(
  alist(kcal.per.g ~ dnorm(mu, exp(log.sigma)),
        mu <- a + bn * neocortex + bm * log(mass)),
  data = MilkCC,
  start = list(
    a = a.start,
    bn = 0,
    bm = 0,
    log.sigma = sigma.start
  )
)
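One detail worth noting in all four models: we estimate log.sigma and wrap it in exp(), so the optimizer can explore all real numbers while sigma itself stays positive. A one-line illustration:

exp(c(-3, 0, 2))   # any real-valued log.sigma maps to a positive sigma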

WAIC

R code 6.23

WAIC(m6.14)
## Constructing posterior predictions
## [ 1000 / 1000 ]
## [1] -15.14654
## attr(,"lppd")
## [1] 12.37018
## attr(,"pWAIC")
## [1] 4.79691
## attr(,"se")
## [1] 7.509313
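The pieces are related by \(\mathrm{WAIC} = -2\,(\mathrm{lppd} - \mathrm{pWAIC})\), which we can verify from the attributes above:

-2 * (12.37018 - 4.79691)   # matches the WAIC value reported above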

R code 6.24

(milk.models <- compare(m6.11, m6.12, m6.13, m6.14))
##        WAIC pWAIC dWAIC weight   SE  dSE
## m6.14 -14.1   5.3   0.0   0.88 8.09   NA
## m6.13  -8.5   2.7   5.6   0.05 5.46 5.58
## m6.11  -8.4   1.8   5.7   0.05 4.47 7.65
## m6.12  -5.9   3.1   8.2   0.01 4.37 8.05
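The weight column is derived from dWAIC: each model gets relative weight \(e^{-\mathrm{dWAIC}/2}\), normalized so the weights sum to 1. A quick check against the table:

dWAIC <- c(0, 5.6, 5.7, 8.2)
round(exp(-dWAIC / 2) / sum(exp(-dWAIC / 2)), 2)   # 0.88 0.05 0.05 0.01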

R code 6.25

plot(milk.models, SE = TRUE, dSE = TRUE)

R code 6.26

This simulates the WAIC difference between two models as a normal distribution (mean 6.7, an estimated dWAIC; standard deviation 7.26, its standard error – these values come from a different run, so they differ slightly from the table above) and estimates the probability that the difference is actually negative, i.e., that the ranking could reverse.

diff <- rnorm(1e5, 6.7, 7.26)   # simulated draws of the WAIC difference
sum(diff < 0) / 1e5             # estimated probability the sign flips
## [1] 0.17553

R code 6.27

coeftab(m6.11, m6.12, m6.13, m6.14)
##           m6.11   m6.12   m6.13   m6.14  
## a            0.66    0.35    0.71   -1.09
## log.sigma   -1.79   -1.80   -1.85   -2.16
## bn             NA    0.45      NA    2.79
## bm             NA      NA   -0.03   -0.10
## nobs           17      17      17      17
coeftab(m6.11, m6.12, m6.13, m6.14, se = TRUE)
##              m6.11   m6.12   m6.13   m6.14  
## a               0.66    0.35    0.71   -1.09
## log.sigma      -1.79   -1.80   -1.85   -2.16
## bn                NA    0.45      NA    2.79
## bm                NA      NA   -0.03   -0.10
## a.se            0.04    0.47    0.05    0.47
## log.sigma.se    0.17    0.17    0.17    0.17
## bn.se             NA    0.69      NA    0.73
## bm.se             NA      NA    0.02    0.02
## nobs              17      17      17      17

R code 6.28

plot(coeftab(m6.11, m6.12, m6.13, m6.14))

With a little work, we can roll our own plot. The main advantage is that we can use a different scale for each parameter. This is a bit clunky because the coeftab object is stored awkwardly. (Looks like another item for the rethinking package to-do list.)

CT <- coeftab(m6.11, m6.12, m6.13, m6.14)
CTc <- CT@coefs
CTse <- CT@se
colnames(CTse) <- colnames(CTc)

CTd <-
  left_join(
    as.data.frame(CTc) %>%
      mutate(param = row.names(CTc)) %>%
      tidyr::gather(model, estimate, 1:ncol(CTc)),
    as.data.frame(CTse) %>%
      mutate(param = row.names(CTse)) %>%
      tidyr::gather(model, se, 1:ncol(CTse))
  )
## Joining, by = c("param", "model")
gf_hline(yintercept = 0, color = "red",  alpha = 0.4) %>%
  gf_pointrange( 
    estimate + (estimate - se) + (estimate + se) ~ model,
    data = CTd) %>%
  gf_facet_wrap(~ param, scales = "free") %>%
  gf_refine(coord_flip())
## Warning: Removed 4 rows containing missing values (geom_pointrange).

This plot provides a much better view of parameters on different scales.

R code 6.29

# compute counterfactual predictions
# neocortex from 0.5 to 0.8
Milk.predict <- data_frame(
  neocortex = seq(from = 0.5, to = 0.8, length.out = 30),
  # kcal.per.g = 0,    # this isn't needed.  not sure why author does it
  mass = 4.5  # average mass
)
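# link() samples from the posterior and computes mu for each row of
# Milk.predict, returning a matrix (one row per sample, one column per case)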
pred.m6.14 <- link(m6.14, data = Milk.predict)
## [ 1000 / 1000 ]
Milk.predict <-
  Milk.predict %>%
  mutate(
    mu =  apply(pred.m6.14, 2, mean),
    mu.PIlo = apply(pred.m6.14, 2, PI)[1,],
    mu.PIhi = apply(pred.m6.14, 2, PI)[2,]
    )

# plot it all
gf_point(kcal.per.g ~ neocortex, data = MilkCC, color = rangi2) %>%
  gf_line(mu ~ neocortex, data = Milk.predict, color = "red") %>%
  gf_ribbon(mu.PIlo + mu.PIhi ~ neocortex, data = Milk.predict)

R code 6.30

Milk.ensemble <-
  ensemble(m6.11, m6.12, m6.13, m6.14, data = Milk.predict)
## Constructing posterior predictions
## [ 1000 / 1000 ]
## Constructing posterior predictions
## [ 1000 / 1000 ]
## Constructing posterior predictions
## [ 1000 / 1000 ]
## Constructing posterior predictions
## [ 1000 / 1000 ]
Milk.predict <-
  Milk.predict %>%
  mutate(
    mu.ens =  apply(Milk.ensemble$link, 2, mean),
    mu.ens.lo =  apply(Milk.ensemble$link, 2, PI)[1,],
    mu.ens.hi =  apply(Milk.ensemble$link, 2, PI)[2,]
    )
gf_line(mu.ens ~ neocortex, data = Milk.predict) %>%
  gf_ribbon(mu.ens.lo + mu.ens.hi ~ neocortex, data = Milk.predict)

R code 6.31

library(rethinking)
data(Howell1)
Howell <- 
  Howell1 %>% 
  mutate(
    age.s = zscore(age)
  )
set.seed(1000)     # so we all get the same "random" data sets
train <- sample(1:nrow(Howell), size = nrow(Howell) / 2)  # half of the rows
Howell.train <- Howell[train, ]   # put half in training set
Howell.test <- Howell[-train, ]   # the other half in test set

R code 6.32

sum(dnorm(Howell.test$height, mu, sigma, log = TRUE))
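As written, this line needs mu and sigma to come from a model fit to the training data. Here is a minimal sketch of the full train-then-score step (using lm() for brevity; fit6 is a name introduced here, standing in for whatever training-set model you choose):

fit6 <- lm(height ~ age.s, data = Howell.train)   # fit on the training half
mu <- predict(fit6, newdata = Howell.test)        # predicted means for the test cases
sigma <- summary(fit6)$sigma                      # residual standard deviation
# total log-likelihood of the test data; -2 times this is the test deviance
sum(dnorm(Howell.test$height, mu, sigma, log = TRUE))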