Come up with something better than \(R^2\) for assessing how well our model is “working” (where working = predicting new responses).
Understand these new measures well enough to interpret them and use them to compare models.
This material is based on Rethinking, Chapter 7.
\(R^2\) has two basic problems.
It measures the wrong thing.
\(R^2\) compares model predictions to the data used to train the model. This is referred to as in-sample prediction or retrodiction. This has two downsides:
We really want to know how well we can expect our model to perform on new data, not on the current data.
It rewards complex models, and hence over-fitting.
As we add parameters to our model, the model will become more flexible and better able to fit the data, so complex models win the \(R^2\) competition, even when they make silly predictions due to overfitting.
It measures the wrong way.
\(R^2\) comes from the frequentist world of multiple regression. We would prefer a measure that is more in tune with Bayesian methods and more appropriate for a wider range of models.
Squares of residuals are not the best measure for all models – sometimes it isn’t even clear how one would calculate squares of residuals. Our new measures will be built on something more fundamental to the model.
You may have seen adjusted \(R^2\) in R output from other classes. It’s a crude way of dealing with the fact that \(R^2\) tends to get higher whenever we add more predictors. It’s based on the idea that even random, useless predictors increase \(R^2\) by a predictable amount; add enough of them and the model fits the data perfectly.
Recall that \[ 1 - R^2 = \frac{\operatorname{Var}(r)}{\operatorname{Var}(y)} = \frac{\operatorname{Var}(\mbox{residuals})}{\operatorname{Var}(\mbox{observed response})} \] If we have just an intercept, then \(\operatorname{Var}(r) \approx \operatorname{Var}(y)\) because \(\hat{y} \approx \overline{y}\), so \(1-R^2 \approx 1\). If we have \(n\) observations and add \(k\) random (useless) predictors, then just by chance we expect roughly \[ 1 - R^2 \approx \frac{n-k-1}{n-1}. \]
Comparing the unexplained fraction for our model to this expected-by-chance baseline gives
\[ \mbox{adjusted } (1 - R^2) = \frac{1-R^2}{\frac{(n-k-1)}{(n-1)}} = (1-R^2) \frac{(n-1)}{(n-k-1)} \]
So
\[ \mbox{adjusted } R^2 = 1 - (1-R^2) \frac{(n-1)}{(n-k-1)} \] where \(n\) is the number of observations and \(k\) is the number of predictors (so the model has \(k+1\) parameters, counting the intercept).
This adjustment essentially introduces a penalty for each parameter and helps avoid rewarding overfitting. But we can do better than \(R^2\), even the adjusted version.
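As a quick sanity check of the formula, here is a small example of my own (it uses the built-in mtcars data, which is not part of these notes): compute adjusted \(R^2\) by hand and compare it to what summary() reports for an lm() fit.
model <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2                                  # number of predictors (wt and hp)
r2 <- summary(model)$r.squared
1 - (1 - r2) * (n - 1) / (n - k - 1)    # adjusted R^2 by hand
summary(model)$adj.r.squared            # should agree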
Our goal is to come up with a better way to assess our models, something better than (adjusted) \(R^2\). It will be a bit of a journey. Here’s the road map:
\[ \mbox{Information} \to \mbox{Entropy} \to \mbox{Divergence} \to \mathrm{lppd} \to \mbox{PSIS and WAIC} \]
PSIS (Pareto-smoothed importance sampling, applied here to leave-one-out cross-validation) and WAIC (the Widely Applicable Information Criterion) are where we are heading. For now, you can think of them as improvements on (adjusted) \(R^2\) that work better and for a wider range of Bayesian models. Details to come.
Intuition:
Let’s begin by considering the amount of information we gain when we observe some random process.
Suppose that the outcome we observed has probability \(p\). Let \(I(p)\) be the amount of information we gain from observing this outcome. \(I(p)\) depends on \(p\) but not on the outcome itself, and should satisfy the following properties.
1. \(I(1) =\)
2. \(I(0) =\)
3. Between \(p = 0\) and \(p = 1\), \(I()\) is
4. \(I(p_1 p_2) =\)
1. \(I(1) = 0\)
2. \(I(0) = \mbox{undefined } (\infty?)\)
3. Between \(p = 0\) and \(p = 1\), \(I()\) is decreasing and continuous.
4. \(I(p_1 p_2) = I(p_1) + I(p_2)\)
These properties should remind you of a function you have seen before.
We get a function with the desired properties if we define
\[ I(p) = - \log_{b}(p) \]
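A quick numerical check (my own, not from the notes) that \(I(p) = -\log_b(p)\) has the additivity property listed above; the helper name I() is just for illustration.
I <- function(p, base = 2) - log(p, base = base)
I(1/2) + I(1/6)   # info from a coin toss plus info from a die roll
I(1/2 * 1/6)      # info from observing both at once: the same value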
Now consider a random process \(X\) with \(n\) outcomes having probabilities \(\mathbf{p} = p_1, p_2, \dots, p_n\). That is, \[ P(X = x_i) = p_i, \]
The Shannon entropy (denoted \(H(X)\) or \(H(\mathbf{p})\)) is the average amount of information gained from each observation of the random process:
\[ H(X) = H(\mathbf{p}) = \sum p_i \cdot I(p_i) = - \sum p_i \log(p_i) \]
Compute the entropy of each of the following random processes
A “random” process \(X\) that always yields the same result
p <- 1
- sum(p * log2(p))
## [1] 0
A single toss of a fair coin
p <- c(1/2, 1/2)
- sum(p * log2(p))
## [1] 1
Two tosses of a fair coin
p <- c(1/4, 1/4, 1/4, 1/4)
- sum(p * log2(p))
## [1] 2
A single toss of a 30-70 coin
# less than a bit of entropy
p <- c(.3, .7)
- sum(p * log2(p))
## [1] 0.8812909
The roll of a die
p <- rep(1/6, 6)
- sum(p * log2(p))
## [1] 2.584963
In the text, the default is natural logarithms. In that case, the unit for entropy is called a nat. If we use base 2 logarithms, then the unit is called a bit or a Shannon.
\(H(X) \ge 0\)
Outcomes with probability 0 can be ignored.
\(H(X)\), like \(I(p_i)\), depends only on the probabilities, not on the outcomes themselves.
\(H\) is a continuous function.
Among all distributions with a fixed number of outcomes, \(H\) is maximized when the outcomes are equally likely.
Among equiprobable distributions, \(H\) increases as the number of outcomes increases.
\(H\) is additive in the following sense: if \(X\) and \(Y\) are independent random processes, then \(H(X, Y) = H(X) + H(Y)\). (This is why two tosses of a fair coin have entropy \(1 + 1 = 2\) shannons.)
\(H\) can be thought of as a measure of uncertainty. Uncertainty decreases as we make observations.
It’s pretty easy to write a function to compute entropy from a vector of probabilities.
H <- function(p, base = exp(1)) {
  p <- p / sum(p)   # rescale so probabilities add to 1
  p <- p[p != 0]    # drop zeros (0 * log(0) is taken to be 0; this avoids NaN)
  - sum(p * log(p, base = base))
}
# in nats
H(c(0.5, 0.5))
## [1] 0.6931472
H(c(0.3, 0.7))
## [1] 0.6108643
# in shannons
H(c(0.5, 0.5), base = 2)
## [1] 1
H(c(0.3, 0.7), base = 2)
## [1] 0.8812909
H_example <-
tibble(
p = seq(0, 1, by = 0.01),
H = purrr::map_dbl(p, ~ H(c(.x, 1 - .x)))
)
gf_line(H ~ p, data = H_example)
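A couple of quick numerical checks (mine, not from the notes) of the properties listed above, using the H() function just defined.
# the fair die has higher entropy than a loaded one (uniform maximizes H)
H(rep(1/6, 6), base = 2)
H(c(0.4, 0.3, 0.1, 0.1, 0.05, 0.05), base = 2)
# additivity: two independent fair coin tosses give 1 + 1 shannons
H(c(1/4, 1/4, 1/4, 1/4), base = 2)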
Kullback-Leibler (KL) divergence compares two distributions and asks “if we are anticipating \(\mathbf{q}\), but get \(\mathbf{p}\), how much more surprised will we be than if we had been expecting \(\mathbf{p}\) in the first place?”
Here’s the definition
\[\begin{align*} D_{KL}(\mathbf{p}, \mathbf{q}) &= \mathrm{expected\ difference\ in\ ``surprise"} \\ &= \sum p_i \left( I(q_i) - I(p_i) \right) \\ &= \sum p_i I(q_i) - \sum p_i I(p_i) \\ &= \sum p_i \log p_i - \sum p_i \log q_i \\ \end{align*}\]
This looks like the difference between two entropies. It almost is. The first one is actually a cross entropy where we use probabilities from one distribution and information from the other. We denote this \[ H(\mathbf{p}, \mathbf{q}) = \sum p_i I(q_i) = - \sum p_i \log(q_i) \] Note that \(H(\mathbf{p}) = H(\mathbf{p}, \mathbf{p})\), so \[\begin{align*} D_{KL} &= H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p}, \mathbf{p}) \\ &= H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p}) \end{align*}\]
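It is easy to turn these definitions into code. The helpers below are my own sketch (the names cross_entropy() and kl_div() are not from the notes); they assume p and q are probability vectors of the same length, and the last line checks that \(D_{KL}(\mathbf{p}, \mathbf{q}) = H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p})\) using the H() function defined above (natural logs in both).
# cross entropy: probabilities from p, information from q
cross_entropy <- function(p, q) - sum(p * log(q))
# KL divergence
kl_div <- function(p, q) sum(p * log(p / q))
p <- c(0.3, 0.7); q <- c(0.5, 0.5)
kl_div(p, q)
cross_entropy(p, q) - H(p)   # same thing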
Note: neither the cross entropy \(H(\mathbf{p}, \mathbf{q})\) nor the divergence \(D_{KL}(\mathbf{p}, \mathbf{q})\) is symmetric in \(\mathbf{p}\) and \(\mathbf{q}\). For example, for two “biased coins” with heads probabilities 0.7 and 0.01 (computed with the kl_div() helper sketched above):
kl_div(c(0.7, 0.3), c(0.01, 0.99))
## [1] 2.61577
kl_div(c(0.01, 0.99), c(0.7, 0.3))
## [1] 1.139498
The image below shows \(H(p, q)\) comparing tosses of “biased coins”. The points are at (0.05, 0.8) and (0.8, 0.05).
But if we want to know how well \(\mathbf{q}\) approximates an unknown \(\mathbf{p}\), we still have a problem – we don’t know \(\mathbf{p}\)! Time for the next leg of our journey.
Fortunately, it usually suffices to know which of two distributions approximates our target \(\mathbf{p}\) better (and by how much). And in the context of comparing models, there is a way to approximate the difference in fit without needing direct knowledge of \(\mathbf{p}\). So we can know (approximately) which attempt is closer to the target, and by how much, without knowing where (exactly) the target is. Of course, the target will matter some, so there must be a bit of a trick here. (There is: stay tuned.)
Suppose we want to know which of \(\mathbf{q}\) or \(\mathbf{r}\) is a better approximation to \(\mathbf{p}\). We might measure this with
\[\begin{align*} D_{KL}(\mathbf{p}, \mathbf{q}) - D_{KL}(\mathbf{p}, \mathbf{r}) & = H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p}, \mathbf{p}) - \left[ H(\mathbf{p}, \mathbf{r}) - H(\mathbf{p}, \mathbf{p}) \right] \\ & = H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p}, \mathbf{r}) \\ & = - \sum p_i \left( \log q_i - \log r_i\right) \end{align*}\]
If \(D_{KL}(\mathbf{p}, \mathbf{q})\) is small, then our model (\(\mathbf{q}\)) fits the data (\(\mathbf{p}\)) well.
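A quick numerical check (my own, using the helpers sketched above) that the \(H(\mathbf{p})\) term really does cancel out of the comparison:
p <- c(0.3, 0.7); q <- c(0.5, 0.5); r <- c(0.1, 0.9)
kl_div(p, q) - kl_div(p, r)
cross_entropy(p, q) - cross_entropy(p, r)   # identical; H(p) has cancelled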
First we need to define \(\mathbf{p}\) and \(\mathbf{q}\).
\(p_i = 1/n\) for each observation \(i\) in the data
This is an approximation based on our data: if the observed rows are a representative sample from the true distribution \(\mathbf{p}\), the empirical distribution (weight \(1/n\) on each row) can stand in for \(\mathbf{p}\), which eliminates the final occurrence of \(\mathbf{p}\).
\(q_i = \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s) =\) the model’s probability of observing row \(i\) of the data, averaged over \(S\) posterior samples \(\theta_1, \dots, \theta_S\).
Using this approximation we get our definition of deviance and log-pointwise-predictive density (lppd).
\[\begin{align*} \sum_{i = 1}^n p_i \log q_i &= \sum_{i = 1}^n \underbrace{\frac{1}{n}}_{\Large p_i} \log \underbrace{\frac1S \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s)}_{\Large q_i} \\ &= \frac 1n \underbrace{ \sum_{i = 1}^n \log \frac1S \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s)}_{\Large \mathrm{lppd}} \end{align*}\]
\[\begin{align*} \mathrm{lppd} &= \sum_{i = 1}^{n} \log \underbrace{ \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s) }_{\Large q_i} \\ \end{align*}\]
\[ -2 \cdot \mathrm{lppd}\approx -2 \cdot \mathrm{loglik}(\mathrm{model}) = \mathrm{Deviance} \]
Notes:
The difference between deviance and lppd is where the averaging happens: deviance plugs a single point estimate of the parameters into the log-likelihood, while lppd averages the likelihood over posterior samples before taking the log.
The scaling by \(-2\) is there for historical reasons, but is largely irrelevant to us. This is sometimes referred to as “the deviance scale”.
You will sometimes see \(-2 \cdot \mathrm{lppd}\) as well.
In either case, smaller values indicate a better (in-sample) fit.
How can you calculate deviance for the following model?
Brains <-
tibble(
species = c("afarensis", "africanus", "habilis", "boisei",
"rudolfensis", "ergaster", "sapiens"),
brain_size = c(438, 452, 612, 521, 752, 871, 1350),
body_mass = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5),
body_mass.s = zscore(body_mass)
)
m6.8 <- quap(
alist(brain_size ~ dnorm(mu, sigma),
mu <- a + b * body_mass.s),
data = Brains,
start = list(
a = mean(Brains$brain_size),
b = 0,
sigma = sd(Brains$brain_size)
),
method = "Nelder-Mead"
)
# extract QUAP estimates
theta <- coef(m6.8); theta
## a b sigma
## 713.6931 225.5774 212.9408
# compute deviance
dev <- (-2) * sum(dnorm(
Brains$brain_size,
mean = theta[1] + theta[2] * Brains$body_mass.s,
sd = theta[3],
log = TRUE
))
dev %>% setNames("dev") # setNames just labels the output
## dev
## 94.92499
-2 * logLik(m6.8) # for comparison
## 'log Lik.' 94.92499 (df=3)
# fit model with lm
m6.1 <- lm(brain_size ~ body_mass, data = Brains)
# compute deviance by cheating
(-2) * logLik(m6.1)
## 'log Lik.' 94.92499 (df=3)
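To connect this example back to the lppd formula, here is a sketch of my own (not part of the original example) that computes lppd for m6.8 directly from posterior samples; it uses extract.samples() and log_sum_exp() from the rethinking package, and lppd_by_hand is just an illustrative name.
set.seed(7)
post <- extract.samples(m6.8, n = 1e4)   # S posterior draws of a, b, sigma
S <- nrow(post)
lppd_by_hand <-
  sapply(1:nrow(Brains), function(i) {
    # log-likelihood of row i for each posterior draw
    ll <- dnorm(Brains$brain_size[i],
                mean = post$a + post$b * Brains$body_mass.s[i],
                sd   = post$sigma, log = TRUE)
    log_sum_exp(ll) - log(S)   # log of the average likelihood for row i
  }) %>%
  sum()
-2 * lppd_by_hand   # on the deviance scale; roughly comparable to dev above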
The deviance computed using our data underestimates the deviance of the model using new data. Here’s a relatively simple simulation to demonstrate.
(The book uses a fancier version, but this will give you a sense for how the fancier version works.)
Dev.in.out <-
function(n = 20, b = c(0.15, -0.4), sigma = 1) {
p <- length(b)
k <- p + 1
# Create two data sets
X <- runif(n * p, 0, 1) %>% matrix(nrow = n)
y <- X %*% b + rnorm(n, 0, sigma)
D.in <- cbind(y, X) %>% data.frame() %>% setNames(c("y", paste0("x", 1:p)))
X <- runif(n * p, 0, 1) %>% matrix(nrow = n)
y <- X %*% b + rnorm(n, 0, sigma)
D.out <- cbind(y, X) %>% data.frame() %>% setNames(c("y", paste0("x", 1:p)))
# Fit using the first one
m <-
quap(
alist(
y ~ dnorm(mu, sigma),
mu <- b1 * x1 + b2 * x2,
b1 ~ dnorm(0, 5),
b2 ~ dnorm(0, 5),
sigma ~ dunif(0,5)
),
data = D.in
)
# compute deviance twice: in-sample and out-of-sample
c(
dev.in = -2 * logLik(m),
dev.out = -2 *
sum(dnorm(D.out$y, as.matrix(D.out[,-1]) %*% coef(m)[-k], coef(m)[k], log = TRUE))
)
}
# Simulate data with y ~ Norm(0.15 * x1 - 0.4 * x2, 1); x1 and x2 random
Dev.in.out(n = 20, b = c(0.15, -0.4), sigma = 1)
## dev.in dev.out
## 65.18222 61.76786
Sim.in.out <- do(500) * Dev.in.out(n = 20, b = c(0.15, -0.4), sigma = 1)
## Using parallel package.
## * Set seed with set.rseed().
## * Disable this message with options(`mosaic:parallelMessage` = FALSE)
Sim.in.out %>% gather(type, dev) %>%
group_by(type) %>%
summarise(mean = mean(dev), lo = quantile(dev, 0.25), hi = quantile(dev, 0.75)) %>%
gf_pointrange( mean + lo + hi ~ type)
gf_point(dev.out ~ dev.in, data = Sim.in.out, alpha = 0.4) %>%
gf_abline(slope = 1, intercept = 0, color = "red", alpha = 0.6)
## Warning: geom_abline(): Ignoring `mapping` because `slope` and/or `intercept`
## were provided.
mean(~(dev.out - dev.in), data = Sim.in.out)
## [1] 7.671298
This is the same problem that (unadjusted) \(R^2\) has. So we need an adjustment.
require(tidyr)
Sims0 <- expand.grid(n = 20, k = 1:5, rep = 1:1e3)
Sims <-
bind_cols(
Sims0,
apply(Sims0, 1, function(x) sim.train.test(N = x["n"], k = x["k"])) %>%
t() %>% # flip rows/cols
data.frame() %>% # convert to data.frame
setNames(c("dev.in", "dev.out")) # give better names to vars
)
# reshape, then compute mean and PI
Sims2 <-
Sims %>%
gather(dev_type, dev, dev.in : dev.out) %>%
group_by(n, k, dev_type) %>%
summarise(
mean = mean(dev),
lo = PI(dev, .50)[1],
hi = PI(dev, .50)[2]
)
## `summarise()` has grouped output by 'n', 'k'. You can override using the `.groups` argument.
# (The book's version instead runs sim.train.test() in parallel, along the lines of
#  mcreplicate(1e4, sim.train.test(N = N, k = k), mc.cores = 4).)
gf_pointrange(
mean + lo + hi ~ k | ~ paste("N =", n), data = Sims2,
color = ~ dev_type,
position = position_dodge(width = 0.5)) %>%
gf_labs(x = "number of parameters", y = "deviance")
Alternatively, we could use boxplots. This avoids the need to group and summarize the simulated data, but the tails of the distributions make the main story harder to see.
Sims %>%
gather(dev_type, dev, dev.in : dev.out) %>%
gf_boxplot( dev ~ factor(k), color = ~ dev_type) %>%
gf_facet_wrap(~ paste("N =", n)) %>%
gf_labs(x = "number of parameters", y = "deviance")
The simulations above show that our in-sample deviance/lppd overestimates how well the model will do at predicting new data. As with \(R^2\), we’d like to make an adjustment to correct for this.
PSIS (Pareto-smoothed importance sampling) is based on leave-one-out cross-validation.
\[\begin{align*} \mathrm{lppd} &= \sum_{i = 1}^{n} \log \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s) \\ \mathrm{lppd}_{CV} &= \sum_{i = 1}^{n} \log \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_{-i,s}) \end{align*}\]
\(\theta_{-i, s}\) is row \(s\) of the posterior if we fit our model with all of the data except for row \(i\).
This would be very expensive to compute (because we would need to fit lots of models).
PSIS provides a clever way to approximate this without fitting so many models. Big ideas:
Removing one row from the data doesn’t change the posterior very much unless that row is “unexpected” by the model.
We can approximate the LOO posteriors by weighted sampling from the full posterior.
Weights are determined by how unexpected a row of the data is
\[ w_{i,s} = \frac{1}{{\sf Pr}(\mathrm{data}_i \mid \theta_s)} \]
This way of weighting is called importance sampling.
An additional step adjusts the weights, based on the fact that the largest weights should approximately follow a (generalized) Pareto distribution. This Pareto smoothing makes the estimate less sensitive to a few extreme, poorly estimated weights, and it provides a diagnostic (the Pareto \(k\) values reported below) for when the approximation might not be very good.
Using the smoothed version of the weights we get
\[\begin{align*} \mathrm{elpd}_{\mathrm{loo}} = \mathrm{lppd}_{\mathrm{IS}} &= \sum_{i=1}^n \log \left( \frac{\sum_{s = 1}^S w_{i,s} {\sf Pr}(\mathrm{data}_i \mid \theta_s)} {\sum_{s = 1}^S w_{i,s}}\right) \\ &\approx \sum_{i = 1}^{n} \log \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_{-i,s}) \\ & = \mathrm{lppd}_{CV} \end{align*}\]
Remember: \({\sf Pr}(\mathrm{data}_i \mid \theta_s)\) is just the likelihood of row \(i\) of the data using parameters from row \(s\) of the posterior. Likewise, \({\sf Pr}(\mathrm{data}_i \mid \theta_{-i,s})\) is the same but uses the posterior distribution of a model that omits row \(i\) from the data.
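To make the formula concrete, here is a sketch of my own (not from the notes) of the raw importance-sampling version, assuming ll is an \(S \times n\) matrix of log-likelihood values like the one loo::extract_log_lik() returns below. Real PSIS additionally Pareto-smooths the weights (and works on the log scale for numerical stability), so this is only for intuition; lppd_is is an illustrative name.
lppd_is <- function(ll) {
  sum(sapply(1:ncol(ll), function(i) {
    lik <- exp(ll[, i])            # Pr(data_i | theta_s) for each posterior draw s
    w   <- 1 / lik                 # unsmoothed importance weights
    log(sum(w * lik) / sum(w))     # weighted average likelihood, then log
  }))
}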
Below we present two information criteria that attempt to adjust deviance (\(D\)) so that it provides a better estimate for out-of-sample performance.
\(AIC = D_{\mathrm{train}} + 2p\)
This penalizes the model for each parameter, but it is only an accurate estimate of out-of-sample performance if the priors are flat (or overwhelmed by the likelihood), the posterior is approximately multivariate Gaussian, and the number of observations is much greater than the number of parameters.
Since flat priors are rarely best, we won’t do much with AIC.
\(\mathrm{WAIC} = -2 \mathrm{lppd}+ 2 p_{\mbox{WAIC}}\).
\(p_{\mbox{WAIC}} = \sum_{i=1}^{n} \operatorname{Var}_{\theta} \log {\sf Pr}(\mathrm{data}_i \mid \theta)\)
Idea: the more that changing a parameter (among its posterior-plausible values) changes the likelihood, the more that parameter allows the model to flexibly (over)fit the data.
So instead of counting all parameters the same (as in AIC), we penalize the model according to the overall amount of flexibility it has.
This will be very important in some Bayesian models that have many parameters that lend relatively little flexibility to the model.
\(p_{\mbox{WAIC}}\) is sometimes called the effective number of parameters.
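Just to make the WAIC formula concrete, here is a naive sketch of my own (not from the notes) that computes it from an \(S \times n\) log-likelihood matrix ll, again like the one loo::extract_log_lik() returns below; real implementations such as loo::waic() are more careful numerically, and waic_by_hand is an illustrative name.
waic_by_hand <- function(ll) {
  lppd   <- sum(apply(ll, 2, function(x) log(mean(exp(x)))))  # lppd, one column (observation) at a time
  p_waic <- sum(apply(ll, 2, var))                            # penalty: variance of log-likelihood per observation
  c(WAIC = -2 * (lppd - p_waic), p_WAIC = p_waic)
}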
We don’t ever want to compute these by hand for real models. They are involved computations that require careful programming to avoid accumulating rounding errors.
Fortunately, R will do all the work for us. Our job will be learning how to make use of these computations.
For now, one quick example. Note that we need to tell Stan to keep track of the log-likelihood information it needs to compute these.
Pallets2 <-
fastR2::Pallets %>%
mutate(
emp_idx = as.numeric(factor(employee)),
day_idx = as.numeric(factor(day))
)
set.seed(666)
u6 <- ulam(
data = Pallets2,
alist(
pallets ~ dnorm(mu, sigma),
mu <- b[emp_idx] + d[day_idx],
b[emp_idx] ~ dnorm(125, 20),
d[day_idx] ~ dnorm(0, 10),
sigma ~ dexp(1)
),
chains = 4, iter = 4000, warmup = 1000, cores = 4, refresh = 0,
log_lik = TRUE, # NB: need to tell Stan to keep track of log-likelihood
file = "fits/test2-u6")
u6 %>% stanfit() %>% loo::loo()
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
##
## Computed from 12000 by 20 log-likelihood matrix
##
## Estimate SE
## elpd_loo -46.9 4.2
## p_loo 9.1 2.7
## looic 93.9 8.4
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 14 70.0% 1940
## (0.5, 0.7] (ok) 3 15.0% 1282
## (0.7, 1] (bad) 3 15.0% 23
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
u6 %>% stanfit() %>% loo::extract_log_lik() %>% waic()
## Warning:
## 5 (25.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
##
## Computed from 12000 by 20 log-likelihood matrix
##
## Estimate SE
## elpd_waic -45.3 3.5
## p_waic 7.5 2.0
## waic 90.7 7.0
##
## 5 (25.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
u6a <- u6  # a copy of u6, just to illustrate the format of compare() output
rethinking::compare(u6, u6a)
## WAIC SE dWAIC dSE.u6 pWAIC weight
## u6 90.69338 6.831297 0 NA 7.493209 0.5
## u6a 90.69338 6.831297 0 0 7.493209 0.5