Come up with something better than \(R^2\) for assessing how well our model is “working” (where working = predicting new responses).
Understand these new measures well enough to interpret them and use them to compare models.
This material is based on Rethinking, Chapter 7.
\(R^2\) has two basic problems.
It measures the wrong thing.
\(R^2\) compares model predictions to the data used to train the model. This is referred to as in-sample prediction or retrodiction. This has two downsides:
We really want to know how well we can expect our model to perform on new data, not on the current data.
It rewards complex models, and hence over-fitting.
As we add parameters to our model, the model will become more flexible and better able to fit the data, so complex models win the \(R^2\) competition, even when they make silly predictions due to overfitting.
It measures the wrong way.
\(R^2\) comes from the frequentist world of multiple regression. We would prefer a measure that is more in tune with Bayesian methods and more appropriate for a wider range of models.
Squares of residuals are not the best measure for all models – sometimes it isn’t even clear how one would calculate squares of residuals. Our new measures will be built on something more fundamental to the model.
You may have seen adjusted \(R^2\) in R output from other classes. It’s a crude way of dealing with the fact that \(R^2\) tends to get higher whenever we add more predictors. It’s based on the idea that even random, useless predictors increase \(R^2\) by a predictable amount; add enough of them and the model fits the data perfectly.
Recall that \[ 1 - R^2 = \frac{\operatorname{Var}(r)}{\operatorname{Var}(y)} = \frac{\operatorname{Var}(\mbox{residuals})}{\operatorname{Var}(\mbox{observed response})} \] If we have just an intercept, then \(\operatorname{Var}(r) \approx \operatorname{Var}(y)\) because \(\hat{y} \approx \overline{y}\), so \(1-R^2 \approx 1\). If we have \(n\) observations and add \(k\) random (useless) predictors, then just by chance we expect roughly \[ 1 - R^2 \approx \frac{n-k-1}{n-1}. \]
Comparing the unexplained fraction for our model to this expected-by-chance baseline gives
\[ \mbox{adjusted } (1 - R^2) = \frac{1-R^2}{\frac{(n-k-1)}{(n-1)}} = (1-R^2) \frac{(n-1)}{(n-k-1)} \]
So
\[ \mbox{adjusted } R^2 = 1 - (1-R^2) \frac{(n-1)}{(n-k-1)} \] where \(n\) is the number of observations and \(k\) is the number of predictors (so the model has \(k+1\) parameters, counting the intercept).
This adjustment essentially introduces a penalty for each parameter and helps avoid rewarding overfitting. But we can do better than \(R^2\), even the adjusted version.
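As a quick sanity check of the formula, here is a small example of my own (it uses the built-in mtcars data, which is not part of these notes): compute adjusted \(R^2\) by hand and compare it to what summary() reports for an lm() fit.
model <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- 2                                  # number of predictors (wt and hp)
r2 <- summary(model)$r.squared
1 - (1 - r2) * (n - 1) / (n - k - 1)    # adjusted R^2 by hand
summary(model)$adj.r.squared            # should agree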
Our goal is to come up with a better way to assess our models, something better than (adjusted) \(R^2\). It will be a bit of a journey. Here’s the road map:
\[ \mbox{Information} \to \mbox{Entropy} \to \mbox{Divergence} \to \mathrm{lppd} \to \mbox{PSIS and WAIC} \]
PSIS (Pareto-smoothed importance sampling, applied here to leave-one-out cross-validation) and WAIC (the Widely Applicable Information Criterion) are where we are heading. For now, you can think of them as improvements on (adjusted) \(R^2\) that work better and for a wider range of Bayesian models. Details to come.
Intuition:
Let’s begin by considering the amount of information we gain when we observe some random process.
Suppose that the outcome we observed has probability \(p\). Let \(I(p)\) be the amount of information we gain from observing this outcome. \(I(p)\) depends on \(p\) but not on the outcome itself, and should satisfy the following properties.
1. \(I(1) =\)
2. \(I(0) =\)
3. Between \(p = 0\) and \(p = 1\), \(I()\) is
4. \(I(p_1 p_2) =\)
1. \(I(1) = 0\)
2. \(I(0) = \mbox{undefined } (\infty?)\)
3. Between \(p = 0\) and \(p = 1\), \(I()\) is decreasing and continuous.
4. \(I(p_1 p_2) = I(p_1) + I(p_2)\)
These properties should remind you of a function you have seen before.
We get a function with the desired properties if we define
\[ I(p) = - \log_{b}(p) \]
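A quick numerical check (my own, not from the notes) that \(I(p) = -\log_b(p)\) has the additivity property listed above; the helper name I() is just for illustration.
I <- function(p, base = 2) - log(p, base = base)
I(1/2) + I(1/6)   # info from a coin toss plus info from a die roll
I(1/2 * 1/6)      # info from observing both at once: the same value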
Now consider a random process \(X\) with \(n\) outcomes having probabilities \(\mathbf{p} = p_1, p_2, \dots, p_n\). That is, \[ P(X = x_i) = p_i, \]
The Shannon entropy (denoted \(H(X)\) or \(H(\mathbf{p})\)) is the average amount of information gained from each observation of the random process:
\[ H(X) = H(\mathbf{p}) = \sum p_i \cdot I(p_i) = - \sum p_i \log(p_i) \]
Compute the entropy of each of the following random processes
A “random” process \(X\) that always yields the same result
p <- 1
- sum(p * log2(p))
## [1] 0
A single toss of a fair coin
p <- c(1/2, 1/2)
- sum(p * log2(p))
## [1] 1
Two tosses of a fair coin
p <- c(1/4, 1/4, 1/4, 1/4)
- sum(p * log2(p))
## [1] 2
A single toss of a 30-70 coin
# less than a bit of entropy
p <- c(.3, .7)
- sum(p * log2(p))
## [1] 0.8812909
The roll of a die
p <- rep(1/6, 6)
- sum(p * log2(p))
## [1] 2.584963
In the text, the default is natural logarithms. In that case, the unit for entropy is called a nat. If we use base 2 logarithms, then the unit is called a bit or a Shannon.
\(H(X) \ge 0\)
Outcomes with probability 0 can be ignored.
\(H(X)\), like \(I(p_i)\), depends only on the probabilities, not on the outcomes themselves.
\(H\) is a continuous function.
Among all distributions with a fixed number of outcomes, \(H\) is maximized when the outcomes are equally likely.
Among equiprobable distributions, \(H\) increases as the number of outcomes increases.
\(H\) is additive in the following sense: if \(X\) and \(Y\) are independent random processes, then \(H(X, Y) = H(X) + H(Y)\). (This is why two tosses of a fair coin have entropy \(1 + 1 = 2\) shannons.)
\(H\) can be thought of as a measure of uncertainty. Uncertainty decreases as we make observations.
It’s pretty easy to write a function to compute entropy from a vector of probabilities.
H <- function(p, base = exp(1)) {
  p <- p / sum(p)   # rescale so probabilities add to 1
  p <- p[p != 0]    # drop zeros (0 * log(0) is taken to be 0; this avoids NaN)
  - sum(p * log(p, base = base))
}
# in nats
H(c(0.5, 0.5))
## [1] 0.6931472
H(c(0.3, 0.7))
## [1] 0.6108643
# in shannons
H(c(0.5, 0.5), base = 2)
## [1] 1
H(c(0.3, 0.7), base = 2)
## [1] 0.8812909
H_example <-
tibble(
p = seq(0, 1, by = 0.01),
H = purrr::map_dbl(p, ~ H(c(.x, 1 - .x)))
)
gf_line(H ~ p, data = H_example)
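A couple of quick numerical checks (mine, not from the notes) of the properties listed above, using the H() function just defined.
# the fair die has higher entropy than a loaded one (uniform maximizes H)
H(rep(1/6, 6), base = 2)
H(c(0.4, 0.3, 0.1, 0.1, 0.05, 0.05), base = 2)
# additivity: two independent fair coin tosses give 1 + 1 shannons
H(c(1/4, 1/4, 1/4, 1/4), base = 2)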
Kullback-Leibler (KL) divergence compares two distributions and asks “if we are anticipating \(\mathbf{q}\), but get \(\mathbf{p}\), how much more surprised will we be than if we had been expecting \(\mathbf{p}\) in the first place?”
Here’s the definition
\[\begin{align*} D_{KL}(\mathbf{p}, \mathbf{q}) &= \mathrm{expected\ difference\ in\ ``surprise"} \\ &= \sum p_i \left( I(q_i) - I(p_i) \right) \\ &= \sum p_i I(q_i) - \sum p_i I(p_i) \\ &= \sum p_i \log p_i - \sum p_i \log q_i \\ \end{align*}\]
This looks like the difference between two entropies. It almost is. The first one is actually a cross entropy where we use probabilities from one distribution and information from the other. We denote this \[ H(\mathbf{p}, \mathbf{q}) = \sum p_i I(q_i) = - \sum p_i \log(q_i) \] Note that \(H(\mathbf{p}) = H(\mathbf{p}, \mathbf{p})\), so \[\begin{align*} D_{KL} &= H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p}, \mathbf{p}) \\ &= H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p}) \end{align*}\]
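It is easy to turn these definitions into code. The helpers below are my own sketch (the names cross_entropy() and kl_div() are not from the notes); they assume p and q are probability vectors of the same length, and the last line checks that \(D_{KL}(\mathbf{p}, \mathbf{q}) = H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p})\) using the H() function defined above (natural logs in both).
# cross entropy: probabilities from p, information from q
cross_entropy <- function(p, q) - sum(p * log(q))
# KL divergence
kl_div <- function(p, q) sum(p * log(p / q))
p <- c(0.3, 0.7); q <- c(0.5, 0.5)
kl_div(p, q)
cross_entropy(p, q) - H(p)   # same thing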
Note: neither the cross entropy \(H(\mathbf{p}, \mathbf{q})\) nor the divergence \(D_{KL}(\mathbf{p}, \mathbf{q})\) is symmetric in \(\mathbf{p}\) and \(\mathbf{q}\). For example, for two “biased coins” with heads probabilities 0.7 and 0.01 (computed with the kl_div() helper sketched above):
kl_div(c(0.7, 0.3), c(0.01, 0.99))
## [1] 2.61577
kl_div(c(0.01, 0.99), c(0.7, 0.3))
## [1] 1.139498
The image below shows \(H(p, q)\) comparing tosses of “biased coins”. The points are at (0.05, 0.8) and (0.8, 0.05).
But if we want to know how well \(\mathbf{q}\) approximates an unknown \(\mathbf{p}\), we still have a problem – we don’t know \(\mathbf{p}\)! Time for the next leg of our journey.
Fortunately, it usually suffices to know which of two distributions approximates our target \(\mathbf{p}\) better (and by how much). And in the context of comparing models, there is a way to approximate the difference in fit without needing direct knowledge of \(\mathbf{p}\). So we can know (approximately) which attempt is closer to the target, and by how much, without knowing where (exactly) the target is. Of course, the target will matter some, so there must be a bit of a trick here. (There is: stay tuned.)
Suppose we want to know which of \(\mathbf{q}\) or \(\mathbf{r}\) is a better approximation to \(\mathbf{p}\). We might measure this with
\[\begin{align*} D_{KL}(\mathbf{p}, \mathbf{q}) - D_{KL}(\mathbf{p}, \mathbf{r}) & = H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p}, \mathbf{p}) - \left[ H(\mathbf{p}, \mathbf{r}) - H(\mathbf{p}, \mathbf{p}) \right] \\ & = H(\mathbf{p}, \mathbf{q}) - H(\mathbf{p}, \mathbf{r}) \\ & = - \sum p_i \left( \log q_i - \log r_i\right) \end{align*}\]
If \(D_{KL}(\mathbf{p}, \mathbf{q})\) is small, then our model (\(\mathbf{q}\)) fits the data (\(\mathbf{p}\)) well.
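A quick numerical check (my own, using the helpers sketched above) that the \(H(\mathbf{p})\) term really does cancel out of the comparison:
p <- c(0.3, 0.7); q <- c(0.5, 0.5); r <- c(0.1, 0.9)
kl_div(p, q) - kl_div(p, r)
cross_entropy(p, q) - cross_entropy(p, r)   # identical; H(p) has cancelled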
First we need to define \(\mathbf{p}\) and \(\mathbf{q}\).
\(p_i = 1/n\) for each observation \(i\) in the data
This is an approximation based on our data: if the observed rows are a representative sample from the true distribution \(\mathbf{p}\), the empirical distribution (weight \(1/n\) on each row) can stand in for \(\mathbf{p}\), which eliminates the final occurrence of \(\mathbf{p}\).
\(q_i = \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s) =\) the model’s probability of observing row \(i\) of the data, averaged over \(S\) posterior samples \(\theta_1, \dots, \theta_S\).
Using this approximation we get our definition of deviance and log-pointwise-predictive density (lppd).
\[\begin{align*} \sum_{i = 1}^n p_i \log q_i &= \sum_{i = 1}^n \underbrace{\frac{1}{n}}_{\Large p_i} \log \underbrace{\frac1S \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s)}_{\Large q_i} \\ &= \frac 1n \underbrace{ \sum_{i = 1}^n \log \frac1S \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s)}_{\Large \mathrm{lppd}} \end{align*}\]
\[\begin{align*} \mathrm{lppd} &= \sum_{i = 1}^{n} \log \underbrace{ \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s) }_{\Large q_i} \\ \end{align*}\]
\[ -2 \cdot \mathrm{lppd}\approx -2 \cdot \mathrm{loglik}(\mathrm{model}) = \mathrm{Deviance} \]
Notes:
The difference between deviance and lppd is where the averaging happens: deviance plugs a single point estimate of the parameters into the log-likelihood, while lppd averages the likelihood over posterior samples before taking the log.
The scaling by \(-2\) is there for historical reasons, but is largely irrelevant to us. This is sometimes referred to as “the deviance scale”.
You will sometimes see \(-2 \cdot \mathrm{lppd}\) as well.
In either case, smaller values indicate a better (in-sample) fit.
How can you calculate deviance for the following model?
Brains <-
tibble(
species = c("afarensis", "africanus", "habilis", "boisei",
"rudolfensis", "ergaster", "sapiens"),
brain_size = c(438, 452, 612, 521, 752, 871, 1350),
body_mass = c(37.0, 35.5, 34.5, 41.5, 55.5, 61.0, 53.5),
body_mass.s = zscore(body_mass)
)
m6.8 <- quap(
alist(brain_size ~ dnorm(mu, sigma),
mu <- a + b * body_mass.s),
data = Brains,
start = list(
a = mean(Brains$brain_size),
b = 0,
sigma = sd(Brains$brain_size)
),
method = "Nelder-Mead"
)
# extract QUAP estimates
theta <- coef(m6.8); theta
## a b sigma
## 713.6931 225.5774 212.9408
# compute deviance
dev <- (-2) * sum(dnorm(
Brains$brain_size,
mean = theta[1] + theta[2] * Brains$body_mass.s,
sd = theta[3],
log = TRUE
))
dev %>% setNames("dev") # setNames just labels the output
## dev
## 94.92499
-2 * logLik(m6.8) # for comparison
## 'log Lik.' 94.92499 (df=3)
# fit model with lm
m6.1 <- lm(brain_size ~ body_mass, data = Brains)
# compute deviance by cheating
(-2) * logLik(m6.1)
## 'log Lik.' 94.92499 (df=3)
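To connect this example back to the lppd formula, here is a sketch of my own (not part of the original example) that computes lppd for m6.8 directly from posterior samples; it uses extract.samples() and log_sum_exp() from the rethinking package, and lppd_by_hand is just an illustrative name.
set.seed(7)
post <- extract.samples(m6.8, n = 1e4)   # S posterior draws of a, b, sigma
S <- nrow(post)
lppd_by_hand <-
  sapply(1:nrow(Brains), function(i) {
    # log-likelihood of row i for each posterior draw
    ll <- dnorm(Brains$brain_size[i],
                mean = post$a + post$b * Brains$body_mass.s[i],
                sd   = post$sigma, log = TRUE)
    log_sum_exp(ll) - log(S)   # log of the average likelihood for row i
  }) %>%
  sum()
-2 * lppd_by_hand   # on the deviance scale; roughly comparable to dev above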
The deviance computed using our data underestimates the deviance of the model using new data. Here’s a relatively simple simulation to demonstrate.
(The book uses a fancier version, but this will give you a sense for how the fancier version works.)
Dev.in.out <-
function(n = 20, b = c(0.15, -0.4), sigma = 1) {
p <- length(b)
k <- p + 1
# Create two data sets
X <- runif(n * p, 0, 1) %>% matrix(nrow = n)
y <- X %*% b + rnorm(n, 0, sigma)
D.in <- cbind(y, X) %>% data.frame() %>% setNames(c("y", paste0("x", 1:p)))
X <- runif(n * p, 0, 1) %>% matrix(nrow = n)
y <- X %*% b + rnorm(n, 0, sigma)
D.out <- cbind(y, X) %>% data.frame() %>% setNames(c("y", paste0("x", 1:p)))
# Fit using the first one
m <-
quap(
alist(
y ~ dnorm(mu, sigma),
mu <- b1 * x1 + b2 * x2,
b1 ~ dnorm(0, 5),
b2 ~ dnorm(0, 5),
sigma ~ dunif(0,5)
),
data = D.in
)
# compute deviance twice: in-sample and out-of-sample
c(
dev.in = -2 * logLik(m),
dev.out = -2 *
sum(dnorm(D.out$y, as.matrix(D.out[,-1]) %*% coef(m)[-k], coef(m)[k], log = TRUE))
)
}
# Simulate data with y ~ Norm(0.15 * x1 - 0.4 * x2, 1); x1 and x2 random
Dev.in.out(n = 20, b = c(0.15, -0.4), sigma = 1)
## dev.in dev.out
## 65.18222 61.76786
Sim.in.out <- do(500) * Dev.in.out(n = 20, b = c(0.15, -0.4), sigma = 1)
## Using parallel package.
## * Set seed with set.rseed().
## * Disable this message with options(`mosaic:parallelMessage` = FALSE)
Sim.in.out %>% gather(type, dev) %>%
group_by(type) %>%
summarise(mean = mean(dev), lo = quantile(dev, 0.25), hi = quantile(dev, 0.75)) %>%
gf_pointrange( mean + lo + hi ~ type)
gf_point(dev.out ~ dev.in, data = Sim.in.out, alpha = 0.4) %>%
gf_abline(slope = 1, intercept = 0, color = "red", alpha = 0.6)
## Warning: geom_abline(): Ignoring `mapping` because `slope` and/or `intercept`
## were provided.
mean(~(dev.out - dev.in), data = Sim.in.out)
## [1] 7.671298
This is the same problem that (unadjusted) \(R^2\) has. So we need an adjustment.
require(tidyr)
Sims0 <- expand.grid(n = 20, k = 1:5, rep = 1:1e3)
Sims <-
bind_cols(
Sims0,
apply(Sims0, 1, function(x) sim.train.test(N = x["n"], k = x["k"])) %>%
t() %>% # flip rows/cols
data.frame() %>% # convert to data.frame
setNames(c("dev.in", "dev.out")) # give better names to vars
)
# reshape, then compute mean and PI
Sims2 <-
Sims %>%
gather(dev_type, dev, dev.in : dev.out) %>%
group_by(n, k, dev_type) %>%
summarise(
mean = mean(dev),
lo = PI(dev, .50)[1],
hi = PI(dev, .50)[2]
)
## `summarise()` has grouped output by 'n', 'k'. You can override using the `.groups` argument.
# (The book's version instead runs sim.train.test() in parallel, along the lines of
#  mcreplicate(1e4, sim.train.test(N = N, k = k), mc.cores = 4).)
gf_pointrange(
mean + lo + hi ~ k | ~ paste("N =", n), data = Sims2,
color = ~ dev_type,
position = position_dodge(width = 0.5)) %>%
gf_labs(x = "number of parameters", y = "deviance")
Alternatively, we could use boxplots. This avoids the need to group and summarize the simulated data, but the tails of the distributions make the main story harder to see.
Sims %>%
gather(dev_type, dev, dev.in : dev.out) %>%
gf_boxplot( dev ~ factor(k), color = ~ dev_type) %>%
gf_facet_wrap(~ paste("N =", n)) %>%
gf_labs(x = "number of parameters", y = "deviance")
The simulations above show that our in-sample deviance/lppd overestimates how well the model will do at predicting new data. As with \(R^2\), we’d like to make an adjustment to correct for this.
PSIS (Pareto-smoothed importance sampling) is based on leave-one-out cross-validation.
\[\begin{align*} \mathrm{lppd} &= \sum_{i = 1}^{n} \log \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_s) \\ \mathrm{lppd}_{CV} &= \sum_{i = 1}^{n} \log \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_{-i,s}) \end{align*}\]
\(\theta_{-i, s}\) is row \(s\) of the posterior if we fit our model with all of the data except for row \(i\).
This would be very expensive to compute (because we would need to fit lots of models).
PSIS provides a clever way to approximate this without fitting so many models. Big ideas:
Removing one row from the data doesn’t change the posterior very much unless that row is “unexpected” by the model.
We can approximate the LOO posteriors by weighted sampling from the full posterior.
Weights are determined by how unexpected a row of the data is
\[ w_{i,s} = \frac{1}{{\sf Pr}(\mathrm{data}_i \mid \theta_s)} \]
This way of weighting is called importance sampling.
An additional step adjusts the weights, based on the fact that the largest weights should approximately follow a (generalized) Pareto distribution. This Pareto smoothing makes the estimate less sensitive to a few extreme, poorly estimated weights, and it provides a diagnostic (the Pareto \(k\) values reported below) for when the approximation might not be very good.
Using the smoothed version of the weights we get
\[\begin{align*} \mathrm{elpd}_{\mathrm{loo}} = \mathrm{lppd}_{\mathrm{IS}} &= \sum_{i=1}^n \log \left( \frac{\sum_{s = 1}^S w_{i,s} {\sf Pr}(\mathrm{data}_i \mid \theta_s)} {\sum_{s = 1}^S w_{i,s}}\right) \\ &\approx \sum_{i = 1}^{n} \log \frac{1}{S} \sum_{s = 1}^S {\sf Pr}(\mathrm{data}_i \mid \theta_{-i,s}) \\ & = \mathrm{lppd}_{CV} \end{align*}\]
Remember: \({\sf Pr}(\mathrm{data}_i \mid \theta_s)\) is just the likelihood of row \(i\) of the data using parameters from row \(s\) of the posterior. Likewise, \({\sf Pr}(\mathrm{data}_i \mid \theta_{-i,s})\) is the same but uses the posterior distribution of a model that omits row \(i\) from the data.
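To make the formula concrete, here is a sketch of my own (not from the notes) of the raw importance-sampling version, assuming ll is an \(S \times n\) matrix of log-likelihood values like the one loo::extract_log_lik() returns below. Real PSIS additionally Pareto-smooths the weights (and works on the log scale for numerical stability), so this is only for intuition; lppd_is is an illustrative name.
lppd_is <- function(ll) {
  sum(sapply(1:ncol(ll), function(i) {
    lik <- exp(ll[, i])            # Pr(data_i | theta_s) for each posterior draw s
    w   <- 1 / lik                 # unsmoothed importance weights
    log(sum(w * lik) / sum(w))     # weighted average likelihood, then log
  }))
}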
Below we present two information criteria that attempt to adjust deviance (\(D\)) so that it provides a better estimate for out-of-sample performance.
\(AIC = D_{\mathrm{train}} + 2p\)
This penalizes the model for each parameter, but it is only an accurate estimate of out-of-sample performance if the priors are flat (or overwhelmed by the likelihood), the posterior is approximately multivariate Gaussian, and the number of observations is much greater than the number of parameters.
Since flat priors are rarely best, we won’t do much with AIC.
\(\mathrm{WAIC} = -2 \mathrm{lppd}+ 2 p_{\mbox{WAIC}}\).
\(p_{\mbox{WAIC}} = \sum_{i=1}^{n} \operatorname{Var}_{\theta} \log {\sf Pr}(\mathrm{data}_i \mid \theta)\)
Idea: the more that changing a parameter (among its posterior-plausible values) changes the likelihood, the more that parameter allows the model to flexibly (over)fit the data.
So instead of counting all parameters the same (as in AIC), we penalize the model according to the overall amount of flexibility it has.
This will be very important in some Bayesian models that have many parameters that lend relatively little flexibility to the model.
\(p_{\mbox{WAIC}}\) is sometimes called the effective number of parameters.
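Just to make the WAIC formula concrete, here is a naive sketch of my own (not from the notes) that computes it from an \(S \times n\) log-likelihood matrix ll, again like the one loo::extract_log_lik() returns below; real implementations such as loo::waic() are more careful numerically, and waic_by_hand is an illustrative name.
waic_by_hand <- function(ll) {
  lppd   <- sum(apply(ll, 2, function(x) log(mean(exp(x)))))  # lppd, one column (observation) at a time
  p_waic <- sum(apply(ll, 2, var))                            # penalty: variance of log-likelihood per observation
  c(WAIC = -2 * (lppd - p_waic), p_WAIC = p_waic)
}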
We don’t ever want to compute these by hand for real models. They are involved computations that require careful programming to avoid accumulating rounding errors.
Fortunately, R will do all the work for us. Our job will be learning how to make use of these computations.
For now, one quick example. Note that we need to tell Stan to keep track of the log-likelihood information it needs to compute these.
Pallets2 <-
fastR2::Pallets %>%
mutate(
emp_idx = as.numeric(factor(employee)),
day_idx = as.numeric(factor(day))
)
set.seed(666)
u6 <- ulam(
data = Pallets2,
alist(
pallets ~ dnorm(mu, sigma),
mu <- b[emp_idx] + d[day_idx],
b[emp_idx] ~ dnorm(125, 20),
d[day_idx] ~ dnorm(0, 10),
sigma ~ dexp(1)
),
chains = 4, iter = 4000, warmup = 1000, cores = 4, refresh = 0,
log_lik = TRUE, # NB: need to tell Stan to keep track of log-likelihood
file = "fits/test2-u6")
u6 %>% stanfit() %>% loo::loo()
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
##
## Computed from 12000 by 20 log-likelihood matrix
##
## Estimate SE
## elpd_loo -46.9 4.2
## p_loo 9.1 2.7
## looic 93.9 8.4
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 14 70.0% 1940
## (0.5, 0.7] (ok) 3 15.0% 1282
## (0.7, 1] (bad) 3 15.0% 23
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
u6 %>% stanfit() %>% loo::extract_log_lik() %>% waic()
## Warning:
## 5 (25.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
##
## Computed from 12000 by 20 log-likelihood matrix
##
## Estimate SE
## elpd_waic -45.3 3.5
## p_waic 7.5 2.0
## waic 90.7 7.0
##
## 5 (25.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
u6a <- u6  # a copy of u6, just to illustrate the format of compare() output
rethinking::compare(u6, u6a)
## WAIC SE dWAIC dSE.u6 pWAIC weight
## u6 90.69338 6.831297 0 NA 7.493209 0.5
## u6a 90.69338 6.831297 0 0 7.493209 0.5