Simulating King Markov

Let’s let the computer simulate this algorithm. And since the computer is doing all the work, let’s make a general function so we can experiment a bit.

KingMarkov <- function(
  num_steps    = 1e5,
  population   = 1:5,
  island_names = 1:length(population),
  start        = 1,
  J = function(a, b) {1 / (length(population) - 1)}
  ) {
  
  num_islands  <- length(population)
  island_seq   <- rep(NA, num_steps)  # trick to pre-allocate memory
  proposal_seq <- rep(NA, num_steps)  # trick to pre-allocate memory
  current      <- start
  proposal     <- NA
  
  for (i in 1:num_steps) {
    # record current island
    island_seq[i]   <- current
    proposal_seq[i] <- proposal
    
    # propose one of the other islands
    other_islands <- setdiff(1:num_islands, current)
    proposal <- 
      sample(other_islands, 1, 
             prob = purrr::map_dbl(other_islands, ~ J(current, .x)))
    # move? (assumes J() is symmetric)
    prob_move <- population[proposal] / population[current]
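    # note: prob_move may exceed 1; runif(1) < prob_move is then always TRUE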
    
    # new current island (either the current island or the proposal)
    current   <- ifelse(runif(1) < prob_move, proposal, current)
  }
  tibble(
    step     = 1:num_steps,
    island   = island_names[island_seq],
    proposal = island_names[proposal_seq]
  )
}

Question 1: Look at the code above and answer the following.

  • What are the default populations of the islands?

  • What is the default jump rule?

  • Explain what each of the following bits of code is doing:

    • other_islands <- setdiff(1:num_islands, current)
    • prob = purrr::map(other_islands, ~ J(current, .x))
    • the call to sample()
    • prob_move <- population[proposal] / population[current]
    • current <- ifelse(runif(1) < prob_move, proposal, current)
    • the call to tibble()

Jumping to any island

Now let’s simulate the first 5000 nights of King Markov’s reign.

Tour <- KingMarkov(5000) 
Target <- tibble(island = 1:5, prop = (1:5)/ sum(1:5))
gf_line(island ~ step, data = Tour %>% filter(step <= 200)) %>%
  gf_point(proposal ~ step, data = Tour %>% filter(step <= 200),
           color = "red", alpha = 0.4) %>%
  gf_refine(scale_y_continuous(breaks = 1:10)) 
## Warning: Removed 1 rows containing missing values (geom_point).

gf_dhistogram( ~ island, data = Tour, binwidth = 1, color = "black") %>%
  gf_point(prop ~ island, data = Target, color = "red") %>%
  gf_refine(scale_x_continuous(breaks = 1:10))

Question 2: Look at the first plot. It shows where the king stayed each of the first 200 nights.

  • Did the king ever stay two consecutive nights on the same island? (How can you tell from the plot?)

  • Did the king ever stay two consecutive nights on the smallest island? (How can you answer this without looking at the plot?)

Jumping only to neighbor islands

What if we only allow jumping to neighboring islands? (Imagine the islands are arranged in a circle and we only sail clockwise or counterclockwise around the circle to the nearest island in one direction or the other.)

neighbor <- function(a, b) as.numeric(abs(a-b) %in% c(1,4)) 
Tour <- KingMarkov(10000, population = 1:5, J = neighbor) 
Target <- tibble(island = 1:5, prop = (1:5)/ sum(1:5))
gf_line(island ~ step, data = Tour %>% filter(step <= 200)) %>%
  gf_refine(scale_y_continuous(breaks = 1:10))

gf_dhistogram( ~ island, data = Tour, binwidth = 1) %>%
  gf_point(prop ~ island, data = Target, color = "red") %>%
  gf_refine(scale_x_continuous(breaks = 1:10))

The effect of visiting only neighbors is more dramatic with more islands.

neighbor <- function(a, b) as.numeric(abs(a-b) %in% c(1,9)) 

Tour1 <- KingMarkov(10000, population = 1:10) 
Tour2 <- KingMarkov(10000, population = 1:10, J = neighbor) 
Target <- tibble(island = 1:10, prop = (1:10)/ sum(1:10))
gf_line(island ~ step, data = Tour1 %>% filter(step <= 200)) %>%
  gf_refine(scale_y_continuous(breaks = 1:10)) %>%
  gf_labs(title = "Jump to any island") /
gf_line(island ~ step, data = Tour2 %>% filter(step <= 200)) %>%
  gf_refine(scale_y_continuous(breaks = 1:10)) %>%
  gf_labs(title = "Only jump to a neighbor island") 

gf_dhistogram( ~ island, data = Tour1, binwidth = 1) %>%
  gf_point(prop ~ island, data = Target, color = "red") %>%
  gf_refine(scale_x_continuous(breaks = 1:10)) %>%
  gf_labs(title = "Jump to any island") /
gf_dhistogram( ~ island, data = Tour2, binwidth = 1) %>%
  gf_point(prop ~ island, data = Target, color = "red") %>%
  gf_refine(scale_x_continuous(breaks = 1:10)) %>% 
  gf_labs(title = "Only jump to a neighbor island") 

Markov Chains and Posterior Sampling

That was a nice story, and some nice probability theory. But what does it have to do with Bayesian computation? Regular Markov Chains (and some generalizations of them) can be used to sample from a posterior distribution:

  • state = island = set of parameter values
    • in typical applications, this will be an infinite state space
  • population = prior * likelihood
    • importantly, we do not need to normalize the posterior; that would typically be a very computationally expensive thing to do
  • start in any island = start at any parameter values
    • convergence may be faster from some starting states than from others, but in principle, any state will do
  • randomly choose a proposal island = randomly select a proposal set of parameter values
    • if the posterior is greater there, move
    • if the posterior is smaller, move anyway with probability

\[A = \frac{\mbox{proposal `posterior'}}{\mbox{current `posterior'}}\]

Metropolis-Hastings variation:

  • More choices for \(J()\) (which need not be symmetric) give more opportunity to tune for convergence
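
With an asymmetric jump rule, the acceptance probability picks up a correction factor so that the long-run probabilities still come out right:

\[A = \frac{\mbox{proposal `posterior'} \cdot J(\mbox{proposal} \to \mbox{current})}{\mbox{current `posterior'} \cdot J(\mbox{current} \to \mbox{proposal})}\]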

Other variations:

  • Can allow \(M\) to change over the course of the algorithm. (No longer time-homogeneous.)

Example 1: Estimating a proportion

To see how this works in practice, let’s consider our familiar model that has a Bernoulli likelihood and a beta prior:

  • \(Y_i \sim {\sf Bern}(\theta) = {\sf Binom}(1, \theta)\)
  • \(\theta \sim {\sf Beta}(a, b)\)

Since we are already familiar with this situation, we know that the posterior will be a beta distribution when the prior is a beta distribution. We can use this fact to check how well the algorithm works.

Let’s code up our Metropolis algorithm for this situation. There is a new wrinkle, however: the state space for the parameter is the continuous interval \([0,1]\). So we need a new kind of jump rule.

  • Instead of sampling from a finite state space, we use rnorm()

  • The standard deviation of the normal distribution (called step_size in the code below) controls how large a step we take (on average). This number has nothing to do with the model; it is a tuning parameter of the algorithm.

metro_bern <- function(
  x, n,            # x = successes, n = trials
  step_size = 0.01,     # sd of jump distribution
  start = 0.5,     # value of theta to start at
  num_steps = 1e4, # number of steps to run the algorithm
  prior = dunif,   # function describing prior
  ...              # additional arguments for prior
  ) {
  
  theta             <- rep(NA, num_steps)  # trick to pre-allocate memory
  proposed_theta    <- rep(NA, num_steps)  # trick to pre-allocate memory
  move              <- rep(NA, num_steps)  # trick to pre-allocate memory
  theta[1]          <- start
 
  for (i in 1:(num_steps-1)) {
    # head to new "island"
    proposed_theta[i + 1] <- rnorm(1, theta[i], step_size)
    
    if (proposed_theta[i + 1] <= 0 ||
        proposed_theta[i + 1] >= 1) {
      prob_move <- 0           # because prior is 0
    } else {
      current_prior       <- prior(theta[i], ...)
      current_likelihood  <- dbinom(x, n, theta[i])
      current_posterior   <- current_prior * current_likelihood
      proposed_prior      <- prior(proposed_theta[i+1], ...)
      proposed_likelihood <- dbinom(x, n, proposed_theta[i+1])
      proposed_posterior  <- proposed_prior * proposed_likelihood
      prob_move           <- proposed_posterior / current_posterior
    }
    
    
    # sometimes we "sail back"
    if (runif(1) > prob_move) { # reject: sail back to the current island
      move[i + 1]  <- FALSE
      theta[i + 1] <- theta[i]
    } else {                    # accept: move to the proposed island
      move[i + 1]  <- TRUE
      theta[i + 1] <- proposed_theta[i + 1]
    }
  }
  
  tibble(
    step = 1:num_steps, 
    theta = theta,
    proposed_theta = proposed_theta,
    move = move, step_size = step_size
  )
}

Question 3: What happens if the proposed value for \(\theta\) is not in the interval \([0,1]\)? Why?

Question 4: What do proposed_posterior and current_posterior correspond to in the story of King Markov?

Question 5: Notice that we are using the unnormalized posterior. Why don’t we need to normalize the posterior? Why is it important that we don’t have to normalize the posterior?

Looking at posterior samples

The purpose of all this was to get samples from the posterior distribution. We can use histograms or density plots to see what our MCMC algorithm shows us for the posterior distribution. When using MCMC algorithms, we won’t typically have ways of knowing the “right answer”. But in this case, we know the posterior is Beta(6, 11), so we can compare our posterior samples to that distribution to see how well things worked.

Let’s try a nice small step size like 0.002 (0.2% of the width of the parameter interval).

set.seed(341)
Tour <- metro_bern(5, 15, step_size = 0.002)
gf_dhistogram(~ theta, data = Tour, bins = 100) %>%
  gf_dens(~ theta, data = Tour) %>%
  gf_dist("beta", shape1 = 6, shape2 = 11, color = "red")

Hmm. That’s not too good. Let’s see if we can figure out why.

Trace Plots

A trace plot shows the “tour of King Markov’s ship” – the sequence of parameter values sampled (in order).

gf_line(theta ~ step, data = Tour) %>% 
  gf_hline(yintercept = ~ 5/15, color = "red") 

Question 6: Why does the trace plot begin at a height of 0.5?

Question 7: What features of this trace plot indicate that our posterior sampling isn’t working well? Would making the step size larger or smaller be more likely to help?

Question 8: Since we know that the posterior distribution is Beta(6, 11), how could we use R to show us what an ideal trace plot would look like?

Comparing step sizes

Let’s see how our choice of step size affects the sampling. “Size 0” isn’t really a step size: that row shows independent samples from the true Beta(6, 11) posterior, for comparison.

set.seed(341)
Tours <- 
  bind_rows(
    metro_bern(5, 15, step_size = 0.02),
    metro_bern(5, 15, step_size = 0.2),
    metro_bern(5, 15, step_size = 0.002),
    tibble(theta = rbeta(1e4, 6, 11), step_size = 0, step = 1:1e4)
  )
gf_dhistogram( ~ theta | step_size ~ ., data = Tours, bins = 100) %>%
  gf_dist("beta", shape1 = 6, shape2 = 11, color = "red")

Sizes 0.2 and 0.02 look much better than 0.002.

Looking at the trace plots, we see that the samples using a step size of 0.02 are still highly auto-correlated (neighboring values are similar to each other). Thus the “effective sample size” is not nearly as big as the number of posterior samples we have generated.

With a step size of 0.2, the amount of auto-correlation is substantially reduced, but not eliminated.

gf_line(theta ~ step, data = Tours) %>%
  gf_hline(yintercept = ~ 5/15, color = "red") %>% 
  gf_facet_grid(step_size ~ .) %>%
  gf_lims(x = c(0,1000))
## Warning: Removed 9000 row(s) containing missing values (geom_path).

Effective Sample Size

We are going to talk about effective sample size a lot as we start using MCMC algorithms to fit our models. So it is important to understand what this phrase means:

A sample has effective sample size \(s\) if it is as useful for a particular purpose as an independent random sample of size \(s\) would be for that purpose.

“For a particular purpose” is an important phrase, since the same sample can have different effective sample sizes when used for different purposes.

Our purposes will typically involve estimating something related to a posterior distribution: the mean or the upper and lower limits of an HDI, for example. If we knew the posterior distribution, we could sample independently from the posterior and use our posterior samples to make the estimate. If we do this repeatedly, we won’t get the same answer every time – there will be sampling variability. The larger the sample size, the less the sampling variability.

We will say that a posterior sample of size \(n\) has effective sample size \(s\) for a particular estimate if it has the same amount of sampling variability as an independent sample of size \(s\).

We won’t be able to compute effective sample size exactly, but there are ways it can be estimated that give us a good idea about the efficiency of our sampling. These methods are based on looking at auto-correlation in our samples. Auto-correlation is a way of answering this question: are nearby rows of the posterior sample more similar to each other than purely random selections would be?
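
As a quick informal check of that idea, we can compute the correlation between consecutive draws by hand (theta02 is just a name made up for this illustration):

theta02 <- Tours %>% filter(step_size == 0.2) %>% pull(theta)
cor(theta02[-length(theta02)], theta02[-1])  # lag-1 auto-correlation, roughly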

Auto-correlation

We can investigate auto-correlation using an auto-correlation (or ACF) plot.

# base graphics, rather space inefficient, and incompatible with our usual gg plots
acf(Tours %>% filter(step_size == 0.2) %>% pull(theta))

# gg version in CalvinBayes
acf_plot(Tours %>% filter(step_size == 0.2) %>% pull(theta)) %>%
  gf_labs(title = "An Autocorrelation plot -- step size = 0.2")

By way of comparison, here is an autocorrelation plot for some independent random data.

acf_plot(rnorm(1E3))

So what are these plots? Let’s take a look at a small example.

set.seed(12345)
T <- tibble(
  x = sample(1:10, 10),
  x1 = lag(x, 1),
  x2 = lag(x, 2),
  x3 = lag(x, 3)
)
T %>% pander::pander()
  x   x1   x2   x3
--- ---- ---- ----
  3   NA   NA   NA
  8    3   NA   NA
  2    8    3   NA
  5    2    8    3
 10    5    2    8
  9   10    5    2
  7    9   10    5
  6    7    9   10
  4    6    7    9
  1    4    6    7
cor(x ~ x,  data = T)
## [1] 1
cor(x ~ x1, data = T, use = "complete.obs")
## [1] 0.1633743
cor(x ~ x2, data = T, use = "complete.obs")
## [1] -0.09626219
cor(x ~ x3, data = T, use = "complete.obs")
## [1] -0.2333364
acf_plot(T$x)

Each bar in the acf plot represents the correlation coefficient of the input vector with a lagged version of itself. That is, we shift all the values over by some number of positions (the lag) and compute the correlation coefficient. The first bar (lag 0) always has value 1 because a vector is perfectly correlated with itself. Lag 1 tells us about adjacent values, lag 2 about values 2 apart, and so on. (The values acf() reports differ slightly from the pairwise correlations above because acf() uses the mean and variance of the full series in its estimate.) In an independent random sample, all but the first of these will be fairly close to zero – just a bit of random noise. The horizontal guide lines are a crude upper bound on how large most of these (95%) would be if we had truly independent sampling and the correlations were just due to randomness. So things are good when all or most of the bars are within or near the horizontal guide lines.

Thinning

One way to reduce the auto-correlation in posterior samples is by thinning.

acf_plot(Tours %>% filter(step_size == 0.2) %>% pull(theta)) %>%
  gf_labs(title = "An Autocorrelation plot -- step size = 0.2")

This auto-correlation plot suggests that keeping every 7th or 8th value when step_size is 0.2 should give us a posterior sample that is nearly independent.

acf_plot(Tours %>% filter(step_size == 0.2, step %% 8 == 0) %>% pull(theta)) %>%
  gf_labs(title = "An Autocorrelation plot -- step size = 0.2; 1/8 thinning")

gf_line(theta ~ step, data = Tours %>% filter(step %% 8 == 0)) %>%
  gf_hline(yintercept = ~ 5/15, color = "red") %>% 
  gf_facet_grid(step_size ~ .) %>%
  gf_lims(x = c(0, 3000)) %>% 
  gf_labs(
    title = "Comparing different step sizes with 1/8 thinning"
  )
## Warning: Removed 875 row(s) containing missing values (geom_path).

Now the idealized trace plot and the trace plot for step size 0.2 look quite similar. So the effective sample size for that step size is approximately 1/8 the number of posterior samples generated.

For step_size = 0.02 we would need to thin even more, since there is more auto-correlation. This would require generating many more posterior samples to begin with:

Tours %>% 
  filter(step_size == 0.02, step > 1000) %>% 
  pull(theta) %>%
  acf_plot(lag.max = 100)

Tours %>% 
  filter(step_size == 0.02, step > 1000) %>% 
  pull(theta) %>%
  acf_plot(lag.max = 1000)

Important notes

  1. We don’t need to actually thin the posterior samples to use them – we just need to understand that in effect our sample is not as large as it appears.

  2. The effective sample size depends on what we are computing. It can be different for each posterior mean, and different again for each posterior HDI boundary. The effective sample size for HDI boundaries is usually lower than for means because the tails of a distribution are less stable than the center of the distribution (we sample fewer points in the tails).

  3. Software will compute an estimated effective sample size for us. We mainly want to check that this isn’t very small. A small effective sample size indicates inefficient sampling of the posterior and estimates that won’t be very precise.

  4. We would generally like our effective sample size to be in the 100s if not 1000s.

  • Smaller than that, and our results won’t be stable: if we rerun with a new random seed, we might get quite different results. This is not variability in the model; it is variability from the algorithm itself, and we’d like to reduce it as much as we reasonably can.

Estimating effective sample size

Here is a crude estimate of effective sample size to give you some idea how it works. Let’s let

\[ ESS = \frac{\mbox{sample size}}{1 + 2 \sum_k r_k} \] where \(r_k\) is the lag-\(k\) auto-correlation. (A code sketch after the list below shows one way to compute this.)

Some good things about this measure:

  • The sum runs over lags \(k \ge 1\); the lag-0 auto-correlation is always 1. In a perfect world, the rest would be 0, so we would get \(\frac{n}{1} = n\).

  • Generally, if there is more auto-correlation, the denominator will be larger and ESS will be smaller.

  • This measure is based on estimating the mean of a normal distribution from a correlated sample. It will work less well if the distribution isn’t normal or if you are interested in something other than the mean.

  • Soon we’ll see a couple fancier versions of this (computed in software for us, so we don’t need to know all the details).
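
Here is a minimal sketch of this crude estimate in R. The truncation rule (stop summing when the estimated auto-correlations first go negative) and the name crude_ess() are choices made for this illustration, not taken from any package.

crude_ess <- function(x, lag.max = 100) {
  r <- acf(x, lag.max = lag.max, plot = FALSE)$acf[-1]  # drop the lag-0 term
  first_neg <- which(r < 0)[1]        # truncate at the first negative estimate
  if (!is.na(first_neg)) r <- r[seq_len(first_neg - 1)]
  length(x) / (1 + 2 * sum(r))
}
crude_ess(Tours %>% filter(step_size == 0.2) %>% pull(theta))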

Example 2: Estimating mean and variance

Consider the following simple model:

  • \(Y_i \sim {\sf Norm}(\mu, \sigma)\)
  • \(\mu \sim {\sf Norm}(0, 1)\)
  • \(\log(\sigma) \sim {\sf Norm}(0,1)\)

In this case the posterior distribution for \(\mu\) given \(\sigma\) can be worked out exactly and is normal, so we again have a reference point for checking the algorithm.

Let’s code up our Metropolis algorithm for this situation. New stuff:

  • we have two parameters, so we’ll use separate jumps for each and combine
    • we could use a jump rule based on both values together, but we’ll keep this simple
  • the state space for each parameter is infinite, so we need a new kind of jump rule
    • instead of sampling from a finite state space, we use rnorm()
    • the standard deviation controls how large a step we take (on average)
    • example below uses same standard deviation for both parameters, but we should select them individually if the parameters are on different scales
metro_norm <- function(
  y,                # data vector
  num_steps = 1e5,  
  step_size = 1,         # sd's of jump distributions
  start = list(mu = 0, log_sigma = 0)
  ) {
 
  step_size <- rep(step_size, 2)[1:2]        # make sure exactly two values
  mu        <- rep(NA, num_steps)  # trick to pre-allocate memory
  log_sigma <- rep(NA, num_steps)  # trick to pre-allocate memory
  move      <- rep(NA, num_steps)  # trick to pre-allocate memory
  mu[1] <- start$mu
  log_sigma[1] <- start$log_sigma
  move[1] <- TRUE
 
  for (i in 1:(num_steps - 1)) {
    # head to new "island"
    mu[i + 1]        <- rnorm(1, mu[i], step_size[1])
    log_sigma[i + 1] <- rnorm(1, log_sigma[i], step_size[2])
    move[i + 1] <- TRUE
    
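    # work with log posteriors (sums of logs) to avoid numerical underflow
    # when many small densities are multiplied together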
    log_post_current <- 
      dnorm(mu[i], 0, 1, log = TRUE) + 
      dnorm(log_sigma[i], 0, 1, log = TRUE) + 
      sum(dnorm(y, mu[i], exp(log_sigma[i]), log = TRUE))
    log_post_proposal <- 
      dnorm(mu[i + 1], 0, 1, log = TRUE) + 
      dnorm(log_sigma[i + 1], 0, 1, log = TRUE) + 
      sum(dnorm(y, mu[i + 1], exp(log_sigma[i+1]), log = TRUE))
    prob_move <- exp(log_post_proposal - log_post_current)
    
    # sometimes we "sail back"
    if (runif(1) > prob_move) {  
      move[i + 1] <- FALSE
      mu[i + 1] <- mu[i]
      log_sigma[i + 1] <- log_sigma[i]
    } 
    
  }
  tibble(
    step = 1:num_steps,
    mu = mu,
    log_sigma = log_sigma,
    move = move,
    step_size = paste(step_size, collapse = ", ")
  )
}

Let’s use the algorithm with three different step size values and compare results.

set.seed(341)
y <- rnorm(25, 1, 2)  # sample of 25 from Norm(1, 2)
Tour5    <- metro_norm(y = y, num_steps = 5000, step_size = 5)
Tour1    <- metro_norm(y = y, num_steps = 5000, step_size = 1)
Tour0.1  <- metro_norm(y = y, num_steps = 5000, step_size = 0.1)
Tour0.01 <- metro_norm(y = y, num_steps = 5000, step_size = 0.01)
Norm_Tours <-  bind_rows(Tour5, Tour1, Tour0.1, Tour0.01)
df_stats(~ move | step_size, data = Norm_Tours, props)
##   response  step_size prop_FALSE prop_TRUE
## 1     move 0.01, 0.01     0.0352    0.9648
## 2     move   0.1, 0.1     0.2564    0.7436
## 3     move       1, 1     0.9064    0.0936
## 4     move       5, 5     0.9960    0.0040
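
Notice how strongly step size affects how often proposals are accepted: large steps are usually rejected, while tiny steps are almost always accepted but explore the posterior very slowly. (For random-walk Metropolis samplers, acceptance rates somewhere in the 20–50% range are often suggested as a rough target.)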

Density plots

gf_dhistogram( ~ mu | step_size ~ ., data = Norm_Tours, bins = 100) 

gf_dhistogram( ~ exp(log_sigma) | step_size ~ ., data = Norm_Tours, bins = 100)

Trace plots

gf_line(mu ~ step | step_size ~ ., data = Norm_Tours) 

gf_line(log_sigma ~ step | step_size ~ ., data = Norm_Tours)

ACF plots

Norm_Tours %>% filter(step_size == "5, 5") %>% pull(mu) %>% 
  acf_plot(lag.max = 100) /
Norm_Tours %>% filter(step_size == "1, 1") %>% pull(mu) %>% 
  acf_plot(lag.max = 100) /
Norm_Tours %>% filter(step_size == "0.1, 0.1") %>% pull(mu) %>% 
  acf_plot(lag.max = 100) /
Norm_Tours %>% filter(step_size == "0.01, 0.01") %>% pull(mu) %>% 
  acf_plot(lag.max = 100) 

Comparing Multiple Chains

If we run multiple chains with different starting points and different random choices, we hope to see similar trace plots. After all, we don’t want our analysis to be an analysis of starting points or of random choices.

Tour1a <- metro_norm(y = y, num_steps = 5000, step_size = 1) %>% mutate(chain = "A")
Tour1b <- metro_norm(y = y, num_steps = 5000, step_size = 1) %>% mutate(chain = "B")
Tour1c <- metro_norm(y = y, num_steps = 5000, step_size = 1, start = list(mu = 10, log_sigma = 5)) %>% mutate(chain = "C")
Tour1d <- metro_norm(y = y, num_steps = 5000, step_size = 1, start = list(mu = 10, log_sigma = 5)) %>% mutate(chain = "D")
Tours1 <- bind_rows(Tour1a, Tour1b, Tour1c, Tour1d)
gf_line(mu ~ step, color = ~chain, alpha = 0.5, data = Tours1)

gf_line(mu ~ step, color = ~chain, alpha = 0.5, data = Tours1) %>% 
  gf_facet_grid( chain ~ .)

Comparing Chains to an Ideal Chain

Not all posteriors are normal, but here’s what a chain would look like if the posterior is normal and there is no correlation between draws.

Ideal <- tibble(step = 1:5000, mu = rnorm(5000, 1, .3), step_size = "ideal")
gf_line(mu ~ step | step_size, data = Norm_Tours %>% bind_rows(Ideal))

If the draws are correlated, then we might get more ideal behavior if we selected only a subset – every 20th or every 30th value, for example. This is the idea behind “effective sample size”. The effective sample size of a correlated chain is the length of an ideal chain that contains as much independent information as the correlated chain.

acf_plot(Norm_Tours %>% filter(step_size == "1, 1") %>% pull(mu))

gf_line(mu ~ step | step_size ~ ., data = bind_rows(Norm_Tours, Ideal) %>% filter(step %% 20 == 0)) %>%
  gf_labs(title = "Every 20th draw")

gf_line(mu ~ step | step_size ~ ., data = bind_rows(Norm_Tours, Ideal) %>% filter(step %% 30 == 0)) %>%
  gf_labs(title = "Every 30th draw")

If we thin to every 20th value, our chains with step_size = 1 and step_size = 0.1 look quite similar to the ideal chain. The chain with step_size = 0.01 still moves too slowly through the parameter space. So tuning parameters will affect the effective sample size.

Discarding the first portion of a chain

For some MCMC algorithms, the first portion of a chain may not work as well. For example, the chain may spend a long time in a region of low posterior density as it moves toward higher density regions, and thus inflate the apparent probability of that region. For algorithms with this issue, this initial portion (often called “burn-in” or “warm-up”) is typically removed from the analysis, since it reflects the starting values used more than the long-run sampling from the posterior. The basic Metropolis algorithm is susceptible to this issue. (The Hamiltonian Monte Carlo method that Stan uses does not have this issue in the same way, but there is still a warm-up phase that is not part of the samples used to approximate the posterior.)
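
For example, we might simply drop an initial chunk of each chain before summarizing; the cutoff of 500 steps below is an arbitrary choice for illustration.

Posterior <- Norm_Tours %>% filter(step > 500)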

gf_line(log_sigma ~ step | step_size ~ ., data = Norm_Tours) 

gf_path(log_sigma ~ mu | step_size ~ (step > 500), data = Norm_Tours, alpha = 0.5) %>%
  gf_density2d(log_sigma ~ mu, data = Norm_Tours) 
## Warning: stat_contour(): Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf

gf_histogram( ~ mu | step_size ~ (step > 500) , data = Norm_Tours, binwidth = 0.1)

gf_histogram( ~ log_sigma | step_size ~ (step > 500), data = Norm_Tours, binwidth = 0.1)

Issues with Metropolis Algorithm

Similar issues arise in many MCMC algorithms – fancier versions attempt to minimize the impact of these issues.

  • First portion of a chain might not be very good, so we need to discard it
  • Tuning can affect performance – how do we tune?
  • Samples are correlated – although the long-run probabilities are right, the next stop is not independent of the current one
    • so our effective posterior sample size isn’t as big as it appears

Two improvements over basic Metropolis

Gibbs Sampling

Gibbs sampling modifies the Metropolis algorithm by changing how the proposal is selected. It only works in situations where the conditional posterior distribution of each parameter, given the others, is known (as is the case when we use conjugate priors).

The idea:

  • At each step, pick one parameter to change.

  • Select a proposal for that parameter by randomly sampling from its conditional posterior, given the current values of all the other parameters.

What’s good about this:

  • We are using information about the posterior to help us find good proposals.
  • Works well (much better than naive Metropolis) in many situations.
  • Has been coded up into tools like JAGS (Just Another Gibbs Sampler). Wrappers exist for R, Python, etc.

What’s bad about this:

  • Restricts the priors we can use.
  • Doesn’t handle correlated parameters very well.
  • Hard to diagnose whether it is working well or poorly.

Hamiltonian Monte Carlo (HMC)

The idea:

  • Consider the negative log posterior as a surface.
  • Flick a frictionless puck from the current location in a random direction with a random force and let it go for a random amount of time.
  • Simulate the motion of the puck using physics, breaking the path into little pieces and using the value of the log-posterior and its gradient (partial derivatives in each direction).
  • Proposal is where the puck is when time runs out.
  • “Never” reject a proposal.
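
Here is a minimal sketch of one HMC proposal step for a single parameter, to make the story concrete. It assumes hypothetical functions log_post() and grad_log_post() for the log-posterior and its derivative, and uses a fixed step size and number of leapfrog steps rather than randomizing them as a real implementation would.

hmc_proposal <- function(theta, log_post, grad_log_post,
                         step = 0.1, n_steps = 20) {
  p <- rnorm(1)                       # the random "flick": a momentum
  theta_new <- theta
  # leapfrog integration: alternate half/full steps for momentum and position
  p_new <- p + 0.5 * step * grad_log_post(theta_new)
  for (s in 1:n_steps) {
    theta_new <- theta_new + step * p_new
    if (s < n_steps) p_new <- p_new + step * grad_log_post(theta_new)
  }
  p_new <- p_new + 0.5 * step * grad_log_post(theta_new)
  # accept/reject guards against numerical error in the simulation;
  # with a perfect simulation, "energy" is conserved and we always accept
  log_accept <- (log_post(theta_new) - 0.5 * p_new^2) -
                (log_post(theta)     - 0.5 * p^2)
  if (log(runif(1)) < log_accept) theta_new else theta
}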

The good:

  • Rarely reject a proposal, improving efficiency (see the bad for why “rarely” and not “never”).

  • Handles wider range of priors and likelihoods.

  • Does better (than Gibbs) with correlated parameters (but things still work better when there is less correlation).

  • Easier to diagnose common problems, so we (often) know when the algorithm is not working well and can try something else.

  • Coded in Stan. Wrappers exist for R, Python, etc.

The bad:

  • Each proposal takes longer to compute (we have to simulate physics).

  • Lots of tuning parameters: how hard to flick, what direction, how long to let the puck slide, how many little pieces to break the path into, etc.

  • The physics simulation doesn’t always work out numerically stably, so we do actually reject sometimes. (Conservation of energy can be checked, and if it doesn’t hold well enough, we reject the proposal.)

This is the method we will be migrating to as our models become more complicated.

Another Example: Two coins

The model

Suppose we have two coins and want to compare the proportion of heads each coin generates.

  • Parameters: \(\theta_1\), \(\theta_2\)
  • Data: \(N_1\) flips of coin 1 and \(N_2\) flips of coin 2. (Results: \(z_1\) and \(z_2\) heads, respectively.)
  • Likelihood: independent Bernoulli (each flip independent of the other flips, each coin independent of the other coin)
  • Prior: Independent Beta distributions for each \(\theta_i\)

That is,

  • \(y_{1i} \sim {\sf Bern}(\theta_1), y_{2i} \sim {\sf Bern}(\theta_2)\)
  • \(\theta_1 \sim {\sf Beta(a_1, b_1)}, \theta_2 \sim {\sf Beta(a_2, b_2)}\) (independent)

For the examples below we will use data showing that 6 of 8 tosses of the first coin were heads but only 2 of 7 tosses of the second coin. The methods work equally well with other data sets. While comparing a small number of coin tosses like this is not so interesting, the same approach can be used for a wide range of practical questions, like testing whether two treatments for a condition are equally effective.

Exact analysis

From this we can work out the posterior as usual. (Pay attention to the important parts of the kernel, that is, the numerators.)

\[\begin{align*} p(\theta_1, \theta_2 \mid D) &= p(D \mid \theta_1, \theta_2) \, p(\theta_1, \theta_2) / p(D) \\[3mm] &= \theta_1^{z_1} (1 - \theta_1)^{N_1 - z_1} \theta_2^{z_2} (1 - \theta_2)^{N_2 - z_2} \, p(\theta_1, \theta_2) / p(D) \\[3mm] &= \frac{ \theta_1^{z_1} (1 - \theta_1)^{N_1 - z_1} \theta_2^{z_2} (1 - \theta_2)^{N_2 - z_2} \theta_1^{a_1 - 1}(1 - \theta_1)^{b_1 - 1} \theta_2^{a_2 - 1}(1 - \theta_2)^{b_2 - 1}} {p(D) B(a_1, b_1) B(a_2, b_2)} \\[3mm] &= \frac{ \theta_1^{z_1 + a_1 - 1}(1 - \theta_1)^{N_1 - z_1 + b_1 - 1} \theta_2^{z_2 + a_2 - 1}(1 - \theta_2)^{N_2 - z_2 + b_2 - 1} }{p(D) B(a_1, b_1) B(a_2, b_2)} \end{align*}\]

So the posterior distribution of \(\langle \theta_1, \theta_2 \rangle\) is a product of two independent Beta distributions: \({\sf Beta}(z_1 + a_1, N_1 - z_1 + b_1)\) and \({\sf Beta}(z_2 + a_2, N_2 - z_2 + b_2)\).

Some nice images of these distributions appear on page 167.
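
With the data used below (6 heads in 8 tosses and 2 heads in 7 tosses) and Beta(2, 2) priors, the exact posteriors work out to Beta(8, 4) and Beta(4, 7). As a quick sketch, we could plot one of them for later comparison:

gf_dist("beta", shape1 = 6 + 2, shape2 = 8 - 6 + 2)  # exact posterior for theta1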

Metropolis

metro_2coins <- function(
  z1, n1,              # z = successes, n = trials
  z2, n2,              # z = successes, n = trials
  step_size  = c(0.1, 0.1), # sds of jump distribution
  start = c(0.5, 0.5), # value of thetas to start at
  num_steps = 5e4,     # number of steps to run the algorithm
  prior1 = dbeta,      # function describing prior
  prior2 = dbeta,      # function describing prior
  args1 = list(),      # additional args for prior1
  args2 = list()      # additional args for prior2
  ) {
  
  theta1            <- rep(NA, num_steps)  # trick to pre-allocate memory
  theta2            <- rep(NA, num_steps)  # trick to pre-allocate memory
  proposed_theta1   <- rep(NA, num_steps)  # trick to pre-allocate memory
  proposed_theta2   <- rep(NA, num_steps)  # trick to pre-allocate memory
  move              <- rep(NA, num_steps)  # trick to pre-allocate memory
  theta1[1]         <- start[1]
  theta2[1]         <- start[2]

  step_size1 <- step_size[1] 
  step_size2 <- step_size[2] 
  
  for (i in 1:(num_steps-1)) {
    # head to new "island"
    proposed_theta1[i + 1] <- rnorm(1, theta1[i], step_size1)
    proposed_theta2[i + 1] <- rnorm(1, theta2[i], step_size2)
    
    if (proposed_theta1[i + 1] <= 0 ||
        proposed_theta1[i + 1] >= 1 ||
        proposed_theta2[i + 1] <= 0 ||
        proposed_theta2[i + 1] >= 1) {
      prob_move <- 0             # because prior is 0 there
    } else {
      current_prior <- 
        do.call(prior1, c(list(theta1[i]), args1)) *
        do.call(prior2, c(list(theta2[i]), args2))
      current_likelihood  <- 
        dbinom(z1, n1, theta1[i]) *
        dbinom(z2, n2, theta2[i])
      current_posterior   <- current_prior * current_likelihood
      
      proposed_prior <- 
        do.call(prior1, c(list(proposed_theta1[i+1]), args1)) *
        do.call(prior2, c(list(proposed_theta2[i+1]), args2))
      proposed_likelihood  <- 
        dbinom(z1, n1, proposed_theta1[i+1]) *
        dbinom(z2, n2, proposed_theta2[i+1])
      proposed_posterior   <- proposed_prior * proposed_likelihood
      prob_move            <- proposed_posterior / current_posterior
    }
    
    # sometimes we "sail back"
    if (runif(1) > prob_move) { # reject: sail back to the current island
      move[i + 1]   <- FALSE
      theta1[i + 1] <- theta1[i]
      theta2[i + 1] <- theta2[i]
    } else {                    # accept: move to the proposed island
      move[i + 1]   <- TRUE
      theta1[i + 1] <- proposed_theta1[i + 1]
      theta2[i + 1] <- proposed_theta2[i + 1]
    }
  }
  
  tibble(
    step = 1:num_steps, 
    theta1 = theta1,
    theta2 = theta2,
    proposed_theta1 = proposed_theta1,
    proposed_theta2 = proposed_theta2,
    move = move, 
    step_size1 = step_size1,
    step_size2 = step_size2
  )
}
Metro_2coinsA <-
  metro_2coins(
    z1 = 6, n1 = 8,
    z2 = 2, n2 = 7,
    step_size = c(0.02, 0.02),
    args1 = list(shape1 = 2, shape2 = 2), args2 = list(shape1 = 2, shape2 = 2)  
  )

Metro_2coinsA %>%
  gf_density2d(theta2 ~ theta1)

Metro_2coinsA %>%
  gf_density(~ (theta2 - theta1))

# effective sample size is much smaller than apparent sample size due to auto-correlation
acf(Metro_2coinsA$theta2 - Metro_2coinsA$theta1)

Metro_2coinsA %>% filter(step < 500) %>%
  gf_path(theta2 ~ theta1, color = ~ step, alpha = 0.5) %>%
  gf_point(theta2 ~ theta1, color = ~ step, alpha = 0.5) 

Metro_2coinsB <-
  metro_2coins(
    z1 = 6, n1 = 8,
    z2 = 2, n2 = 7,
    step_size = c(0.2, 0.2),
    args1 = list(shape1 = 2, shape2 = 2), args2 = list(shape1 = 2, shape2 = 2)  
  )

Metro_2coinsB %>%
  gf_density2d(theta2 ~ theta1)

Metro_2coinsB %>%
  gf_density(~ (theta2 - theta1))

# effective sample size is better but still quite a bit
# smaller than apparent sample size due to auto-correlation
acf(Metro_2coinsB$theta2 - Metro_2coinsB$theta1)

Metro_2coinsB %>% filter(step < 500) %>%
  gf_path(theta2 ~ theta1, color = ~ step, alpha = 0.5) %>%
  gf_point(theta2 ~ theta1, color = ~ step, alpha = 0.5) 

Gibbs sampling

Gibbs sampling provides an attempt to improve on the efficiency of the standard Metropolis algorithm by using a different method to propose new parameter values. The idea is this:

  • Pick one of the parameter values: \(\theta_i\)

  • Determine the conditional posterior distribution of \(\theta_i\), given the current values of the other parameters \(\{\theta_j \mid j \neq i\}\)

    • This won’t be exactly right, because those values are not exactly right, but it should be good when the parameter estimates are close to correct.
  • Sample from this conditional posterior to get a new proposed value for \(\theta_i\).

    • Always accept this proposal, since it is drawn from a distribution that already makes more plausible values more likely to be proposed.
  • Keep cycling through all the parameters, each time updating one using the current estimates for the others.

It takes some work to show that this also converges, and in practice it is often more efficient than the basic Metropolis algorithm. But it is limited to situations where these conditional posteriors can be sampled from.

In our example, we can work out the conditional posterior fairly easily:

\[\begin{align*} p(\theta_1 \mid \theta_2, D) &= p(\theta_1, \theta_2 \mid D) / p(\theta_2 \mid D) \\ &= \frac{p(\theta_1, \theta_2 \mid D)}{\int p(\theta_1, \theta_2 \mid D) \; d\theta_1} \\ &= \frac{\mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 - z_1 + b_1) \cdot \mathrm{dbeta}(\theta_2, z_2 + a_2, N_2 - z_2 + b_2)} {\int \mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 - z_1 + b_1) \cdot \mathrm{dbeta}(\theta_2, z_2 + a_2, N_2 - z_2 + b_2) \; d\theta_1} \\ &= \frac{\mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 - z_1 + b_1) \cdot \mathrm{dbeta}(\theta_2, z_2 + a_2, N_2 - z_2 + b_2)} {\mathrm{dbeta}(\theta_2, z_2 + a_2, N_2 - z_2 + b_2) \int \mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 - z_1 + b_1) \; d\theta_1} \\ &= \frac{\mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 - z_1 + b_1)} {\int \mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 - z_1 + b_1) \; d\theta_1} \\ &= \mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 - z_1 + b_1) \end{align*}\]

In other words, \(p(\theta_1 \mid \theta_2, D) = p(\theta_1 \mid D)\), which isn’t surprising since we already know that the posterior distributions of \(\theta_1\) and \(\theta_2\) are independent. (In more complicated models, the conditional posterior of one parameter may depend on all of the others, so the calculation above might not be so simple.) Of course, the analogous statement holds for \(\theta_2\) in this model.

This example shows off Gibbs sampling in a situation where it shines: when the posterior distribution is a product of independent marginals.

gibbs_2coins <- function(
  z1, n1,              # z = successes, n = trials
  z2, n2,              # z = successes, n = trials
  start = c(0.5, 0.5), # value of thetas to start at
  num_steps = 1e4,     # number of steps to run the algorithm
  a1, b1,              # params for prior for theta1
  a2, b2               # params for prior for theta2
  ) {
  
  theta1            <- rep(NA, num_steps)  # trick to pre-allocate memory
  theta2            <- rep(NA, num_steps)  # trick to pre-allocate memory
  theta1[1]         <- start[1]
  theta2[1]         <- start[2]
  
  for (i in 1:(num_steps-1)) {
    if (i %% 2 == 1) { # update theta1
      theta1[i+1] <- rbeta(1, z1 + a1, n1 - z1 + b1)
      theta2[i+1] <- theta2[i]
    } else {           # update theta2
      theta1[i+1] <- theta1[i]
      theta2[i+1] <- rbeta(1, z2 + a2, n2 - z2 + b2)
    }
  }
  
  tibble(
    step = 1:num_steps, 
    theta1 = theta1,
    theta2 = theta2
  )
}
Gibbs <-
  gibbs_2coins(z1 = 6, n1 = 8,
               z2 = 2, n2 = 7,
               a1 = 2, b1 = 2, a2 = 2, b2 = 2)  

Gibbs %>% gf_density2d(theta2 ~ theta1)

Gibbs %>% gf_dens( ~ theta1)

acf(Gibbs$theta1)

Gibbs %>% filter(step < 500) %>%
  gf_path(theta2 ~ theta1, color = ~ step, alpha = 0.5) %>%
  gf_point(theta2 ~ theta1, color = ~ step, alpha = 0.5) 

Note: Software like JAGS only records results each time it makes a complete cycle through the parameters. We could do that by keeping every other row:

Gibbs %>% filter(step %% 2 == 0) %>% gf_density2d(theta2 ~ theta1)

Gibbs %>% filter(step %% 2 == 0) %>% gf_density( ~ (theta2 - theta1))

Gibbs %>% filter(step %% 2 == 0) %>% 
  mutate(difference = theta2 - theta1) %>%
  pull(difference) %>%
  acf()

Gibbs %>% filter(step < 500, step %% 2 == 0) %>%
  gf_path(theta2 ~ theta1, color = ~ step, alpha = 0.5) %>%
  gf_point(theta2 ~ theta1, color = ~ step, alpha = 0.5) 

Advantages and Disadvantages of Gibbs Sampling

Advantages
  • No need to tune. The performance of the Metropolis algorithm depends on the jump rule used. Gibbs “auto-tunes” by sampling from the conditional posteriors.
  • Gibbs sampling has the potential to sample much more efficiently than Metropolis. It doesn’t propose updates only to reject them, and doesn’t suffer from poor tuning that can make Metropolis search very slowly through the posterior space.
Disadvantages
  • Less general: we need to be able to sample from the conditional posteriors.
  • Gibbs sampling can be slow if the posterior has highly correlated parameters, since given all but one of them, the algorithm will be very certain about the remaining one and adjust it only a very small amount. It is hard for Gibbs samplers to move quickly along “diagonal ridges” in the posterior.

The good news is that other people have done the hard work of coding up Gibbs sampling for us; we just need to learn how to describe the model in a way that works with their code. We will be using JAGS (Just Another Gibbs Sampler) and later Stan (which generalizes the Metropolis algorithm in a different way) to fit more interesting models that would be too time consuming to custom code on our own. The examples in this chapter are to help us understand how these algorithms work so we can interpret the results we obtain when using them.

So what do we learn about the coins?

We have seen that we can fit this model a number of different ways now, but we haven’t actually looked at the results. Are the coins different? How much different?

Since the Gibbs sampler is more efficient than the Metropolis algorithm for this situation, we’ll use the posterior samples from the Gibbs sampler to answer. We are interested in \(\theta_2 - \theta_1\), the difference in the biases of the two coins. To learn about that, we simply investigate the distribution of that quantity in our posterior samples. (Note: You cannot learn about this difference by only considering \(\theta_1\) and \(\theta_2\) separately.)

gf_dhistogram(~(theta2 - theta1), data = Gibbs) %>%
  gf_dens()

hdi(Gibbs %>% mutate(difference = theta2 - theta1), pars = "difference")
##          par         lo         hi       mode prob
## 1 difference -0.6627973 0.07687787 -0.3758359 0.95
mosaic::prop(~(theta2 - theta1 > 0), data = Gibbs)
## prop_TRUE 
##    0.0643

We don’t have much data, so the posterior distribution of the difference in biases is pretty wide. 0 is within the 95% HDI for \(\theta_2 - \theta_1\), so we don’t have compelling evidence that the two biases are different based on this small data set.

MCMC posterior sampling: Big picture

MCMC = Markov chain Monte Carlo

  • Markov chain because the process is a Markov process with probabilities of transitioning from one state to another
  • Monte Carlo because it involves randomness. (Monte Carlo is a famous casino location.)

We will refer to one random walk starting from a particular starting value as a chain.

Posterior sampling: Random walk through the posterior

Our goal, whether we use Metropolis, Gibbs sampling, Stan, or some other MCMC posterior sampling method, is to generate random values from the posterior distribution. Ideally these samples should be

  • representative of the posterior and not of other artifacts like the starting location of the chain, or tuning parameters used to generate the walk.

  • accurate to the posterior. If we run the algorithm multiple times, we would like the results to be similar from run to run. (They won’t match exactly, but if they are all giving good approximations to the same thing, then they should all be close to each other.)

  • efficient. The theory says that all of these methods converge to the correct posterior distribution, but in practice, we can only do a finite run. If the sampling is not efficient enough, our finite run might give us a distorted picture (that would eventually have been corrected had we run things long enough).

If we are convinced that the posterior sampling is of high enough quality, we can use the posterior samples to answer all sorts of questions relatively easily.

Where do we go from here?

  1. Learn to design more interesting models to answer more interesting questions.

  2. Learn to describe these models and hand them to Stan (or some other MCMC algorithm) for posterior sampling.

  3. Learn to diagnose posterior samples to detect potential problems with the posterior sampling.

     • We don’t want to interpret our results unless the algorithm has worked well enough that its output can be trusted.

  4. Learn how to interpret the results.