Let’s let the computer simulate this algorithm. And since the computer is doing all the work, let’s make a general function so we can experiment a bit.
KingMarkov <- function(
num_steps = 1e5,
population = 1:5,
island_names = 1:length(population),
start = 1,
J = function(a, b) {1 / (length(population) - 1)}
) {
num_islands <- length(population)
island_seq <- rep(NA, num_steps) # trick to pre-allocate memory
proposal_seq <- rep(NA, num_steps) # trick to pre-allocate memory
current <- start
proposal <- NA
for (i in 1:num_steps) {
# record current island
island_seq[i] <- current
proposal_seq[i] <- proposal
# propose one of the other islands
other_islands <- setdiff(1:num_islands, current)
proposal <-
sample(other_islands, 1,
prob = purrr::map(other_islands, ~ J(current, .x)))
# move? (assumes J() is symmetric)
prob_move <- population[proposal] / population[current]
# new current island (either current current or proposal)
current <- ifelse(runif(1) < prob_move, proposal, current)
}
tibble(
step = 1:num_steps,
island = island_names[island_seq],
proposal = island_names[proposal_seq]
)
}
Question 1: Look at the code above and answer the following.
What are the default populations of the islands?
What is the default jump rule?
Explain what each of the following bits of code are doing:
other_islands <- setdiff(1:num_islands, current)
prob = purrr::map(other_islands, ~ J(current, .x))
sample()
prob_move <- population[proposal] / population[current]
current <- ifelse(runif(1) < prob_move, proposal, current)
tibble()
Now let’s simulate the first 5000 nights of King Markov’s reign.
Tour <- KingMarkov(5000)
Target <- tibble(island = 1:5, prop = (1:5)/ sum(1:5))
gf_line(island ~ step, data = Tour %>% filter(step <= 200)) %>%
gf_point(proposal ~ step, data = Tour %>% filter(step <= 200),
color = "red", alpha = 0.4) %>%
gf_refine(scale_y_continuous(breaks = 1:10))
## Warning: Removed 1 rows containing missing values (geom_point).
gf_dhistogram( ~ island, data = Tour, binwidth = 1, color = "black") %>%
gf_point(prop ~ island, data = Target, color = "red") %>%
gf_refine(scale_x_continuous(breaks = 1:10))
Question 2: Look at the first plot. It shows where the king stayed each of the first 200 nights.
Did the king ever stay two consecutive nights on the same island? (How can you tell from the plot?)
Did the king ever stay two consecutive nights on the smallest island? (How can you answer this without looking at the plot?)
What if we only allow jumping to neighboring islands? (Imagine the islands are arranged in a circle and we only sail clockwise or counterclockwise around the circle to the nearest island in one direction or the other.)
neighbor <- function(a, b) as.numeric(abs(a-b) %in% c(1,4))
Tour <- KingMarkov(10000, population = 1:5, J = neighbor)
Target <- tibble(island = 1:5, prop = (1:5)/ sum(1:5))
gf_line(island ~ step, data = Tour %>% filter(step <= 200)) %>%
gf_refine(scale_y_continuous(breaks = 1:10))
gf_dhistogram( ~ island, data = Tour, binwidth = 1) %>%
gf_point(prop ~ island, data = Target, color = "red") %>%
gf_refine(scale_x_continuous(breaks = 1:10))
The effect of visiting only neighbors is more dramatic with more islands.
neighbor <- function(a, b) as.numeric(abs(a-b) %in% c(1,9))
Tour1 <- KingMarkov(10000, population = 1:10)
Tour2 <- KingMarkov(10000, population = 1:10, J = neighbor)
Target <- tibble(island = 1:10, prop = (1:10)/ sum(1:10))
gf_line(island ~ step, data = Tour1 %>% filter(step <= 200)) %>%
gf_refine(scale_y_continuous(breaks = 1:10)) %>%
gf_labs(title = "Jump to any island") /
gf_line(island ~ step, data = Tour2 %>% filter(step <= 200)) %>%
gf_refine(scale_y_continuous(breaks = 1:10)) %>%
gf_labs(title = "Only jump to a neighbor island")
gf_dhistogram( ~ island, data = Tour1, binwidth = 1) %>%
gf_point(prop ~ island, data = Target, color = "red") %>%
gf_refine(scale_x_continuous(breaks = 1:10)) %>%
gf_labs(title = "Jump to any island") /
gf_dhistogram( ~ island, data = Tour2, binwidth = 1) %>%
gf_point(prop ~ island, data = Target, color = "red") %>%
gf_refine(scale_x_continuous(breaks = 1:10)) %>%
gf_labs(title = "Only jump to a neighbor island")
That was a nice story, and some nice probability theory. But what does it have to do with Bayesian computation? Regular Markov Chains (and some generalizations of them) can be used to sample from a posterior distribution:
\[A = \frac{\mbox{proposal `posterior'}}{\mbox{current `posterior'}}\]
Metropolis-Hastings variation:
Other variations:
To see how this works in practice, let’s consider our familiar model that has a Bernoulli likelihood and a beta prior:
Since we are already familiar with situation, we know that the posterior should be a beta distribution when the prior is a beta distribution. We can use this information to see how well the algorithm works in that situation.
Let’s code up our Metropolis algorithm for this situation. There is a new wrinkle, however: The state space for the parameter is a continuous interval [0,1]. So we need a new kind of jump rule
Instead of sampling from a finite state space, we use rnorm()
The standard deviation of the normal distribution (called step_size
in the code below) controls how large a step we take (on average). This number has nothing to do with the model, it is a tuning parameter of the algorithm.
metro_bern <- function(
x, n, # x = successes, n = trials
step_size = 0.01, # sd of jump distribution
start = 0.5, # value of theta to start at
num_steps = 1e4, # number of steps to run the algorithm
prior = dunif, # function describing prior
... # additional arguments for prior
) {
theta <- rep(NA, num_steps) # trick to pre-allocate memory
proposed_theta <- rep(NA, num_steps) # trick to pre-allocate memory
move <- rep(NA, num_steps) # trick to pre-allocate memory
theta[1] <- start
for (i in 1:(num_steps-1)) {
# head to new "island"
proposed_theta[i + 1] <- rnorm(1, theta[i], step_size)
if (proposed_theta[i + 1] <= 0 ||
proposed_theta[i + 1] >= 1) {
prob_move <- 0 # because prior is 0
} else {
current_prior <- prior(theta[i], ...)
current_likelihood <- dbinom(x, n, theta[i])
current_posterior <- current_prior * current_likelihood
proposed_prior <- prior(proposed_theta[i+1], ...)
proposed_likelihood <- dbinom(x, n, proposed_theta[i+1])
proposed_posterior <- proposed_prior * proposed_likelihood
prob_move <- proposed_posterior / current_posterior
}
# sometimes we "sail back"
if (runif(1) > prob_move) { # sail back
move[i + 1] <- FALSE
theta[i + 1] <- theta[i]
} else { # stay
move[i + 1] <- TRUE
theta[i + 1] <- proposed_theta[i + 1]
}
}
tibble(
step = 1:num_steps,
theta = theta,
proposed_theta = proposed_theta,
move = move, step_size = step_size
)
}
Question 3: What happens if the proposed value for \(\theta\) is not in the interval \([0,1]\)? Why?
Question 4: What do proposed_posterior
and current_posterior
correspond to in the story of King Markov?
Question 5: Notice that we are using the unnormalized posterior. Why don’t we need to normalize the posterior? Why is it important that we don’t have to normalize the posterior?
The purpose of all this was to get samples from the posterior distribution. We can use histograms or density plots to see what our MCMC algorithm shows us for the posterior distribution. When using MCMC algorithms, we won’t typically have ways of knowing the “right answer”. But in this case, we know the posterior is Beta(6, 11), so we can compare our posterior samples to that distribution to see how well things worked.
Let’s try a nice small step size like 0.2%.
set.seed(341)
Tour <- metro_bern(5, 15, step_size = 0.002)
gf_dhistogram(~ theta, data = Tour, bins = 100) %>%
gf_dens(~ theta, data = Tour) %>%
gf_dist("beta", shape1 = 6, shape2 = 11, color = "red")
Hmm. That’s not too good. Let’s see if we can figure out why.
A trace plot shows the “tour of King Markov’s ship” – the sequence of parameter values sampled (in order).
gf_line(theta ~ step, data = Tour) %>%
gf_hline(yintercept = ~ 5/15, color = "red")
Question 6: Why does the trace plot begin at a height of 0.5?
Question 7: What features of this trace plot indicate that our posterior sampling isn’t working well? Would making the step size larger or smaller be more likely to help?
Question 8: Since we know that the posterior distribution is Beta(6, 11), how could we use R to show us what an ideal trace plot would look like?
Let’s see how our choice of step
affects the sampling. Size 0 is sampling from a true Beta(6, 11) distribution.
set.seed(341)
Tours <-
bind_rows(
metro_bern(5, 15, step_size = 0.02),
metro_bern(5, 15, step_size = 0.2),
metro_bern(5, 15, step_size = 0.002),
tibble(theta = rbeta(1e4, 6, 11), step_size = 0, step = 1:1e4)
)
gf_dhistogram( ~ theta | step_size ~ ., data = Tours, bins = 100) %>%
gf_dist("beta", shape1 = 6, shape2 = 11, color = "red")
Sizes 0.2 and 0.02 look much better than 0.002.
Looking at the trace plots, we see that the samples using a step size of 0.02 are still highly auto-correlated (neighboring values are similar to each other). Thus the “effective sample size” is not nearly as big as the number of posterior samples we have generated.
With a step size of 0.2, the amount of auto-correlation is substantially reduced, but not eliminated.
gf_line(theta ~ step, data = Tours) %>%
gf_hline(yintercept = ~ 5/14, color = "red") %>%
gf_facet_grid(step_size ~ .) %>%
gf_lims(x = c(0,1000))
## Warning: Removed 9000 row(s) containing missing values (geom_path).
We are going to talk about effective sample size a lot as we start using MCMC algorithms to fit our models. So it is important to understand what this phrase means:
A sample has effective sample size \(s\) if it as useful for a particular purpose as an independent randome sample of \(s\) would be for that purpose.
For a particular purpose is an important phrase, since the same sample can have different effective sample sizes when used for different purposes.
Our purposes will typically involve estimating something related to a posterior distribution: the mean or the upper and lower limits of an HDI, for example. If we knew the posterior distribuiton, we could sample indpendently from the posterior and use our posterior samples to make the estimate. If we do this repeatedly, we won’t get the same answer every time – there will be sampling variability. The larger the sample size, the less the sampling variability.
We will say that a posterior sample of size \(n\) has effective sample size \(s\) for a particular estimate if it has the same amount of sampling variability as an independent sample of size \(s\).
We won’t be able to compute effective sample size exactly, but there are ways it can be estimated that gives us a good idea about the efficiency of our sampling. These methods are based on looking at auto-correlation in our samples. Auto-correlation is a way of answering this question: Are nearby rows of the posterior sample more similar to each other than purely random selections would be?
We can investigate auto-correlation using an auto-correlation (or ACF) plot.
# base graphics, rather space inefficient, and incompatible with our usual gg plots
acf(Tours %>% filter(step_size == 0.2) %>% pull(theta))
# gg version in CalvinBayes
acf_plot(Tours %>% filter(step_size == 0.2) %>% pull(theta)) %>%
gf_labs(title = "An Autocorrelation plot -- step size = 0.2")
By way of comparison, here is an autocorrelation plot for some independent random data.
acf_plot(rnorm(1E3))
So what are these plots? Let’s take a look at a small example.
set.seed(12345)
T <- tibble(
x = sample(1:10, 10),
x1 = lag(x, 1),
x2 = lag(x, 2),
x3 = lag(x, 3)
)
T %>% pander::pander()
x | x1 | x2 | x3 |
---|---|---|---|
3 | NA | NA | NA |
8 | 3 | NA | NA |
2 | 8 | 3 | NA |
5 | 2 | 8 | 3 |
10 | 5 | 2 | 8 |
9 | 10 | 5 | 2 |
7 | 9 | 10 | 5 |
6 | 7 | 9 | 10 |
4 | 6 | 7 | 9 |
1 | 4 | 6 | 7 |
cor(x ~ x, data = T)
## [1] 1
cor(x ~ x1, data = T, use = "complete.obs")
## [1] 0.1633743
cor(x ~ x2, data = T, use = "complete.obs")
## [1] -0.09626219
cor(x ~ x3, data = T, use = "complete.obs")
## [1] -0.2333364
acf_plot(T$x)
Each bar in the acf plot represents the correlation coefficient of the input vector with a lagged version of itself. That is, we shift all the values over and compute the correlation coefficient. The first bar (lag 0) always has value 1 because there is perfect correlation between a vector and itself. Lag 1 tells us about adjacent values. Lag 2 about values 2 apart, and so on. In a random sample, all but the first of these will be fairly close to zero – just a bit of random noise. The horizontal guide lines are a crude upper bound on how large most of these (95%) would be if we had truly independent sampling and the correlations were just due to randomness. So things are good when all or most of the bars are within or near the horizontal guidelines.
One way to reduce the auto-correlation in posterior samples is by thinning.
acf_plot(Tours %>% filter(step_size == 0.2) %>% pull(theta)) %>%
gf_labs(title = "An Autocorrelation plot -- step size = 0.2")
This auto-correlation plot suggests that keeping every 7th or 8th value when step_size
is 0.2 should give us a posterior sample that is nearly independent.
acf_plot(Tours %>% filter(step_size == 0.2, step %% 8 == 0) %>% pull(theta)) %>%
gf_labs(title = "An Autocorrelation plot -- step size = 0.2; 1/8 thinning")
gf_line(theta ~ step, data = Tours %>% filter(step %% 8 == 0)) %>%
gf_hline(yintercept = ~ 5/14, color = "red") %>%
gf_facet_grid(step_size ~ .) %>%
gf_lims(x = c(0, 3000)) %>%
gf_labs(
title = "Comparing different step sizes with 1/8 thinning"
)
## Warning: Removed 875 row(s) containing missing values (geom_path).
Now the idealized trace plot and the trace plot for step = 0.2
are quite similar looking. So the effective sample size for that step size is approximately 1/8 the number of posterior samples generated.
For step=0.02
we would need to thin even more since there is more auto-correlation here. This would require generating many more posterior samples to begin with:
Tours %>%
filter(step_size == 0.02, step > 1000) %>%
pull(theta) %>%
acf_plot(lag.max = 100)
Tours %>%
filter(step_size == 0.02, step > 1000) %>%
pull(theta) %>%
acf_plot(lag.max = 1000)
Important notes
We don’t need to actually thin the posterior samples to use them – we just need to understand that in effect our sample is not as large as it appears.
The effective sample size depends on what we are computing. It can be different for each posterior mean, and different again for each posterior HDI boundary. The effective sample size for HDI boundaries is usually lower than for means because the tails of a distribution are less stable than center of the distribution (because we sample fewer data points in the tails).
Software will compute an estimated effective sample size for us. We mainly want to check that this isn’t very small. A small effective sample size indicates inefficient sampling of the posterior and estimates that won’t be very precise.
We will generally like to have our effective sample size in the 100s if not 1000s.
Here is a crude estimate of effective sample size to give you some idea how it works. Let’s let
\[ ESS = \frac{\mbox{sample size}}{1 + \sum_k 2 r_k} \] where \(r_k\) is the \(k\)th lagged correlation.
Some good things about this measure:
The first auto-correlation is always 1. In a perfect world, rest would be 0, so we would get \(\frac{n}{1} = n\).
Generally, if there is more auto-correlation, the denominator will be larger and ESS will be smaller.
This measure is based on estimating the mean of a normal distribution from a correlated sample. It will work less well if the distribution isn’t normal or if you are interested in something other than the mean.
Soon we’ll see a couple fancier versions of this (computed in software for us, so we don’t need to know all the details).
Consider the following simple model:
In this case the posterior distribution for \(\mu\) can be worked out exactly and should be normal.
Let’s code up our Metropolis algorithm for this situation. New stuff:
rnorm()
metro_norm <- function(
y, # data vector
num_steps = 1e5,
step_size = 1, # sd's of jump distributions
start = list(mu = 0, log_sigma = 0)
) {
step_size <- rep(step_size, 2)[1:2] # make sure exactly two values
mu <- rep(NA, num_steps) # trick to pre-allocate memory
log_sigma <- rep(NA, num_steps) # trick to pre-allocate memory
move <- rep(NA, num_steps) # trick to pre-allocate memory
mu[1] <- start$mu
log_sigma[1] <- start$log_sigma
move[1] <- TRUE
for (i in 1:(num_steps - 1)) {
# head to new "island"
mu[i + 1] <- rnorm(1, mu[i], step_size[1])
log_sigma[i + 1] <- rnorm(1, log_sigma[i], step_size[2])
move[i + 1] <- TRUE
log_post_current <-
dnorm(mu[i], 0, 1, log = TRUE) +
dnorm(log_sigma[i], 0, 1, log = TRUE) +
sum(dnorm(y, mu[i], exp(log_sigma[i]), log = TRUE))
log_post_proposal <-
dnorm(mu[i + 1], 0, 1, log = TRUE) +
dnorm(log_sigma[i + 1], 0, 1, log = TRUE) +
sum(dnorm(y, mu[i + 1], exp(log_sigma[i+1]), log = TRUE))
prob_move <- exp(log_post_proposal - log_post_current)
# sometimes we "sail back"
if (runif(1) > prob_move) {
move[i + 1] <- FALSE
mu[i + 1] <- mu[i]
log_sigma[i + 1] <- log_sigma[i]
}
}
tibble(
step = 1:num_steps,
mu = mu,
log_sigma = log_sigma,
move = move,
step_size = paste(step_size, collapse = ", ")
)
}
Let’s use the algorithm with three different step size values and compare results.
set.seed(341)
y <- rnorm(25, 1, 2) # sample of 25 from Norm(1, 2)
Tour5 <- metro_norm(y = y, num_steps = 5000, step_size = 5)
Tour1 <- metro_norm(y = y, num_steps = 5000, step_size = 1)
Tour0.1 <- metro_norm(y = y, num_steps = 5000, step_size = 0.1)
Tour0.01 <- metro_norm(y = y, num_steps = 5000, step_size = 0.01)
Norm_Tours <- bind_rows(Tour5, Tour1, Tour0.1, Tour0.01)
df_stats(~ move | step_size, data = Norm_Tours, props)
## response step_size prop_FALSE prop_TRUE
## 1 move 0.01, 0.01 0.0352 0.9648
## 2 move 0.1, 0.1 0.2564 0.7436
## 3 move 1, 1 0.9064 0.0936
## 4 move 5, 5 0.9960 0.0040
gf_dhistogram( ~ mu | step_size ~ ., data = Norm_Tours, bins = 100)
gf_dhistogram( ~ exp(log_sigma) | step_size ~ ., data = Norm_Tours, bins = 100)
gf_line(mu ~ step | step_size ~ ., data = Norm_Tours)
gf_line(log_sigma ~ step | step_size ~ ., data = Norm_Tours)
Norm_Tours %>% filter(step_size == "5, 5") %>% pull(mu) %>%
acf_plot(lag.max = 100) /
Norm_Tours %>% filter(step_size == "1, 1") %>% pull(mu) %>%
acf_plot(lag.max = 100) /
Norm_Tours %>% filter(step_size == "0.1, 0.1") %>% pull(mu) %>%
acf_plot(lag.max = 100) /
Norm_Tours %>% filter(step_size == "0.01, 0.01") %>% pull(mu) %>%
acf_plot(lag.max = 100)
If we run multiple chains with different starting points and different random choices, we hope to see similar trace plots. After all, we don’t want our analysis to be an analysis of starting points or of random choices.
Tour1a <- metro_norm(y = y, num_steps = 5000, step_size = 1) %>% mutate(chain = "A")
Tour1b <- metro_norm(y = y, num_steps = 5000, step_size = 1) %>% mutate(chain = "B")
Tour1c <- metro_norm(y = y, num_steps = 5000, step_size = 1, start = list(mu = 10, log_sigma = 5)) %>% mutate(chain = "C")
Tour1d <- metro_norm(y = y, num_steps = 5000, step_size = 1, start = list(mu = 10, log_sigma = 5)) %>% mutate(chain = "D")
Tours1 <- bind_rows(Tour1a, Tour1b, Tour1c, Tour1d)
gf_line(mu ~ step, color = ~chain, alpha = 0.5, data = Tours1)
gf_line(mu ~ step, color = ~chain, alpha = 0.5, data = Tours1) %>%
gf_facet_grid( chain ~ .)
Not all posteriors are normal, but here’s what a chain would look like if the posterior is normal and there is no correlation between draws.
Ideal <- tibble(step = 1:5000, mu = rnorm(5000, 1, .3), step_size = "ideal")
gf_line(mu ~ step | step_size, data = Norm_Tours %>% bind_rows(Ideal))
If the draws are correlated, then we might get more ideal behavior if we selected only a subset – every 20th or every 30th value, for example. This is the idea behind “effective sample size”. The effective sample size of a correlated chain is the length of an ideal chain that contains as much independent information as the correlated chain.
acf_plot(Norm_Tours %>% filter(step_size == "1, 1") %>% pull(mu))
gf_line(mu ~ step | step_size ~ ., data = bind_rows(Norm_Tours, Ideal) %>% filter(step %% 20 == 0)) %>%
gf_labs(title = "Every 20th draw")
gf_line(mu ~ step | step_size ~ ., data = bind_rows(Norm_Tours, Ideal) %>% filter(step %% 30 == 0)) %>%
gf_labs(title = "Every 30th draw")
If we thin to every 20th value, our chains with step_size = 1
and step_size = 0.1
look quite similar to the ideal chain. The chain with step_size = 0.01
still moves too slowly through the parameter space. So tuning parameters will affect the effective sample size.
For some MCMC algorithms, the first portion of a chain may not work as well. For example, it may spend a long time in a region of low posterior density as it moves towards higher density regions and thus inflate the probabilities in this region. For algorithms with this issue, this portion is typically removed from the analysis since it is more an indication of the starting values used than the long-run sampling from the posterior. The basic Metropolis algorithm is susceptible to this issue. (The Hamiltonian Markov Chain method that Stan uses does not have this issue in the same way, but there will still be a pre-phase that is not part of the samples used to approximate the posterior.)
gf_line(log_sigma ~ step | step_size ~ ., data = Norm_Tours)
gf_path(log_sigma ~ mu | step_size ~ (step > 500), data = Norm_Tours, alpha = 0.5) %>%
gf_density2d(log_sigma ~ mu, data = Norm_Tours)
## Warning: stat_contour(): Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: stat_contour(): Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
## Warning: stat_contour(): Zero contours were generated
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
gf_histogram( ~ mu | step_size ~ (step > 500) , data = Norm_Tours, binwidth = 0.1)
gf_histogram( ~ log_sigma | step_size ~ (step > 500), data = Norm_Tours, binwidth = 0.1)
Similar issues arise in many MCMC algorithms – fancier versions attempt to minimize the impact of these issues.
Gibbs sampling modifies the Metropolis algorithm by changing how the proposal is selected. It only works in situations where the marginal posterior distributions are known (as is the case when we use conjugate priors).
The idea:
At each step, pick one parameter to change.
Select a propasal for that parameter by randomly sampling from the marginal posterior for that parameter, assuming all the other parameters are as in the current parameter values.
What’s good about this:
What’s bad about this:
The idea:
The good:
Rarely reject a proposal, improving efficiency (see the bad for why “rarely” and not “never”).
Handles wider range of priors and likelihoods.
Does better (than Gibbs) with correlated parameters (but things still work better when there is less correlation).
Easier to diagnose common problems, so we (often) know when the algorithm is not working well an can try something else.
Coded in Stan. Wrappers exist for R, Python, etc.
The bad:
Each proposal take longer to compute (we have to simulate physics).
Lots of tuning parameters: how hard to flick, what direction, how long to let puck slide, how many little pieces to break the path into, etc., etc.
The physics simulation doesn’t always work out numerically stably, so we do actually reject sometimes. (Conservation of energy can be checked, and if it doesn’t hold well enough, we reject the proposal.)
This is the method we will be migrating to as our models become more complicated.
Suppose we have two coins and want to compare the proportion of heads each coin generates.
That is,
For the examples below we will use data showing that 6 of 8 tosses of the first coin were heads and only 2 of 7 tosses of the second coin. But the methods work equally well with other data sets. While comparing a small number of coin tosses like this is not so interesting, the method can be used for a wide range of practical things, liking testing whether two treatments for a condition are equally effective, etc.
From this we can work out the posterior as usual. (Pay attention to the important parts of the kernel, that is, the numerators.)
\[\begin{align*} p(\theta_1, \theta_2 \mid D) &= p(D \mid \theta_1, \theta_2) p(\theta_1, \theta_2) / p(D) \\[3mm] &= \theta_1^{z_1} (1 − \theta_1)^{N_1−z_1} \theta_1^{z_2} (1 − \theta_2)^{N_2−z_2} p(\theta_1, \theta_2) / p(D) \\[3mm] &= \frac{ \theta_1^{z_1} (1 − \theta_1)^{N_1 − z_1} \theta_1^{z_2} (1 − \theta_2)^{N_2 − z_2} \theta_1^{a_1−1}(1 − \theta_1)^{b_1 - 1} \theta_2^{a_2−1}(1 − \theta_2)^{b_2 - 1}} {p(D)B(a_1, b_1)B(a_2, b_2)} \\[3mm] &= \frac{ \theta_1^{z_1 + a_1 − 1}(1 − \theta_1)^{N_1 − z_1 + b_1 − 1} \theta_2^{z_2 + a_2 − 1}(1 − \theta_2)^{N_2 − z_2 + b_2 − 1} }{p(D)B(a_1, b_1)B(a_2, b_2)} \end{align*}\]
So the posterior distribution of \(\langle \theta_1, \theta_2 \rangle\) is two independent Beta distributions: \({\sf Beta}(z_1 + a, N_1 - z_1 + b_1)\) and \({\sf Beta}(z_2 + a, N_2 - z_2 + b_2)\).
Some nice images of these distributions appear on page 167.
metro_2coins <- function(
z1, n1, # z = successes, n = trials
z2, n2, # z = successes, n = trials
step_size = c(0.1, 0.1), # sds of jump distribution
start = c(0.5, 0.5), # value of thetas to start at
num_steps = 5e4, # number of steps to run the algorithm
prior1 = dbeta, # function describing prior
prior2 = dbeta, # function describing prior
args1 = list(), # additional args for prior1
args2 = list() # additional args for prior2
) {
theta1 <- rep(NA, num_steps) # trick to pre-allocate memory
theta2 <- rep(NA, num_steps) # trick to pre-allocate memory
proposed_theta1 <- rep(NA, num_steps) # trick to pre-allocate memory
proposed_theta2 <- rep(NA, num_steps) # trick to pre-allocate memory
move <- rep(NA, num_steps) # trick to pre-allocate memory
theta1[1] <- start[1]
theta2[1] <- start[2]
step_size1 <- step_size[1]
step_size2 <- step_size[2]
for (i in 1:(num_steps-1)) {
# head to new "island"
proposed_theta1[i + 1] <- rnorm(1, theta1[i], step_size1)
proposed_theta2[i + 1] <- rnorm(1, theta2[i], step_size2)
if (proposed_theta1[i + 1] <= 0 ||
proposed_theta1[i + 1] >= 1 ||
proposed_theta2[i + 1] <= 0 ||
proposed_theta2[i + 1] >= 1) {
proposed_posterior <- 0 # because prior is 0
} else {
current_prior <-
do.call(prior1, c(list(theta1[i]), args1)) *
do.call(prior2, c(list(theta2[i]), args2))
current_likelihood <-
dbinom(z1, n1, theta1[i]) *
dbinom(z2, n2, theta2[i])
current_posterior <- current_prior * current_likelihood
proposed_prior <-
do.call(prior1, c(list(proposed_theta1[i+1]), args1)) *
do.call(prior2, c(list(proposed_theta2[i+1]), args2))
proposed_likelihood <-
dbinom(z1, n1, proposed_theta1[i+1]) *
dbinom(z2, n2, proposed_theta2[i+1])
proposed_posterior <- proposed_prior * proposed_likelihood
}
prob_move <- proposed_posterior / current_posterior
# sometimes we "sail back"
if (runif(1) > prob_move) { # sail back
move[i + 1] <- FALSE
theta1[i + 1] <- theta1[i]
theta2[i + 1] <- theta2[i]
} else { # stay
move[i + 1] <- TRUE
theta1[i + 1] <- proposed_theta1[i + 1]
theta2[i + 1] <- proposed_theta2[i + 1]
}
}
tibble(
step = 1:num_steps,
theta1 = theta1,
theta2 = theta2,
proposed_theta1 = proposed_theta1,
proposed_theta2 = proposed_theta2,
move = move,
step_size1 = step_size1,
step_size2 = step_size2
)
}
Metro_2coinsA <-
metro_2coins(
z1 = 6, n1 = 8,
z2 = 2, n2 = 7,
step_size = c(0.02, 0.02),
args1 = list(shape1 = 2, shape2 = 2), args2 = list(shape1 = 2, shape2 = 2)
)
Metro_2coinsA %>%
gf_density2d(theta2 ~ theta1)
Metro_2coinsA %>%
gf_density(~ (theta2 - theta1))
# effective sample size is much smaller than apparent sample size due to auto-correlation
acf(Metro_2coinsA$theta2 - Metro_2coinsA$theta1)
Metro_2coinsA %>% filter(step < 500) %>%
gf_path(theta2 ~ theta1, color = ~ step, alpha = 0.5) %>%
gf_point(theta2 ~ theta1, color = ~ step, alpha = 0.5)
Metro_2coinsB <-
metro_2coins(
z1 = 6, n1 = 8,
z2 = 2, n2 = 7,
step_size = c(0.2, 0.2),
args1 = list(shape1 = 2, shape2 = 2), args2 = list(shape1 = 2, shape2 = 2)
)
Metro_2coinsB %>%
gf_density2d(theta2 ~ theta1)
Metro_2coinsB %>%
gf_density(~ (theta2 - theta1))
# effective sample size is better but still quite a bit
# smaller than apparent sample size due to auto-correlation
acf(Metro_2coinsB$theta2 - Metro_2coinsB$theta1)
Metro_2coinsB %>% filter(step < 500) %>%
gf_path(theta2 ~ theta1, color = ~ step, alpha = 0.5) %>%
gf_point(theta2 ~ theta1, color = ~ step, alpha = 0.5)
Gibbs sampling provides an attempt to improve on the efficiency of the standard Metropolis algorithm by using a different method to propose new parameter values. The idea is this:
Pick one of the parameter values: \(\theta_i\)
Determine the posterior distribution of \(\theta_i\) using current estimates of the other parameters \(\{\theta_j \mid j \neq i\}\)
Sample from the posterior distribution for \(\theta_i\) to get a new proposed value for \(\theta_i\).
Keep cycling through all the parameters, each time updating one using the current estimates for the others.
It takes some work to show that this also converges, and in practice it is often more efficient than the basic Metropolis algorithm. But it is limited to situations where the marginal posterior can be be sampled from.
In our example, we can work out the marginal posterior fairly easily:
\[\begin{align*} p(\theta_1|\theta_2, D) &= p(\theta_1, \theta_2|D)/p(\theta_2|D) \\ &= p(\theta_1, \theta_2|D) \int d\theta_1 p(\theta_1, \theta_2|D) \\ &= \frac{\mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 −z_1 + b_1) \cdot \mathrm{dbeta}(\theta_2, z_2 + a_2, N_2 −z_2 + b_2)} {\int d\theta_1 \mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 −z_1 +b_1) \cdot \mathrm{dbeta}(\theta_2 \mid z_2 +a_2, N_2 −z_2 +b_2)} \\ &= \frac{\mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 − z_1 + b_1) \cdot \mathrm{dbeta}(\theta_2, z_2 + a_2, N_2 − z_2 + b_2)} {\mathrm{dbeta}(\theta_2 \mid z_2 + a_2, N_2 −z_2 +b_2) \int d\theta_1 \mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 − z_1 + b_1)} \\ &= \frac{\mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 − z_1 + b_1)} {\int d\theta_1 \mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 − z_1 + b_1)} \\ &= \mathrm{dbeta}(\theta_1, z_1 + a_1, N_1 − z_1 + b_1) \end{align*}\]
In other words, \(p(\theta_1 \mid \theta_2, D) = p(\theta_1, D)\), which isn’t surprising since we already know that the posterior distributions \(\theta_1\) and \(\theta_2\) are independent. (In more complicated models, the marginal posterior may depend on all of the parameters, so the calculation above might not be so simple.) Of course, the analogous statement holds for \(\theta_2\) in this model.
This examples shows off Gibbs sampling in a situation where it shines: when the posterior distribution is a collection of independent marginals.
gibbs_2coins <- function(
z1, n1, # z = successes, n = trials
z2, n2, # z = successes, n = trials
start = c(0.5, 0.5), # value of thetas to start at
num_steps = 1e4, # number of steps to run the algorithm
a1, b1, # params for prior for theta1
a2, b2 # params for prior for theta2
) {
theta1 <- rep(NA, num_steps) # trick to pre-allocate memory
theta2 <- rep(NA, num_steps) # trick to pre-allocate memory
theta1[1] <- start[1]
theta2[1] <- start[2]
for (i in 1:(num_steps-1)) {
if (i %% 2 == 1) { # update theta1
theta1[i+1] <- rbeta(1, z1 + a1, n1 - z1 + b1)
theta2[i+1] <- theta2[i]
} else { # update theta2
theta1[i+1] <- theta1[i]
theta2[i+1] <- rbeta(1, z2 + a2, n2 - z2 + b2)
}
}
tibble(
step = 1:num_steps,
theta1 = theta1,
theta2 = theta2,
)
}
Gibbs <-
gibbs_2coins(z1 = 6, n1 = 8,
z2 = 2, n2 = 7,
a1 = 2, b1 = 2, a2 = 2, b2 = 2)
Gibbs %>% gf_density2d(theta2 ~ theta1)
Gibbs %>% gf_dens( ~ theta1)
acf(Gibbs$theta1)
Gibbs %>% filter(step < 500) %>%
gf_path(theta2 ~ theta1, color = ~ step, alpha = 0.5) %>%
gf_point(theta2 ~ theta1, color = ~ step, alpha = 0.5)
Note: Software like JAGS only records results each time it makes a complete cycle through the parameters. We could do that by keeping every other row:
Gibbs %>% filter(step %% 2 == 0) %>% gf_density2d(theta2 ~ theta1)
Gibbs %>% filter(step %% 2 == 0) %>% gf_density( ~ (theta2 - theta1))
Gibbs %>% filter(step %% 2 == 0) %>%
mutate(difference = theta2 - theta1) %>%
pull(difference) %>%
acf()
Gibbs %>% filter(step < 500, step %% 2 == 0) %>%
gf_path(theta2 ~ theta1, color = ~ step, alpha = 0.5) %>%
gf_point(theta2 ~ theta1, color = ~ step, alpha = 0.5)
The good news is that other people have done the hard work of coding up Gibbs sampling for us, we just need to learn how to describe the model in a way that works with their code. We will be using JAGS (just another Gibbs Sampler) and later Stan (which generalizes the metropolis algorithm in a different way) to fit more interesting models that would be too time consuming to custom code on our own. The examples in the chapter are to help us understand better how these algorithms work so we can interpret the results we obtain when using them.
We have seen that we can fit this model a number of different ways now, but we haven’t actually looked at the results. Are the coins different? How much different?
Since the Gibbs sampler is more efficient than the Metropolis algorithm for this situation, we’ll use the posterior samples from the Gibbs sampler to answer. We are interested in \(\theta_2 - \theta_1\), the difference in the biases of the two coins. To learn about that, we simply investigate the distribution of that quantity in our posterior samples. (Note: You cannot learn about this difference by only considering \(\theta_1\) and \(\theta_2\) separately.)
gf_dhistogram(~(theta2 - theta1), data = Gibbs) %>%
gf_dens()
hdi(Gibbs %>% mutate(difference = theta2 - theta1), pars = "difference")
## par lo hi mode prob
## 1 difference -0.6627973 0.07687787 -0.3758359 0.95
mosaic::prop(~(theta2 - theta1 > 0), data = Gibbs)
## prop_TRUE
## 0.0643
We don’t have much data, so the posterior distribution of the difference in biases is pretty wide. 0 is within the 95% HDI for \(\theta_2 - \theta_1\), so we don’t have compelling evidence that the two biases are different based on this small data set.
We will refer to one random walk starting from a particular starting value as a chain.
Our goal, whether we use Metropolis, Gibbs sampling, Stan, or some other MCMC posterior sampling method, is to generate random values from the posterior distribution. Ideally these samples should be
representative of the posterior and not of other artifacts like the starting location of the chain, or tuning parameters used to generate the walk.
accurate to the posterior. If we run the algorithm multiple times, we would like the results to be similar from run to run. (They won’t match exactly, but if they are all giving good approximations to the same thing, then they should all close to each other.)
efficient. The theory says that all of these methods converge to the correct posterior distribution, but in practice, we can only do a finite run. If the sampling is not efficient enough, our finite run might give us a distorted picture (that would eventually have been corrected had we run things long enough).
If we are convinced that the posterior sampling is of high enough quality, we can use the posterior samples to answer all sorts of questions relatively easily.
Learn to design more interesting models to answer more interesting questions.
Learn to describe these models and hand them to Stan (or some other MCMC algorithm) for posterior sampling.
Learn to diagnose posterior samples to detect potential problems with the posterior sampling.