King Markov is king of a chain of 5 islands. Rather than live in a palace, he lives in a royal boat. Each night the royal boat anchors in the harbor of one of the islands. The law declares that the king must harbor at each island in proportion to the population of the island.
Example: if the populations of the islands are 100, 200, 300, 400, and 500 people, how often must King Markov harbor at each island? The total population is 1500, so he must spend 100/1500 = 1/15 of his nights at the first island, 2/15 at the second, and so on, up to 5/15 at the largest.
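In R, the required proportions are just the populations divided by their total:

populations <- c(100, 200, 300, 400, 500)
populations / sum(populations)  # fraction of nights at each island: 1/15, ..., 5/15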
King Markov has some personality quirks:
He can’t stand record keeping, so he doesn’t know the populations of his islands and doesn’t keep track of which islands he has visited or when.
He can’t stand routine (variety is the spice of his life), so he doesn’t want to know each night where he will be the next night.
He asks Advisor Metropolis to devise a way for him to obey the law while accommodating these quirks.
He can ask the clerk on any island what its population is whenever he needs to know. But it takes half a day to sail from one island to another, so he is limited in how much information he can obtain this way each day.
Metropolis devises the following scheme:
Each morning, randomly pick one of the 4 other islands (a proposal island) and sail there, inquiring about its population over lunch.
Let \(J(b \mid a)\) be the conditional probability of selecting island \(b\) as the candidate if \(a\) is the current island.
\(J\) does not depend on the populations of the islands (since the King can’t remember them).
Before leaving, inquire about the population of the current island.
When arriving at the proposal island, inquire about its population.
If the proposal island has more people, stay at the proposal island for the night (since the king should prefer more populated islands).
If the proposal island has fewer people, stay at the proposal island with probability \(R\); else return to the “current” island (i.e., last night’s island).
Metropolis is convinced that for the right choices of \(J\) and \(R\), this scheme will satisfy the law.
He quickly determines that \(R\) cannot be 0 and cannot be 1:
What happens if \(R = 1\)? (The king always moves, so in the long run he visits all five islands equally often, ignoring the populations entirely.)
What happens if \(R = 0\)? (The king never moves to a smaller island, so he eventually reaches the most populous island and stays there forever.)
Somehow \(R\) must depend on the populations of the current and proposal islands. But how? If \(R\) is too large, the king will visit small islands too often. If \(R\) is too small, he will visit large islands too often.
Fortunately, Metropolis knows about Markov Chains. Unfortunately, some of you may not. So let’s learn a little bit about Markov Chains and then figure out how Metropolis should choose \(J\) and \(R\).
This is going to be very quick.
You can learn more, if you are interested, from any introduction to Markov Chains.
Consider a random process that proceeds in discrete steps (often referred to as time). Let \(X_t\) represent the “state” of the process at time \(t\). Since this is a random process, \(X_t\) is random, and we can ask probability questions like “What is the probability of being in state ___ at time ___?”, i.e., what is \(P(X_t = x)\)?
If \[ P(X_{t+1} = x \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_0 = x_0) = P(X_{t+1} = x \mid X_{t} = x_t) \] then we say that the process is a Markov Chain. The intuition is that (the probabilities of) what happens next depends only on the current state and not on the previous history.
The simplest version of a Markov Chain is one that is time-homogeneous:
\[ P(X_{t+1} = x_j \mid X_t = x_i) = p_{ij} \] That is, the probability of moving from state \(i\) to state \(j\) in one step is the same at every step.
A time-homogeneous Markov Chain can be represented by a square matrix \(M\) with
\[
M_{ij} = p_{ij} = \mbox{probability of transition from state $i$ to state $j$ in one step}
\] (This will be an infinite matrix if the state space is infinite, but we’ll start with simple examples with small, finite state spaces.)
More generally, we will write \(M^{(k)}_{ij}\) for the probability of moving from state \(i\) to state \(j\) in \(k\) steps. Conveniently, \(M^{(k)} = M^k\): the \(k\)-step transition probabilities are given by the \(k\)-th matrix power of \(M\).
Small Example:
M <- rbind(
  c(0.00, 0.50, 0.50),
  c(0.25, 0.25, 0.50),
  c(0.50, 0.30, 0.20)
)
M
## [,1] [,2] [,3]
## [1,] 0.00 0.50 0.5
## [2,] 0.25 0.25 0.5
## [3,] 0.50 0.30 0.2
A time-homogeneous Markov Chain is called regular if there is a number \(k\) such that every entry of \(M^{(k)}\) is positive; that is, it is possible to get from any state to any state in exactly \(k\) steps.
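Our small example is regular: already for \(k = 2\), every entry of \(M^{(2)}\) is positive. (Here and below, the %^% matrix power operator comes from the expm package; any matrix power routine would do.)

library(expm)  # provides %^%, the matrix power operator
M %^% 2        # all entries positive, so the chain is regular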
Regular Markov Chains have a very nice property:
\[ \lim_{k \to \infty} M^{(k)} = W \] where every row of \(W\) is the same. This says that, no matter where you start the process, the long-run probability of being in each state will be the same.
In our example above, convergence is quite rapid:
M %^% 20
## [,1] [,2] [,3]
## [1,] 0.2769231 0.3384615 0.3846154
## [2,] 0.2769231 0.3384615 0.3846154
## [3,] 0.2769231 0.3384615 0.3846154
M %^% 21
## [,1] [,2] [,3]
## [1,] 0.2769231 0.3384615 0.3846154
## [2,] 0.2769231 0.3384615 0.3846154
## [3,] 0.2769231 0.3384615 0.3846154
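We can also verify the limiting behavior by simulating the chain directly and tabulating how often each state is visited; a minimal sketch (the seed is arbitrary):

set.seed(12345)                # for reproducibility
x <- numeric(10000)
x[1] <- 1                      # start in state 1
for (t in 1:(length(x) - 1)) {
  # next state is drawn using row x[t] of the transition matrix
  x[t + 1] <- sample(1:3, 1, prob = M[x[t], ])
}
prop.table(table(x))           # should be close to the rows of W above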
Note: If we apply the matrix \(M\) to the limiting probability (\(w\), one row of \(W\)), we just get \(w\) back again:
\[w M = w\]
W <- M %^% 30
W[1,]
## [1] 0.2769231 0.3384615 0.3846154
W[1,] %*% M
## [,1] [,2] [,3]
## [1,] 0.2769231 0.3384615 0.3846154
In fact, this is a necessary and sufficient condition: for a regular chain, \(w\) is the unique probability vector satisfying \(wM = w\).
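This also gives a way to compute \(w\) without iterating: \(w\) is a left eigenvector of \(M\) with eigenvalue 1, i.e., an ordinary eigenvector of t(M). A minimal sketch (for a regular chain, eigenvalue 1 has the largest modulus, so eigen() lists it first):

e <- eigen(t(M))           # eigen() sorts eigenvalues by decreasing modulus
w <- Re(e$vectors[, 1])    # eigenvector for eigenvalue 1 (may be stored as complex)
w / sum(w)                 # normalize to a probability vector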
So, here’s what Metropolis needs to do: choose \(J\) and \(R\) so that the resulting Markov Chain is regular and its limiting distribution is exactly the distribution the law prescribes.
If \(R\) is between 0 and 1, and the jumping rule allows us to get to all the islands (eventually), then the Markov Chain will be regular, so there will be a limiting distribution. But the limiting distribution must be the one the law requires. Write \(f(a)\) for the proportion of the total population living on island \(a\); this is the long-run probability the law prescribes. It suffices to show that if the law is satisfied at time \(t\), it is satisfied at time \(t+1\) (this is just \(wM = w\)):
\[ P(X_t = a) = f(a) \mbox{ for all $a$ } \Rightarrow P(X_{t+1} = a) = f(a) \mbox{ for all $a$} \]
Here’s the trick: We will choose \(J\) and \(R\) so that the following two unconditional probabilities are equal, where \(P(a \to_t b)\) denotes \(P(X_t = a, X_{t+1} = b)\), the probability of harboring at island \(a\) at time \(t\) and at island \(b\) at time \(t+1\):
\[ P(a \to_t b) = P(b \to_t a) \]
Why does this work?
Suppose \(P(X_t = a) = f(a)\) as the law prescribes.
\(P(a \to_t b) = P(b \to_t a)\) makes the joint distribution symmetric: for any \(a\) and any \(b\),
\[P(X_{t} = a, X_{t+1} = b) = P(X_{t} = b, X_{t+1} = a)\]
Summing both sides over \(b\) yields
\[P(X_{t+1} = a) = \sum_b P(X_t = b, X_{t+1} = a) = \sum_b P(X_t = a, X_{t+1} = b) = P(X_t = a) = f(a)\]
so the law is still satisfied at time \(t+1\).
Time for some algebra (and probability)! How do we choose \(J\) and \(R\)? Recall the ingredients: the target proportions \(f\), the jumping rule \(J\), and the probability \(R\) of staying at a less populous proposal island.
Consider two islands, \(a\) and \(b\), with \(f(b) > f(a)\). Let’s calculate the probabilities of moving between the islands.
Probability of moving from \(a\) to \(b\) = \(f(a) \, J(b \mid a) \cdot 1\) (a proposal to move to the more populous island is always accepted).
Probability of moving from \(b\) to \(a\) = \(f(b) \, J(a \mid b) \cdot R\) (a proposal to move to the less populous island is accepted with probability \(R\)).
How do we choose J and R to make these probabilities equal?
Answer: We need to choose a jump rule (\(J()\)) so that the Markov chain is regular (i.e., so that it is possible to get from any island to any island in some fixed number of steps). Given such a \(J()\), we can let \(R\) be defined by
\[ R = \frac{f(a) J(b \mid a)}{f(b) J(a \mid b)} \] An especially easy case is when \(J(b \mid a) = J(a \mid b)\). In that case \[ R = \frac{f(a)}{f(b)} \]
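As a quick sanity check, here are the two flows for King Markov’s example populations under the symmetric jump rule (the choice of islands 1 and 5 is just for illustration):

f <- c(100, 200, 300, 400, 500) / 1500  # the law's target proportions
a <- 1; b <- 5                          # a small island and a large one
J <- 1/4                                # symmetric jump rule: 4 other islands
R <- f[a] / f[b]                        # chance of accepting the "downhill" move
f[a] * J * 1                            # flow a -> b
f[b] * J * R                            # flow b -> a: the same value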
The original Metropolis algorithm used symmetric jump rules. The later generalization (Metropolis-Hastings) employed non-symmetric jump rules to get better performance of the Markov Chain.
KingMarkov <- function(
  num_steps = 1e5,
  population = 1:10,
  start = 1,
  J = function(a, b) 1 / (length(population) - 1)  # uniform jump rule by default
) {
  num_islands <- length(population)
  island <- rep(NA, num_steps)  # trick to pre-allocate memory
  current <- start
  for (i in 1:num_steps) {
    # record current island
    island[i] <- current
    # propose any one of the other islands
    other_islands <- setdiff(1:num_islands, current)
    proposal <-
      sample(other_islands, 1,
             prob = purrr::map_dbl(other_islands, ~ J(current, .x)))
    # move?
    prob_move <- population[proposal] / population[current]
    current <- ifelse(runif(1) < prob_move, proposal, current)
  }
  data_frame(
    step = 1:num_steps,
    island = island
  )
}
Tour <- KingMarkov(5000)
Target <- data_frame(island = 1:10, prop = (1:10)/ sum(1:10))
gf_line(island ~ step, data = Tour %>% filter(step < 200))
gf_dhistogram( ~ island, data = Tour, binwidth = 1) %>%
  gf_point(prop ~ island, data = Target, color = "red")
neighbor <- function(a, b) as.numeric(abs(a - b) %in% c(1, 9))  # propose only adjacent islands; 9 wraps island 10 around to island 1
Tour <- KingMarkov(5000, J = neighbor)
Target <- data_frame(island = 1:10, prop = (1:10)/ sum(1:10))
gf_line(island ~ step, data = Tour %>% filter(step < 200))
gf_dhistogram( ~ island, data = Tour, binwidth = 1) %>%
  gf_point(prop ~ island, data = Target, color = "red")
That was a nice story, and some nice probability theory. But what does it have to do with Bayesian computation? Regular Markov Chains (and some generalizations of them) can be used to sample from a posterior distribution: the islands become parameter values, the population becomes the posterior density, and the acceptance probability becomes
\[R = \frac{\mbox{proposal "posterior"}}{\mbox{current "posterior"}}\]
The scare quotes matter: because \(R\) is a ratio, the unknown normalizing constant of the posterior cancels, so it suffices to compute prior \(\times\) likelihood at each parameter value.
Metropolis-Hastings variation: a non-symmetric jump rule \(J\) can be used, with the acceptance probability adjusted just as for King Markov:
\[R = \frac{\mbox{proposal "posterior"} \cdot J(\mbox{current} \mid \mbox{proposal})}{\mbox{current "posterior"} \cdot J(\mbox{proposal} \mid \mbox{current})}\]
Other variations: Gibbs sampling, Hamiltonian Monte Carlo (the engine behind Stan), and others.
Consider the following simple model:
\[ y_i \sim \mbox{Norm}(\mu, \sigma), \qquad \mu \sim \mbox{Norm}(0, 1), \qquad \log(\sigma) \sim \mbox{Norm}(0, 1) \]
In this case the posterior distribution for \(\mu\) can be worked out (at least to a good approximation) and should be close to normal, which gives us a way to check our algorithm.
Let’s code up our Metropolis algorithm for this situation. New stuff: the “islands” are now points \((\mu, \log \sigma)\) in a continuous parameter space, so proposals are drawn with rnorm() (a normal jump distribution centered at the current values), and the posterior is computed on the log scale with dnorm(..., log = TRUE) to avoid numerical underflow.
MetropolisNorm <- function(
  num_steps = 1e5,
  y,
  size = 1,  # sd of jump distribution
  start = list(mu = 0, log_sigma = 0)
) {
  mu        <- rep(NA, num_steps)  # trick to pre-allocate memory
  log_sigma <- rep(NA, num_steps)
  move      <- rep(NA, num_steps)
  mu[1] <- start$mu
  log_sigma[1] <- start$log_sigma
  move[1] <- TRUE
  for (i in 1:(num_steps - 1)) {
    # head to new "island"
    mu[i + 1]        <- rnorm(1, mu[i], size)
    log_sigma[i + 1] <- rnorm(1, log_sigma[i], size)
    move[i + 1] <- TRUE
    # log posterior = log prior(mu) + log prior(log_sigma) + log likelihood
    log_post_current <-
      dnorm(mu[i], 0, 1, log = TRUE) +
      dnorm(log_sigma[i], 0, 1, log = TRUE) +
      sum(dnorm(y, mu[i], exp(log_sigma[i]), log = TRUE))
    log_post_proposal <-
      dnorm(mu[i + 1], 0, 1, log = TRUE) +
      dnorm(log_sigma[i + 1], 0, 1, log = TRUE) +
      sum(dnorm(y, mu[i + 1], exp(log_sigma[i + 1]), log = TRUE))
    prob_move <- exp(log_post_proposal - log_post_current)
    # sometimes we "sail back"
    if (runif(1) > prob_move) {
      move[i + 1] <- FALSE
      mu[i + 1] <- mu[i]
      log_sigma[i + 1] <- log_sigma[i]
    }
  }
  data_frame(
    step = 1:num_steps,
    mu = mu,
    log_sigma = log_sigma,
    move = move,
    size = size
  )
}
Let’s use the algorithm with three different size values and compare results.
y <- rnorm(25, 10, 3)
Tour1 <- MetropolisNorm(y = y, num_steps = 5000, size = 1)
Tour0.2 <- MetropolisNorm(y = y, num_steps = 5000, size = 0.2)
Tour5 <- MetropolisNorm(y = y, num_steps = 5000, size = 5)
Tours <- bind_rows(Tour1, Tour0.2, Tour5)
prop(~ move | size, data = Tours)
## TRUE.0.2 TRUE.1 TRUE.5
## 0.5880 0.1452 0.0130
gf_dens( ~ mu | size ~ ., data = Tours)
gf_dens( ~ log_sigma | size ~ ., data = Tours)
gf_line(mu ~ step | size ~ ., data = Tours)
gf_line(log_sigma ~ step | size ~ ., data = Tours)
If we run multiple chains with different starting points and different random choices, we hope to see similar trace plots. After all, we don’t want our analysis to be an analysis of starting points or of random choices.
Tour1a <- MetropolisNorm(y = y, num_steps = 5000, size = 1) %>% mutate(chain = "A")
Tour1b <- MetropolisNorm(y = y, num_steps = 5000, size = 1) %>% mutate(chain = "B")
Tour1c <- MetropolisNorm(y = y, num_steps = 5000, size = 1, start = list(mu = 10, log_sigma = 5)) %>% mutate(chain = "C")
Tour1d <- MetropolisNorm(y = y, num_steps = 5000, size = 1, start = list(mu = 10, log_sigma = 5)) %>% mutate(chain = "D")
Tours1 <- bind_rows(Tour1a, Tour1b, Tour1c, Tour1d)
gf_line(mu ~ step, color = ~ chain, alpha = 0.5, data = Tours1)
gf_line(mu ~ step, color = ~ chain, alpha = 0.5, data = Tours1) %>%
  gf_facet_grid(chain ~ .)
Not all posteriors are normal, but here’s what a chain would look like if the posterior is normal and there is no correlation between draws.
Ideal <- data_frame(step = 1:5000, mu = rnorm(5000, 2, 1), size = 0)
gf_line(mu ~ step | size, data = Tours %>% bind_rows(Ideal))
If the draws are correlated, then we might get behavior closer to the ideal by selecting only a subset: every 10th or every 20th value, for example (this is called thinning). This is the idea behind “effective sample size”. The effective sample size of a correlated chain is the length of an ideal chain that contains as much independent information as the correlated chain.
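Effective sample size can be estimated from a chain itself; here is a minimal sketch using the coda package (an assumption about available tooling; other packages offer similar functions):

library(coda)
effectiveSize(mcmc(Tour1$mu))  # typically much less than the 5000 raw draws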
gf_line(mu ~ step | size ~ ., data = bind_rows(Tours, Ideal) %>% filter(step %% 10 == 0)) %>%
gf_labs(title = "Every 10th draw")
gf_line(mu ~ step | size ~ ., data = bind_rows(Tours, Ideal) %>% filter(step %% 20 == 0)) %>%
gf_labs(title = "Every 20th draw")
If we thin to every 20th value, our chain with size = 1 looks quite similar to the ideal chain. The chain with size = 0.2 still moves too slowly from place to place, and the chain with size = 5 still gets stuck in one place too long. So tuning parameters affect the effective sample size.
The first portion of a chain may not work as well. This portion (often called burn-in or warm-up) is typically removed from the analysis, since it reflects the starting values used more than the long-run sampling from the posterior.
gf_line(log_sigma ~ step | size ~ ., data = Tours)
gf_path(log_sigma ~ mu | size ~ (step > 500), data = Tours, alpha = 0.5) %>%
gf_density2d(log_sigma ~ mu, data = Tours)
gf_histogram( ~ mu | size ~ (step > 500) , data = Tours, binwidth = 0.1)
gf_histogram( ~ log_sigma | size ~ (step > 500), data = Tours, binwidth = 0.1)
These are really issues with all MCMC algorithms, not just the Metropolis version: tuning parameters, convergence (mixing), burn-in, and effective sample size all need attention no matter which sampler is used.
As it turns out, we won’t use the Metropolis or Metropolis-Hastings algorithms in practice. Instead we will use Stan, which shares many features with the Metropolis algorithm but has demonstrated better performance in a wide variety of (but not all) models. The performance improvements are especially notable when the number of parameters in the model is large.