Markov Chain Monte Carlo (MCMC)

King Markov and Adviser Metropolis

The Law of the Land

King Markov is king of a chain of 5 islands. Rather than live in a palace, he lives in a royal boat. Each night the royal boat anchors in the harbor of one of the islands. The law declares that the king must harbor at each island in proportion to the population of the island.

Question 1: If the populations of the islands are 100, 200, 300, 400, and 500 people, how often must King Markov harbor at each island?

| island     | 1    | 2    | 3    | 4    | 5    |
|------------|------|------|------|------|------|
| proportion | 1/15 | 2/15 | 3/15 | 4/15 | 5/15 |
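
As a quick check, we can let R do the arithmetic (the name pop is just for illustration):

pop <- c(100, 200, 300, 400, 500)
pop / sum(pop)   # proportion of the total population on each island
## [1] 0.06666667 0.13333333 0.20000000 0.26666667 0.33333333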

A Quirky King

King Markov has some personality quirks:

  • He can’t stand record keeping. So he doesn’t know the populations on his islands and doesn’t keep track of which islands he has visited when.

  • He can’t stand routine (variety is the spice of his life), so he doesn’t want to know each night where he will be the next night.

He asks Adviser Metropolis to devise a scheme that obeys the law but

  • randomly picks which island to stay at each night,
  • doesn’t require him to remember where he has been in the past, and
  • doesn’t require him to remember the populations of all the islands.

He can ask the clerk on any island what the island’s population is whenever he needs to know. But it takes half a day to sail from one island to another, so he is limited in how much information he can obtain this way each day.

The Metropolis Algorithm

Metropolis devises the following scheme:

  • Each morning, have breakfast with the island clerk and inquire about the population of the current island.

  • Then randomly pick one of the 4 other islands (a proposal island) and travel there in the morning.

    • Let \(J(b \mid a)\) be the conditional probability of selecting island \(b\) as the proposal island if \(a\) is the current island.

    • \(J(b \mid a)\) may not depend on the populations of the islands (since the king can’t remember them).

  • Over lunch at the proposal island, inquire about its population.

    • If the proposal island has more people, stay at the proposal island for the night (since the king should prefer more populated islands).

    • If the proposal island has fewer people, stay at the proposal island with probability \(A\), else return to the “current” island (i.e., the previous night’s island).

Metropolis is convinced that for the right choices of \(J\) and \(A\), this will satisfy the law.
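
To make the scheme concrete, here is a sketch of it in R, with the jump rule and the acceptance rule left as function arguments, since we haven’t yet decided what they should be. (The names simulate_tour, J_propose, and A_accept are just for illustration.)

simulate_tour <- function(n_nights, pop, J_propose, A_accept, start = 1) {
  tour <- numeric(n_nights)
  tour[1] <- start
  for (t in 2:n_nights) {
    a <- tour[t - 1]                    # the current island
    b <- J_propose(a, seq_along(pop))   # the proposal island
    if (pop[b] > pop[a]) {
      tour[t] <- b                      # more people: always stay
    } else {
      # fewer people: stay with probability A, else sail back to a
      tour[t] <- if (runif(1) < A_accept(b, a, pop)) b else a
    }
  }
  tour
}

We will return to this sketch once we know how to choose \(J\) and \(A\).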

He quickly determines that \(A\) cannot be 0 and cannot be 1:

Question 2a. What happens if \(A = 0\)? What happens if \(A = 1\)?

Question 2b. Why must \(A\) depend on the populations of islands \(a\) and \(b\)? (Show that if \(A\) doesn’t depend on the populations, then Metropolis’ algorithm will not satisfy the law.)

If \(A = 0\), then the king will always stay at the larger of the two islands, so once he gets to the largest island, he will stay there forever.

If \(A = 1\), it is a little subtler. In this case, if the proposal island is larger, the king stays there (as he always does when the proposal island has the larger population). If the proposal island is smaller, he also stays there, because \(A = 1\). So the king’s tour is completely determined by which islands are proposed, which doesn’t depend on the populations, so it can’t possibly produce the required proportions.

In fact, that same argument holds for any \(A\) that doesn’t depend on the populations. Since the tour is independent of the populations, at best it could be correct for one particular set of population ratios; it can’t work in general.

Let’s denote \(A\) as \(A(b \mid a)\) to make this clear and let \(P(a)\) be the population of island \(a\).

Quick notation review:

  • \(J(b \mid a) =\) probability of choosing to journey to island \(b\) in the morning if King Markov was on island \(a\) the previous night. [J for journey.]

  • \(A(b \mid a) =\) the probability of staying on island \(b\) if King Markov has just journeyed there from island \(a\). [A for accept.]

    • \(1-A(b \mid a)\) is the probability of sailing back to island \(a\).
    • Note that \(A(b \mid a) = 1\) if \(P(b) > P(a)\).
  • \(P(a) =\) the population of island \(a\). [P for population.]

  • \(p(a) = \frac{P(a)}{\sum_x P(x)} =\) the proportion of the entire population that lives on island \(a\). [p for proportion.]

    • Our goal is for \(p(x)\) to be the proportion of time spent on each island, but
    • the king and Metropolis won’t ever know \(p(x)\) as they go along, only \(P(a)\) and \(P(b)\).

But how should \(A\) be chosen? If \(A\) is too large, the king will visit small islands too often. If \(A\) is too small, he will visit large islands too often.

Fortunately, Metropolis knows about Markov Chains. Unfortunately, some of you may not. So let’s learn a little bit about Markov Chains and then figure out how Metropolis should choose \(J\) and \(A\).

Quick Intro to Markov Chains

More info, please

This is going to be very quick. If you are interested, you can learn more from any introductory treatment of Markov chains.

Definition

Consider a random process that proceeds in discrete steps (often referred to as time). Let \(X_t\) represent the “state” of the process at time \(t\). Since this is a random process, \(X_t\) is random, and we can ask probability questions like “What is the probability of being in state ____ at time ____?”, i.e.,

\[ \mbox{What is $\mathrm{Pr}(X_t = x)$?} \]

If \[ \mathrm{Pr}(X_{t+1} = x \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots , X_0 = x_0) = \mathrm{Pr}(X_{t+1} = x \mid X_{t} = x_t) \] then we say that the process is a Markov Chain. The intuition is that (the probabilities of) what happens next depends only on the current state and not on previous history.

Time-Homogeneous Markov Chains

The simplest version of a Markov Chain is one that is time-homogeneous:

\[ \mathrm{Pr}(X_{t+1} = b \mid X_t = a) = M_{ab} \] That is, the (conditional) probability of moving from state \(a\) to state \(b\) in one step is the same at every time. (\(M\) for Metropolis, and because we have already used \(P\) and \(p\), and because of the next section.)

Matrix representation

A time-homogeneous Markov Chain can be represented by a square matrix \(M\) with

\[ M_{ij} = \mbox{probability of transition from state $i$ to state $j$ in one step} \] (This will be an infinite matrix if the state space is infinite, but we’ll start with simple examples with small, finite state spaces.) More generally, we will write \[ M^{(k)}_{ij} = \mbox{probability of transition from state $i$ to state $j$ in $k$ steps} \]

Small Example:

library(magrittr)  # provides the %>% pipe

M <- rbind(
  c(0,    0.5,  0.5),
  c(0.25, 0.25, 0.5),
  c(0.5,  0.3,  0.2)
)
M %>% pander::pander()
------ ------ -----
  0     0.5    0.5
 0.25   0.25   0.5
 0.5    0.3    0.2
------ ------ -----

Question 3:

  • How many states does this process have?

  • What is the probability of moving from state 1 to state 3 in 1 step?

  • What is the probability of moving from state 1 to state 3 in 2 steps? (Hint: what are the possible stopping points along the way?)

  • How do we obtain \(M^{(2)}\) from \(M\)?

  • How do we obtain \(M^{(k)}\) from \(M\)?

You should see that the probability can be written as a sum of products. That should remind you of a dot product, which should remind you of matrix multiplication. What do you get when you multiply \(M \cdot M\)?
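
For our small example, one matrix multiplication gives all of the two-step probabilities at once:

M %*% M   # two-step transition probabilities
##        [,1]   [,2] [,3]
## [1,] 0.3750 0.2750 0.35
## [2,] 0.3125 0.3375 0.35
## [3,] 0.1750 0.3850 0.44

For example, the entry in row 1, column 3 is \(0.5 \cdot 0.5 + 0.5 \cdot 0.2 = 0.35\): travel from state 1 to state 3 via state 2 or via state 3 (the path via state 1 contributes nothing, since \(M_{11} = 0\)).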

Question 4: The Metropolis Algorithm is a Markov process. Express your answers in terms of \(J\), \(A\), and \(P\) where necessary.

  • What are the states of the Metropolis algorithm?

  • If King Markov stays on Island 2 one night, what is the probability that he will stay on Island 3 the next night?

  • If King Markov stays on Island 3 one night, what is the probability that he will stay on Island 2 the next night?

  • What is the general formula for the probability of moving from island \(a\) to island \(b\) (in one step)?

\[\mathrm{Pr}(X_{t+1}=b \mid X_t = a) =\]

One state per island.

The interesting case here is when \(a \neq b\), so you can limit yourself to that case. (The case where \(a = b\) is a bit more complicated, but won’t be involved in our main line of argument. Sometimes you just get lucky that way.)

To work out the probabilities, remember that two things must happen to move from \(a\) to \(b\): The king must journey there in the morning and the proposal island must be accepted as the next evening location. Keep in mind that if \(P(b) > P(a)\), then \(A(b \mid a) = 1\). You may want to consider two cases: (a) \(P(b) > P(a)\) and (b) \(P(a) > P(b)\). (We’ll think about \(P(a) = P(b)\) a little bit later, but it won’t be a problem.)

Regular Markov Chains

A time-homogeneous Markov Chain is called regular if there is a number \(k\) such that

  • every state is reachable from every state (itself included) with non-zero probability in exactly \(k\) steps

Question 5a

  • Is our small example regular? If so, how many steps are required?
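
One way to check numerically:

all(M > 0)        # k = 1 fails: M[1,1] is 0
## [1] FALSE
all(M %*% M > 0)  # k = 2 works: every two-step probability is positive
## [1] TRUE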

Question 5b

  • What conditions can you impose on the Metropolis algorithm to ensure that it is a regular Markov Chain?

Regular Markov Chains have a very nice property:

\[ \lim_{k \to \infty} M^{(k)} = W \ \ \mbox{and every row of $W$ is the same} \]

This says that, no matter where you start the process, the long-run probability of being in each state will be the same.

In our small example above, convergence is quite rapid:

library(expm)  # provides the matrix power operator %^%
M %^% 20
##           [,1]      [,2]      [,3]
## [1,] 0.2769231 0.3384615 0.3846154
## [2,] 0.2769231 0.3384615 0.3846154
## [3,] 0.2769231 0.3384615 0.3846154
M %^% 21
##           [,1]      [,2]      [,3]
## [1,] 0.2769231 0.3384615 0.3846154
## [2,] 0.2769231 0.3384615 0.3846154
## [3,] 0.2769231 0.3384615 0.3846154

Note: If we apply the matrix \(M\) to the limiting probability (\(w\), one row of \(W\)), we just get \(w\) back again:

\[w M = w\]

W <- M %^% 30
W[1,]
## [1] 0.2769231 0.3384615 0.3846154
W[1,] %*% M
##           [,1]      [,2]      [,3]
## [1,] 0.2769231 0.3384615 0.3846154

In fact, this is a necessary and sufficient condition for the limiting probability.
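
As an aside (not needed for our story), \(wM = w\) says that \(w\) is a left eigenvector of \(M\) with eigenvalue 1, so R can also find the limiting distribution directly, without computing a high matrix power:

e <- eigen(t(M))          # left eigenvectors of M are right eigenvectors of t(M)
w <- Re(e$vectors[, 1])   # eigenvalue 1 has the largest modulus, so it is listed first
w / sum(w)                # rescale so the probabilities sum to 1
## [1] 0.2769231 0.3384615 0.3846154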

So, here’s what Metropolis needs to do: Choose \(J\) and \(A\) so that

  • His algorithm is a regular Markov Chain with matrix \(M\).

  • If \(w = \langle p(1), p(2), p(3), p(4), p(5) \rangle\) is the law-prescribed probabilities for island harboring, then \(w M = w\).

Back to King Markov

If \(A (b \mid a) > 0\), and the jumping rule allows us to get to all the islands (eventually), then the Markov Chain will be regular, so there will be a limiting distribution. But the limiting distribution must be the one the law requires.

Let \(w = \langle p(1), p(2), ..., p(5)\rangle\) be the legally prescribed probability vector. It suffices to show that

\[wM = w\] (This requires some justification, but intuitively it should seem plausible that if there is a limiting matrix with each row equal to \(w\), then as we get close to the limit, things shouldn’t change much, and perhaps at the limit, things don’t change at all. For now, you just have to trust the folks who know Markov chains…)

So what does \(wM = w\) say? It says that if the probabilities are correct at step \(t\), then they will be correct again at step \(t + 1\), since multiplying by \(M\) does one step of the Markov chain. So we want to show that

\[ \mathrm{Pr}(X_t = a) = p(a) \mbox{ for all $a$ } \Rightarrow \mathrm{Pr}(X_{t+1} = a) = p(a) \mbox{ for all $a$} \]

Here’s the trick: We will choose \(J\) and \(A\) so that the following two unconditional probabilities are equal.

\[ \mathrm{Pr}(a \to_t b) = \mathrm{Pr}(b \to_t a) \] where \(\mathrm{Pr}(a \to_t b) = \mathrm{Pr}(X_t = a \ \& \ X_{t+1} = b)\).

Why does this work?

  • Suppose \(\mathrm{Pr}(X_t = a) = p(a)\) as the law prescribes.

  • \(\mathrm{Pr}(a \to_t b) = \mathrm{Pr}(b \to_t a)\) makes the joint distribution symmetric: for any \(a\) and any \(b\),

\[\mathrm{Pr}(X_{t} = a \ \& \ X_{t+1} = b) = \mathrm{Pr}(X_{t} = b \ \& \ X_{t+1} = a)\]

  • This means that both marginals are the same, so for any \(a\):

\[\mathrm{Pr}(X_t = a) = \mathrm{Pr}(X_{t+1} = a)\]

  • In other words, the probability that any island is the current island will be the same as the probability that it is the next island: \(w M = w\).

Now we just need to find \(J\) and \(A\) to make this happen.

How do we choose \(J\) and \(A\)? Time for some algebra (and probability)!

Recall the ingredients:

  • \(P(a)\) be the population of island \(a\)
  • \(p(a)\) be the proportion of the total population living on island \(a\): \(p(a) = \frac{P(a)}{\sum_x P(x)}\)
  • \(J(b \mid a)\) is the conditional probability of selecting island \(b\) as the proposal island when \(a\) is the current island. [J for jump.]
  • \(A(b \mid a)\) is the probability of accepting proposal island \(b\) if it is proposed from island \(a\).

Question 6: Consider two islands – \(a\) and \(b\) – with \(P(b) > P(a)\). Assume that probability of being on island \(x\) is \(p(x)\). Calculate the following probabilities (in terms of things like \(p\), \(J\), and \(A\)).

  • (Unconditional) probability of moving from \(a\) to \(b = \mathrm{Pr}(a \to_t b) =\)
  • (Unconditional) probability of moving from \(b\) to \(a = \mathrm{Pr}(b \to_t a) =\)

Question 7: How do we choose \(A\) to make these probabilities equal? Will this work for any jumping rule \(J\)?

\[ A(a \mid b) = \phantom{\frac{p(a) J(b \mid a)}{p(b) J(a \mid b)}} \]

Question 8: Symmetric jump rules.

  • What happens to the formula for \(A(a \mid b)\) if the jumping rule is symmetric: \(J(b \mid a) = J(a \mid b)\)?

  • Remember, the king doesn’t like to remember stuff, and this means half as much stuff to remember about the jump rules.

  • Will the king like the new acceptance rule? (The king won’t be so happy if the simpler jump rule makes the acceptance rule a lot more complicated.)

Question 9: Constant jump rules.

  • Is it possible to do this if we require that \(J(y \mid x) = J(y' \mid x')\) for all \(y \neq x\) and \(y' \neq x'\)? (This would make life even easier for the king.)

  • For King Markov, what would \(J\) be if we did it this way?

The original Metropolis algorithm used symmetric jump rules. The later generalization (Metropolis-Hastings) employed non-symmetric jump rules to get better performance of the Markov Chain.
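
If you want to check your answers to Questions 7–9 numerically: with a uniform jump rule, the acceptance rule reduces to the classic Metropolis rule \(A(b \mid a) = \min(1, P(b)/P(a))\). Plugging these into the simulate_tour sketch from earlier (all names illustrative) should reproduce the legally required proportions:

pop <- c(100, 200, 300, 400, 500)

# jump rule: choose one of the 4 other islands, each with probability 1/4
J_uniform <- function(a, islands) sample(islands[islands != a], 1)

# acceptance rule: stay on a smaller proposal island with probability P(b)/P(a)
A_metropolis <- function(b, a, pop) min(1, pop[b] / pop[a])

set.seed(1234)   # arbitrary seed, just for reproducibility
tour <- simulate_tour(1e5, pop, J_uniform, A_metropolis)
table(tour) / length(tour)   # compare to 1/15, ..., 5/15 = 0.067 0.133 0.200 0.267 0.333

With \(10^5\) nights, the observed frequencies should match the required proportions to within about a percentage point.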

Question 10: What does this have to do with Bayesian inference?

  • What does an island represent?
  • What does \(p(a)\) represent? [Hint: What do we always want to know when doing Bayesian inference?]
  • What does the population \(P(a)\) represent?
  • Why is \(P(a)\) easier to calculate than \(p(a)\)? [This explains why in our story, the King doesn’t know any of the other populations except for the two islands visited on a given day.]
  • Why is it good enough to visit state \(a\) with probability \(p(a)\) without actually knowing what \(p(a)\) is?