King Markov is king of a chain of 5 islands. Rather than live in a palace, he lives in a royal boat. Each night the royal boat anchors in the harbor of one of the islands. The law declares that the king must harbor at each island in proportion to the population of the island.
Question 1: If the populations of the islands are 100, 200, 300, 400, and 500 people, how often must King Markov harbor at each island?
island | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
proportion | 1/15 | 2/15 | 3/15 | 4/15 | 5/15 |
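A quick arithmetic check in R (the vector name `pop` is ours, not part of the problem):

pop <- c(100, 200, 300, 400, 500)   # island populations from Question 1
pop / sum(pop)                      # the law-prescribed proportions 1/15, 2/15, ..., 5/15

## [1] 0.06666667 0.13333333 0.20000000 0.26666667 0.33333333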
King Markov has some personality quirks:
He can’t stand record keeping. So he doesn’t know the populations on his islands and doesn’t keep track of which islands he has visited when.
He can’t stand routine (variety is the spice of his life), so he doesn’t want to know each night where he will be the next night.
He asks Adviser Metropolis to devise a way for him to obey the law despite these quirks.
He can ask the clerk on any island what the island’s population is whenever he needs to know. But it takes half a day to sail from one island to another, so he is limited in how much information he can obtain this way each day.
Metropolis devises the following scheme (sketched in code after the list):
Each morning, have breakfast with the island clerk and inquire about the population of the current island.
Then randomly pick one of the 4 other islands (a proposal island) and travel there in the morning.
Let \(J(b \mid a)\) be the conditional probability of selecting island \(b\) as the proposal island if \(a\) is the current island.
\(J(b \mid a)\) may not depend on the populations of the islands (since the king can’t remember them).
Over lunch at the proposal island, inquire about its population.
If the proposal island has more people, stay at the proposal island for the night (since the king should prefer more populated islands).
If the proposal island has fewer people, stay at the proposal island with probability \(A\), else return to the “current” island (ie, the previous night’s island).
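The scheme above is easy to express in code. Here is a minimal sketch of one day of the tour in R; the acceptance rule is passed in as a function `A` because choosing it is exactly what the rest of this handout works out. The names `one_day`, `pop`, and `A` are ours, not part of the handout.

# One day of Metropolis' scheme.
#   current: index of last night's island;  pop: vector of island populations;
#   A: function(pop_proposal, pop_current) giving the probability of staying
#      at a *smaller* proposal island -- to be determined below.
one_day <- function(current, pop, A) {
  proposal <- sample(setdiff(seq_along(pop), current), 1)   # pick one of the other islands
  if (pop[proposal] > pop[current]) return(proposal)        # larger island: always stay
  if (runif(1) < A(pop[proposal], pop[current])) proposal else current
}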
Metropolis is convinced that for the right choices of \(J\) and \(A\), this will satisfy the law.
He quickly determines that \(A\) cannot be 0 and cannot be 1:
Question 2a. What happens if \(A = 0\)? What happens if \(A = 1\)?
Question 2b. Why must \(A\) depend on the populations of islands \(a\) and \(b\)? (Show that if \(A\) doesn’t depend on the populations, then Metropolis’ algorithm will not satisfy the law.)
If \(A = 0\), then the king will always stay at the larger of the two islands, so once he gets to the largest island, he will stay there forever.
If \(A = 1\), it is a little subtler. In this case, if the proposal island is larger, the king stays there (as he always does when the proposal island has the larger population). If the proposal island is smaller, he also stays there, because \(A = 1\). So the king’s tour is completely determined by which islands are proposed, which doesn’t depend on the populations, so it can’t possibly get the proportions right.
In fact, that same argument holds for any \(A\) that doesn’t depend on the populations. Since the tour is independent of the populations, at best it could be correct for one set of population ratios – it can’t work in general.
Let’s denote \(A\) as \(A(b \mid a)\) to make this clear and let \(P(a)\) be the population of island \(a\).
Quick notation review:
\(J(b \mid a) =\) probability of choosing to journey to island \(b\) in the morning if King Markov was on island \(a\) the previous night. [J for journey.]
\(A(b \mid a) =\) the probability of staying on island \(b\) if King Markov has just journeyed there from island \(a\). [A for accept.]
\(P(a) =\) the population of island \(a\). [P for population.]
\(p(a) = \frac{P(a)}{\sum_x P(x)} =\) the proportion of the entire population that lives on island \(a\). [p for proportion.]
But how should \(A\) be chosen? If \(A\) is too large, the king will visit small islands too often. If \(A\) is too small, he will visit large islands too often.
Fortunately, Metropolis knows about Markov Chains. Unfortunately, some of you may not. So let’s learn a little bit about Markov Chains and then figure out how Metropolis should choose \(J\) and \(A\).
This is going to be very quick. If you are interested, there is much more to learn about Markov chains elsewhere.
Consider a random process that proceeds in discrete steps (often referred to as time). Let \(X_t\) represent the “state” of the process at time \(t\). Since this is a random process, \(X_t\) is random, and we can ask probability questions like “What is the probability of being in state ____ at time ____?”, i.e.,
\[ \mbox{What is $\mathrm{Pr}(X_t = x)$?} \]
If \[ \mathrm{Pr}(X_{t+1} = x \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots , X_0 = x_0) = \mathrm{Pr}(X_{t+1} = x \mid X_{t} = x_t) \] then we say that the process is a Markov Chain. The intuition is that (the probabilities of) what happens next depends only on the current state and not on the previous history.
The simplest version of a Markov Chain is one that is time-homogeneous:
\[ \mathrm{Pr}(X_{t+1} = b \mid X_t = a) = M_{ab} \] That is, the (conditional) probability of moving from state \(a\) to state \(b\) in one step is the same at every time. (\(M\) for Metropolis, and because we have already used \(P\) and \(p\), and because of the next section.)
A time-homogeneous Markov Chain can be represented by a square matrix \(M\) with
\[ M_{ij} = \mbox{probability of transition from state $i$ to state $j$ in one step} \] (This will be an infinite matrix if the state space is infinite, but we’ll start with simple examples with small, finite state spaces.) More generally, we will write \[ M^{(k)}_{ij} = \mbox{probability of transition from state $i$ to state $j$ in $k$ steps} \]
Small Example:
# Transition matrix: entry (i, j) is the probability of moving from state i to state j in one step
M <- rbind( c(0, 0.5, 0.5), c(0.25, 0.25, 0.5), c(0.5, 0.3, 0.2))
M %>% pander::pander()  # the %>% pipe requires magrittr (or dplyr) to be loaded
|        | to 1 | to 2 | to 3 |
|--------|------|------|------|
| from 1 | 0    | 0.5  | 0.5  |
| from 2 | 0.25 | 0.25 | 0.5  |
| from 3 | 0.5  | 0.3  | 0.2  |
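Given such a matrix, one step of the chain can be simulated by sampling the next state with probabilities taken from the current state’s row (a minimal sketch; the function name `next_state` is ours):

next_state <- function(a, M) sample(seq_len(ncol(M)), 1, prob = M[a, ])
next_state(1, M)   # returns 2 or 3, each with probability 0.5 (never 1, since M[1, 1] = 0)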
Question 3:
How many states does this process have?
What is the probability of moving from state 1 to state 3 in 1 step?
What is the probability of moving from state 1 to state 3 in 2 steps? (Hint: what are the possible stopping points along the way?)
How do we obtain \(M^{(2)}\) from \(M\)?
How do we obtain \(M^{(k)}\) from \(M\)?
You should see that the probability can be written as a sum of products. That should remind you of a dot product, which should remind you of matrix multiplication. What do you get when you multiply \(M \cdot M\)?
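As a quick numerical check of that idea (using the matrix `M` from above), compare the (1, 3) entry of the matrix product with the sum of products over the possible intermediate states:

(M %*% M)[1, 3]                                             # Pr(state 1 -> state 3 in 2 steps)
M[1, 1] * M[1, 3] + M[1, 2] * M[2, 3] + M[1, 3] * M[3, 3]   # the same sum of products, written out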
Question 4: The Metropolis Algorithm is a Markov process. Express your answers in terms of \(J\), \(A\), and \(P\) where necessary.
What are the states of the Metropolis algorithm?
If King Markov stays on Island 2 one night, what is the probability that he will stay on Island 3 the next night?
If King Markov stays on Island 3 one night, what is the probability that he will stay on Island 2 the next night?
What is the general formula for the probability of moving from island \(a\) to island \(b\) (in one step)?
\[\mathrm{Pr}(X_{t+1}=b \mid X_t = a) =\]
One state per island.
The interesting case here is when \(a \neq b\), so you can limit yourself to that case. (The case where \(a = b\) is a bit more complicated, but won’t be involved in our main line of argument. Sometimes you just get lucky that way.)
To work out the probabilities, remember that two things must happen to move from \(a\) to \(b\): The king must journey there in the morning and the proposal island must be accepted as the next evening location. Keep in mind that if \(P(b) > P(a)\), then \(A(b \mid a) = 1\). You may want to consider two cases: (a) \(P(b) > P(a)\) and (b) \(P(a) > P(b)\). (We’ll think about \(P(a) = P(b)\) a little bit later, but it won’t be a problem.)
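For numerical experiments, here is a sketch of how the transition matrix for Metropolis’ scheme could be assembled in R. The jump rule `J` and the acceptance rule `A` are passed in as functions, since the handout has not pinned them down yet; the name `metropolis_matrix` is ours.

# Sketch: one-step transition matrix for Metropolis' scheme.
#   pop: vector of island populations
#   J:   function(b, a) giving the probability of proposing island b from island a
#   A:   function(b, a, pop) giving the probability of accepting proposal b from a
metropolis_matrix <- function(pop, J, A) {
  n <- length(pop)
  M <- matrix(0, n, n)
  for (a in 1:n) for (b in 1:n) {
    if (a != b) M[a, b] <- J(b, a) * A(b, a, pop)   # must journey there AND accept
  }
  diag(M) <- 1 - rowSums(M)                         # otherwise the king stays put
  M
}

Once you have worked out \(J\) and \(A\) (Questions 6–9 below), you can plug them in and check that every row of a large power of this matrix is (approximately) the legally prescribed proportions.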
A time-homogeneous Markov Chain is called regular if there is a number \(k\) such that every entry of \(M^{(k)}\) is positive, i.e., it is possible to get from any state to any state in exactly \(k\) steps.
Question 5a
Question 5b
Regular Markov Chains have a very nice property:
\[ \lim_{k \to \infty} M^{(k)} = W \ \ \mbox{and every row of $W$ is the same} \]
This says that, no matter where you start the process, the long-run probability of being in each state will be the same.
In our small example above, convergence is quite rapid:
library(expm)  # provides the matrix power operator %^%
M %^% 20
## [,1] [,2] [,3]
## [1,] 0.2769231 0.3384615 0.3846154
## [2,] 0.2769231 0.3384615 0.3846154
## [3,] 0.2769231 0.3384615 0.3846154
M %^% 21
## [,1] [,2] [,3]
## [1,] 0.2769231 0.3384615 0.3846154
## [2,] 0.2769231 0.3384615 0.3846154
## [3,] 0.2769231 0.3384615 0.3846154
Note: If we apply the matrix \(M\) to the limiting probability (\(w\), one row of \(W\)), we just get \(w\) back again:
\[w M = w\]
W <- M %^% 30
W[1,]
## [1] 0.2769231 0.3384615 0.3846154
W[1,] %*% M
## [,1] [,2] [,3]
## [1,] 0.2769231 0.3384615 0.3846154
In fact, for a regular chain this is a necessary and sufficient condition: a probability vector \(w\) is the limiting probability exactly when \(wM = w\).
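There is also a way to compute \(w\) directly instead of taking a large matrix power: since \(wM = w\) says that \(w\) is a left eigenvector of \(M\) with eigenvalue 1, we can ask base R’s `eigen()` for it (a sketch):

e <- eigen(t(M))                     # eigenvectors of t(M) are left eigenvectors of M
i <- which.min(abs(e$values - 1))    # locate the eigenvalue equal to 1
w <- Re(e$vectors[, i])
w / sum(w)                           # rescale so the entries sum to 1

## [1] 0.2769231 0.3384615 0.3846154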
So, here’s what Metropolis needs to do: Choose \(J\) and \(A\) so that
His algorithm is a regular Markov Chain with matrix \(M\).
If \(w = \langle p(1), p(2), p(3), p(4), p(5) \rangle\) is the law-prescribed probabilities for island harboring, then \(w M = w\).
If \(A(b \mid a) > 0\) and the jumping rule allows us to get to all the islands (eventually), then the Markov Chain will be regular, so there will be a limiting distribution. We still need to make sure that this limiting distribution is the one the law requires.
Let \(w = \langle p(1), p(2), ..., p(5)\rangle\) be the legally prescribed probability vector. It suffices to show that
\[wM = w\] (This requires some justification, but intuitively it should seem plausible that if there is a limiting matrix with each row equal to \(w\), then as we get close to the limit, things shouldn’t change much, and perhaps at the limit, things don’t change at all. For now, you just have to trust the folks who know Markov chains…)
So what does \(wM = w\) say? It says that if the probabilities are correct at step \(t\), then they will be correct again at step \(t + 1\), since multiplying by \(M\) does one step of the Markov chain. So we want to show that
\[ \mathrm{Pr}(X_t = a) = p(a) \mbox{ for all $a$ } \Rightarrow \mathrm{Pr}(X_{t+1} = a) = p(a) \mbox{ for all $a$} \]
Here’s the trick: We will choose \(J\) and \(A\) so that the following two unconditional probabilities are equal.
\[ \mathrm{Pr}(a \to_t b) = \mathrm{Pr}(b \to_t a) \] where \(\mathrm{Pr}(a \to_t b) = \mathrm{Pr}(X_t = a \ \& \ X_{t+1} = b)\).
Why does this work?
Suppose \(\mathrm{Pr}(X_t = a) = p(a)\) as the law prescribes.
\(\mathrm{Pr}(a \to_t b) = \mathrm{Pr}(b \to_t a)\) makes the joint distribution symmetric: for any \(a\) and any \(b\),
\[\mathrm{Pr}(X_{t} = a \ \& \ X_{t+1} = b) = \mathrm{Pr}(X_{t} = b \ \& \ X_{t+1} = a)\]
Summing both sides over \(b\) (the law of total probability) gives
\[\mathrm{Pr}(X_t = a) = \mathrm{Pr}(X_{t+1} = a)\]
so \(\mathrm{Pr}(X_{t+1} = a) = p(a)\), as the law requires.
Now we just need to find \(J\) and \(A\) to make this happen.
How do we choose \(J\) and \(A\)? Time for some algebra (and probability)!
Recall the ingredients: the proportions \(p(\cdot)\), the jump rule \(J(\cdot \mid \cdot)\), and the acceptance rule \(A(\cdot \mid \cdot)\), as defined in the notation review above.
Question 6: Consider two islands – \(a\) and \(b\) – with \(P(b) > P(a)\). Assume that the probability of being on island \(x\) is \(p(x)\). Calculate the following probabilities (in terms of things like \(p\), \(J\), and \(A\)): \(\mathrm{Pr}(a \to_t b)\) and \(\mathrm{Pr}(b \to_t a)\).
Question 7: How do we choose \(A\) to make these probabilities equal? Will this work for any jumping rule \(J\)?
\[ A(a \mid b) = \phantom{\frac{p(a) J(b \mid a)}{p(b) J(a \mid b)}} \]
Question 8: Symmetric jump rules.
What happens to the formula for \(A(a \mid b)\) if the jumping rule is symmetric: \(J(b \mid a) = J(a \mid b)\)?
Remember, the king doesn’t like to remember stuff, and this means half as much stuff to remember about the jump rules.
Will the king like the new acceptance rule? (The king won’t be so happy if the simpler jump rule makes the acceptance rule a lot more complicated.)
Question 9: Constant jump rules.
Is it possible to do this if we require that \(J(y \mid x) = J(y' \mid x')\) for all \(y \neq x\) and \(y' \neq x'\)? (This would make life even easier for the king.)
For King Markov, what would \(J\) be if we did it this way?
The original Metropolis algorithm used symmetric jump rules. The later generalization (Metropolis-Hastings) employed non-symmetric jump rules to get better performance of the Markov Chain.
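Here, finally, is a minimal simulation sketch of King Markov’s tour. It assumes the symmetric, uniform jump rule from Question 9 (propose one of the other four islands at random) and uses \(\min(1, P(b)/P(a))\) for the acceptance probability, the standard Metropolis choice for a symmetric jump rule; substitute your own answer to Question 7 if it differs. The variable names are ours.

set.seed(1234)
pop <- c(100, 200, 300, 400, 500)     # island populations
nights <- 1e5
tour <- numeric(nights)
tour[1] <- 1                          # the starting island doesn't matter in the long run
for (t in 2:nights) {
  a <- tour[t - 1]
  b <- sample(setdiff(1:5, a), 1)     # symmetric, uniform jump rule
  accept <- min(1, pop[b] / pop[a])   # standard Metropolis acceptance probability
  tour[t] <- if (runif(1) < accept) b else a
}
table(tour) / nights                  # compare with 1/15, 2/15, 3/15, 4/15, 5/15

The relative frequencies of the five islands should land close to the proportions from Question 1.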
Question 10: What does this have to do with Bayesian inference?