\[ \operatorname{Pr}(A \mid B) = \frac{\operatorname{Pr}(\mbox{both})}{\operatorname{Pr}(\mbox{condition})} = \frac{\operatorname{Pr}(\mbox{joint})}{\operatorname{Pr}(\mbox{marginal})} = \frac{\operatorname{Pr}(A, B)}{\operatorname{Pr}(B)} \]
Rearranging:
\[\begin{align*} \operatorname{Pr}(A, B) &= \operatorname{Pr}(B) \cdot \operatorname{Pr}(A \mid B) \\ &= \operatorname{Pr}(A) \cdot \operatorname{Pr}(B \mid A) \end{align*}\]
More rearranging:
\[\begin{align*} \operatorname{Pr}(B) \cdot \operatorname{Pr}(A \mid B) &= \operatorname{Pr}(A) \cdot \operatorname{Pr}(B \mid A) \\[5mm] \operatorname{Pr}(A \mid B) &= \frac{\operatorname{Pr}(A) \cdot \operatorname{Pr}(B \mid A)}{\operatorname{Pr}(B)} \end{align*}\]
And some different letters (and words):
\[\begin{align*} \operatorname{Pr}(H \mid D) &= \frac{\operatorname{Pr}(H) \cdot \operatorname{Pr}(D \mid H)}{\operatorname{Pr}(D)} \\[5mm] \operatorname{Pr}(\mbox{hypothesis} \mid \mbox{data}) &= \frac{\operatorname{Pr}(\mbox{hypothesis}) \cdot \operatorname{Pr}(\mbox{data} \mid \mbox{hypothesis})}{\operatorname{Pr}(\mbox{data})} \end{align*}\]
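As a quick sanity check of this identity, here is a small R sketch with made-up probabilities (the numbers 0.12, 0.3, and 0.4 are arbitrary, chosen only so the arithmetic is easy to follow): computing \(\operatorname{Pr}(A \mid B)\) directly from the definition and via Bayes' rule gives the same answer.

```r
# made-up probabilities, chosen only to illustrate the identity
p_AB <- 0.12   # Pr(A, B), the joint probability
p_A  <- 0.30   # Pr(A)
p_B  <- 0.40   # Pr(B)

p_B_given_A <- p_AB / p_A      # Pr(B | A), from the definition
p_AB / p_B                     # Pr(A | B) directly from the definition: 0.3
p_A * p_B_given_A / p_B        # Pr(A | B) via Bayes' rule: also 0.3
```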
Suppose we have a coin with probability of heads = \(\theta\), which is either 0, 0.2, 0.4, 0.6, 0.8, or 1. We flip the coin 4 times and get HHTT.
We would like to know
\[ \operatorname{Pr}(\mbox{hypothesis} \mid \mbox{data}) = \operatorname{Pr}( \theta = 0.2 \mid HHTT) \] Using the formula above we get
\[\begin{align*} \operatorname{Pr}(\theta = 0.2 \mid HHTT) &= \frac{\operatorname{Pr}(\theta = 0.2) \cdot \operatorname{Pr}(HHTT \mid \theta = 0.2)}{\operatorname{Pr}(HHTT)} \end{align*}\]
This is (a version of) the key equation that drives all of Bayesian analysis. Of the things on the right side, which parts are easy, and which are more challenging?

- \(\operatorname{Pr}(\theta = 0.2) = \mbox{prior} = 1/6\)
- \(\operatorname{Pr}(HHTT \mid \theta = 0.2) = \mbox{likelihood} = 0.2 \cdot 0.2 \cdot 0.8 \cdot 0.8\)
    - This is generally “easy” given the model – the model must specify how data would be generated if the model is correct.
    - This measures how likely the data would be if \(\theta = 0.2\).
    - This is really \(\operatorname{Pr}(H) \cdot \operatorname{Pr}(HH \mid H) \cdot \operatorname{Pr}(HHT \mid HH) \cdot \operatorname{Pr}(HHTT \mid HHT)\), but since the coin tosses are independent, the conditional probabilities are the same as the marginal probabilities. (The quick R check below confirms the arithmetic.)
- The only tricky bit is the denominator, and we’ll come back to that in just a moment.
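The likelihood arithmetic is easy to check directly. Here is a minimal R sketch; the names `theta`, `flips`, and `per_toss` are just illustrative, not objects we need later. The last line also shows the generative side of the model: once we fix \(\theta\), we know how to simulate new data.

```r
theta <- 0.2
flips <- c("H", "H", "T", "T")                      # the observed data, HHTT

# probability of each observed toss if Pr(heads) = theta
per_toss <- ifelse(flips == "H", theta, 1 - theta)
prod(per_toss)                                      # 0.2 * 0.2 * 0.8 * 0.8 = 0.0256

# the model is generative: given theta, we can also simulate new data
sample(c("H", "T"), size = 4, replace = TRUE, prob = c(theta, 1 - theta))
```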
But first let’s compute the numerators for all 6 possible values of \(\theta\) – using R:
Let’s put all of our arithmetic into a nice table:
```r
library(dplyr)   # provides tibble() and %>%
library(pander)  # provides pander()

tibble(                                   # create a data frame
  p = 0:5 / 5,                            # our six possible probabilities
  prior = 1/6,                            # this will get copied to every row
  likelihood = p * p * (1-p) * (1-p),     # we can refer back to previous columns
  numerator = prior * likelihood
) %>% pander()                            # make the output fancy
```
p | prior | likelihood | numerator |
---|---|---|---|
0 | 0.1667 | 0 | 0 |
0.2 | 0.1667 | 0.0256 | 0.004267 |
0.4 | 0.1667 | 0.0576 | 0.0096 |
0.6 | 0.1667 | 0.0576 | 0.0096 |
0.8 | 0.1667 | 0.0256 | 0.004267 |
1 | 0.1667 | 0 | 0 |
Notice two things:

1. The denominator, \(\operatorname{Pr}(HHTT)\), is the same for each row of our table.
2. Since the posterior should be a distribution, the values of the posterior must sum to 1.

So we can simply rescale proportionally by dividing by the sum of the numerators.
```r
tibble(
  p = 0:5 / 5,
  prior = 1/6,
  likelihood = p * p * (1-p) * (1-p),
  numerator = prior * likelihood,
  posterior = numerator / sum(numerator)  # rescale so the posterior sums to 1
) %>% pander()
```
p | prior | likelihood | numerator | posterior |
---|---|---|---|---|
0 | 0.1667 | 0 | 0 | 0 |
0.2 | 0.1667 | 0.0256 | 0.004267 | 0.1538 |
0.4 | 0.1667 | 0.0576 | 0.0096 | 0.3462 |
0.6 | 0.1667 | 0.0576 | 0.0096 | 0.3462 |
0.8 | 0.1667 | 0.0256 | 0.004267 | 0.1538 |
1 | 0.1667 | 0 | 0 | 0 |
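One last note on the denominator: by the law of total probability, \(\operatorname{Pr}(HHTT)\) is exactly the sum of the numerators, so dividing by `sum(numerator)` really is dividing by \(\operatorname{Pr}(HHTT)\). A quick check, recomputing the same quantities as in the table above:

```r
p <- 0:5 / 5
prior <- 1/6
likelihood <- p * p * (1 - p) * (1 - p)
numerator <- prior * likelihood

sum(numerator)                    # Pr(HHTT), about 0.0277
sum(numerator / sum(numerator))   # the posterior sums to 1
```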
In a real application, we won’t have a finite list of possibilities but will want to consider any possible value for \(\theta\). Come back next time to find out how we deal with that.
This statement is a little bit too casual, but it will work for now to develop our intuition about priors and Bayesian inference. It would be better to say that the prior is just a part of the model, and that there are multiple considerations that go into the choice of prior.↩︎