\[ \operatorname{Pr}(A \mid B) = \frac{\operatorname{Pr}(\mbox{both})}{\operatorname{Pr}(\mbox{condition})} = \frac{\operatorname{Pr}(\mbox{joint})}{\operatorname{Pr}(\mbox{marginal})} = \frac{\operatorname{Pr}(A, B)}{\operatorname{Pr}(B)} \]

Rearranging:

\[\begin{align*} \operatorname{Pr}(A, B) &= \operatorname{Pr}(B) \cdot \operatorname{Pr}(A \mid B) \\ &= \operatorname{Pr}(A) \cdot \operatorname{Pr}(B \mid A) \end{align*}\]

More rearranging:

\[\begin{align*} \operatorname{Pr}(B) \operatorname{Pr}(A \mid B) &= \operatorname{Pr}(A) \cdot \operatorname{Pr}(B \mid A) \\[5mm] \operatorname{Pr}(A \mid B) &= \frac{\operatorname{Pr}(A) \cdot \operatorname{Pr}(B \mid A)}{\operatorname{Pr}(B)} \end{align*}\]

And some different letters (and words):

\[\begin{align*} \operatorname{Pr}(H \mid D) &= \frac{\operatorname{Pr}(H) \cdot \operatorname{Pr}(D \mid H)}{\operatorname{Pr}(D)} \\[5mm] \operatorname{Pr}(\mbox{hypothesis} \mid \mbox{data}) &= \frac{\operatorname{Pr}(\mbox{hypothesis}) \cdot \operatorname{Pr}(\mbox{data} \mid \mbox{hypothesis})}{\operatorname{Pr}(\mbox{data})} \end{align*}\]

Back to those coins…

Suppose we have a coin with probability of heads = \(\theta\), which is either 0, 0.2, 0.4, 0.6, 0.8, or 1. We flip the coin 4 times and get HHTT.

We would like to know

\[ \operatorname{Pr}(\mbox{hypothesis} \mid \mbox{data}) = \operatorname{Pr}( \theta = 0.2 \mid HHTT) \]

Using the formula above, we get

\[\begin{align*} \operatorname{Pr}(\theta = 0.2 \mid HHTT) &= \frac{\operatorname{Pr}(\theta = 0.2) \cdot \operatorname{Pr}(HHTT \mid \theta = 0.2)}{\operatorname{Pr}(HHTT)} \end{align*}\]

This is (a version of) the key equation that drives all of Bayesian analysis. Of the things on the right side, which parts are easy, and which are more challenging?

  • \(\operatorname{Pr}(\theta = 0.2) = \mbox{prior} = 1/6\)

    • This is what we know/believe about \(\theta\) before we collect the data.1


  • \(\operatorname{Pr}(HHTT \mid \theta = 0.2) = \mbox{likelihood} = 0.2 \cdot 0.2 \cdot 0.8 \cdot 0.8\)

    • This is generally “easy” given the model – the model must specify how data would be generated if the model is correct.

    • This measures how likely the data would be if \(\theta = 0.2\). (There is a quick check of this value in R just after this list.)

    • This is really \(\operatorname{Pr}(H) \cdot \operatorname{Pr}(HH \mid H) \cdot \operatorname{Pr}(HHT \mid HH) \cdot \operatorname{Pr}(HHTT \mid HHT)\), but since the coin tosses are independent, the conditional probabilities are the same as the marginal probabilities.


  • The only tricky bit is the denominator, and we’ll come back to that in just a moment.

    But first let’s compute the numerators for all 6 possible values of \(\theta\) – using R.
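Before building the full table, here is a quick check of just the likelihood value for \(\theta = 0.2\). (The dbinom() line is an added aside: the binomial density counts all \(\binom{4}{2} = 6\) orderings of two heads and two tails, so we divide that back out to get the probability of the specific sequence HHTT.)

0.2 * 0.2 * 0.8 * 0.8              # probability of the sequence HHTT when theta = 0.2: 0.0256
dbinom(2, 4, 0.2) / choose(4, 2)   # same value from the binomial density, orderings divided out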



Using R to do multiple things at once

Let’s put all of our arithmetic into a nice table:

tibble(                                  # create a data frame
  p = 0:5 / 5,                           # our six possible probabilities
  prior = 1/6,                           # this will get copied to every row
  likelihood = p * p * (1-p) * (1-p),    # we can refer back to previous columns
  numerator = prior * likelihood
) %>% pander()                           # make the output fancy
  p     prior    likelihood   numerator
-----  --------  -----------  ----------
  0     0.1667     0           0
 0.2    0.1667     0.0256      0.004267
 0.4    0.1667     0.0576      0.0096
 0.6    0.1667     0.0576      0.0096
 0.8    0.1667     0.0256      0.004267
  1     0.1667     0           0




What about that denominator?

Notice two things:

  1. It is the same for each row of our table.

  2. Since the posterior is a probability distribution, its values must sum to 1.

So we can simply rescale proportionally by dividing by the sum of the numerators.
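An added aside (not needed for the rescaling argument): by the law of total probability, the sum of the numerators is exactly the denominator, \(\operatorname{Pr}(HHTT) = \sum_{\theta} \operatorname{Pr}(\theta) \cdot \operatorname{Pr}(HHTT \mid \theta)\), so rescaling this way really is dividing by \(\operatorname{Pr}(HHTT)\). A quick check in R:

p <- 0:5 / 5
sum(1/6 * p^2 * (1 - p)^2)    # Pr(HHTT), approximately 0.02773; equals the sum of the numerator column

Now let’s add a posterior column to our table: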

tibble(
  p = 0:5 / 5,
  prior = 1/6,
  likelihood = p * p * (1-p) * (1-p),
  numerator = prior * likelihood,
  posterior = numerator / sum(numerator)
) %>% pander()
  p     prior    likelihood   numerator   posterior
-----  --------  -----------  ----------  ----------
  0     0.1667     0           0           0
 0.2    0.1667     0.0256      0.004267    0.1538
 0.4    0.1667     0.0576      0.0096      0.3462
 0.6    0.1667     0.0576      0.0096      0.3462
 0.8    0.1667     0.0256      0.004267    0.1538
  1     0.1667     0           0           0

Still to come

In a real application, we won’t have a finite list of possibilities but will want to consider any possible value for \(\theta\). Come back next time to find out how we deal with that.


  1. This statement is a little bit too casual, but it will work for now to develop our intuition about priors and Bayesian inference. It would be better to say that the prior is just a part of the model, and that there are multiple considerations that go into the choice of prior.↩︎