Beta distributions

A few important facts about beta distributions

Beta and Bayes

Suppose we want to estimate a proportion \(\theta\) by repeating some random process (like a coin toss) \(N\) times. We will code each result using a 0 (failure) or a 1 (success): \(Y_1, Y_2, \dots, Y_N\). Here’s our model.
The prior, to be determined shortly, is indicated as ??? for the moment. \[\begin{align*} Y_i & \sim {\sf Bern}(\theta) = {\sf Binom}(1, \theta) \\ \theta & \sim{} ??? \end{align*}\]
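Before turning to the prior, it can help to see what data from this model look like. Here is a quick simulation sketch; the “true” value of \(\theta\) below is an arbitrary choice for illustration, since in practice it is unknown.

set.seed(123)
theta_true <- 0.6                             # arbitrary value, for illustration only
y <- rbinom(20, size = 1, prob = theta_true)  # 20 Bernoulli trials coded 0/1
y
sum(y)                                        # x, the number of successes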

The Bernoulli likelihood function

The first line turns into the following likelihood function – the probability of observing \(y_i\) for a given parameter value \(\theta\):

\[\begin{align*} \operatorname{Pr}(Y_i = y_i \mid \theta) = p(y_i \mid \theta) &= \begin{cases} \theta & y_i = 1 \\ (1-\theta) & y_i = 0 \end{cases} \\ &= \theta^{y_i} (1-\theta)^{1 - y_i} \end{align*}\]

The likelihood for the entire data set is then \[\begin{align*} p(\langle y_1, y_2, \dots, y_N \rangle \mid \theta) &= \prod_{i = 1}^N \theta^{y_i} (1-\theta)^{1 - y_i} \\ &= \theta^{x} (1-\theta)^{N - x} \end{align*}\] where \(x = \sum_{i=1}^N y_i\) is the number of “successes” and \(N\) is the number of trials. Since the likelihood depends only on \(x\) and \(N\), not on the particular order in which the 0’s and 1’s are observed, we will write the likelihood as

\[\begin{align*} p(x, N \mid \theta) &= \theta^{x} (1-\theta)^{N - x} \end{align*}\]

Reminder: If we think of this expression as a function of \(\theta\) for fixed data (rather than as a function of the data for fixed \(\theta\)), we see that it is the kernel of a \({\sf Beta}(x + 1, N - x + 1)\) distribution. Even thought of this way, however, the likelihood need not be a PDF – its sum or integral need not be 1. We will sometimes normalize likelihood functions when we want to display them on plots alongside priors and posteriors.
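To make this concrete, here is a quick numerical check using hypothetical data with \(x = 7\) successes in \(N = 10\) trials: rescaling the likelihood so that it integrates to 1 recovers exactly the \({\sf Beta}(x + 1, N - x + 1)\) density.

x <- 7; N <- 10                                        # hypothetical data
theta <- seq(0.01, 0.99, by = 0.01)
likelihood <- theta^x * (1 - theta)^(N - x)
# divide by the total area to turn the likelihood into a density
area <- integrate(function(t) t^x * (1 - t)^(N - x), 0, 1)$value
all.equal(likelihood / area, dbeta(theta, x + 1, N - x + 1))  # TRUE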

A convenient prior

Now let's think about our posterior:

\[\begin{align*} p(\theta \mid x, N) & = \overbrace{p(x, N \mid \theta)}^{\mathrm{likelihood}} \cdot \overbrace{p(\theta)}^{\mathrm{prior}} / p(x, N) \\ & = {\theta^x (1-\theta)^{N - x}} \cdot {p(\theta)} / p(x, N) \end{align*}\] If we let \(p(\theta) \propto \theta^{a - 1} (1-\theta)^{b - 1}\) – the kernel of a \({\sf Beta}(a, b)\) distribution – the product is especially easy to evaluate: \[\begin{align*} p(\theta \mid x, N) & \propto \overbrace{\theta^x (1-\theta)^{N - x}}^{\mathrm{likelihood}} \cdot \overbrace{\theta^{a - 1} (1-\theta)^{b - 1}}^{\mathrm{prior}} \\ & = \theta^{x + a - 1} (1-\theta)^{N - x + b - 1}, \end{align*}\] which is the kernel of a \({\sf Beta}(x + a, N - x + b)\) distribution. In this happy situation, when multiplying the likelihood by the prior leads to a posterior with the same form as the prior, we say that the prior is a conjugate prior (for that particular likelihood function). So beta priors are conjugate priors for the Bernoulli likelihood, and if we use a beta prior, we will get a beta posterior – and it is easy to calculate which one:

prior \({\sf Beta}(a, b)\)
data \(x\) successes in \(N\) trials
posterior \({\sf Beta}(x + a, N - x + b)\)
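In code, this update is a one-liner. Here is a minimal sketch (update_beta is a throwaway helper written for this example, not part of any package):

update_beta <- function(x, N, a = 1, b = 1) {
  c(shape1 = a + x, shape2 = b + N - x)   # Beta(a + x, b + N - x)
}
update_beta(x = 17, N = 20, a = 2, b = 2) # Beta(19, 5)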

Pros and Cons of conjugate priors

Pros: Easy and fast calculation; can reason about the relationship between prior, likelihood, and posterior based on known distributions.

Cons: We are restricted to using a conjugate prior, and that isn’t always the prior we want; many situations don’t have natural conjugate priors available; the computations are often not as simple as in our current example.

Getting to know the Beta distributions

Important facts

You can often look up this sort of information on the Wikipedia page for a family of distributions. If you go to https://en.wikipedia.org/wiki/Beta_distribution you will find, among other things, the following:

Notation Beta(\(\alpha, \beta\))
Parameters \(\alpha > 0\) shape (real)
\(\beta > 0\) shape (real)
Support \(\displaystyle x\in [0,1]\) or \(\displaystyle x\in (0,1)\)
PDF \(\displaystyle \frac{x^{\alpha -1}(1-x)^{\beta -1}}{\mathrm{B}(\alpha, \beta )}\)
Mean \(\displaystyle \frac{\alpha }{\alpha +\beta }\)
Mode \(\displaystyle {\frac{\alpha -1}{\alpha +\beta -2}}\) for \(\alpha, \beta > 1\)
0 for \(\alpha = 1, \beta > 1\)
1 for \(\alpha > 1, \beta = 1\)
Variance \(\displaystyle \frac{\alpha \beta}{(\alpha +\beta )^{2}(\alpha +\beta +1)}\)
Concentration \(\displaystyle \kappa =\alpha +\beta\)
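These formulas are easy to sanity-check by simulation; the shape values below are arbitrary.

set.seed(456)
a <- 3; b <- 7                                    # arbitrary shape parameters
draws <- rbeta(1e6, a, b)
c(mean(draws), a / (a + b))                       # both near 0.3
c(var(draws), a * b / ((a + b)^2 * (a + b + 1)))  # both near 0.019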

Alternative parameterizations of Beta distributions

There are several different parameterizations of the beta distributions that can be helpful in selecting a prior or interpreting a posterior.

Mode and concentration

Let the concentration be defined as \(\kappa =\alpha +\beta\). Since the mode (\(\omega\)) is \(\displaystyle {\frac{\alpha -1}{\alpha +\beta -2}}\) for \(\alpha, \beta > 1\), we can solve for \(\alpha\) and \(\beta\) to get

\[\begin{align} \alpha &= \omega (\kappa - 2) + 1\\ \beta &= (1-\omega)(\kappa -2) + 1 \end{align}\]
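Translated directly into code (a small helper of our own; beta_params(), introduced below, performs this conversion for you):

shapes_from_mode <- function(omega, kappa) {  # requires kappa > 2
  c(shape1 = omega * (kappa - 2) + 1,
    shape2 = (1 - omega) * (kappa - 2) + 1)
}
shapes_from_mode(omega = 0.3, kappa = 10)     # shape1 = 3.4, shape2 = 6.6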

Mean and concentration

The beta distribution may also be reparameterized in terms of its mean \(\mu\) and the concentration \(\kappa\). If we solve for \(\alpha\) and \(\beta\), we get

\[\begin{align} \alpha &= \mu \kappa \\ \beta &= (1 - \mu) \kappa \end{align}\]
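The same idea for the mean/concentration form (again a helper of our own):

shapes_from_mean_conc <- function(mu, kappa) {
  c(shape1 = mu * kappa, shape2 = (1 - mu) * kappa)
}
shapes_from_mean_conc(mu = 0.3, kappa = 10)   # shape1 = 3, shape2 = 7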

Mean and variance (or standard deviation)

We can also parameterize with the mean \(\mu\) and variance \(\sigma^2\). Solving the system of equations for mean and variance given in the table above, we get

\[\begin{align} \kappa &=\alpha +\beta = \frac{\mu (1-\mu )}{\sigma^2} - 1 \\ \alpha &=\mu \kappa = \mu \left({\frac{\mu (1-\mu )}{\sigma^2}}-1\right) \\ \beta &= (1-\mu )\kappa = (1-\mu ) \left({\frac{\mu (1-\mu)} {\sigma^2}} -1 \right), \end{align}\] provided \(\sigma^2 < \mu (1-\mu)\).
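And once more for mean and standard deviation, including a check of the constraint above. (Note that the result for mean 0.3 and sd 0.1 matches the beta_params() output shown below.)

shapes_from_mean_sd <- function(mu, sigma) {
  kappa <- mu * (1 - mu) / sigma^2 - 1
  stopifnot(kappa > 0)  # requires sigma^2 < mu * (1 - mu)
  c(shape1 = mu * kappa, shape2 = (1 - mu) * kappa)
}
shapes_from_mean_sd(mu = 0.3, sigma = 0.1)    # shape1 = 6, shape2 = 14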

beta_params()

CalvinBayes::beta_params() will compute several summaries of a beta distribution given any of these 2-parameter summaries. This can be very handy for converting from one type of information about a beta distribution to another.

For example, suppose you want a beta distribution with mean 0.3 and standard deviation 0.1. Which beta distribution is it?

library(CalvinBayes)
beta_params(mean = 0.3, sd = 0.1)
## # A tibble: 1 x 6
##   shape1 shape2  mean  mode    sd concentration
##    <dbl>  <dbl> <dbl> <dbl> <dbl>         <dbl>
## 1   6.00   14.0   0.3 0.278   0.1          20.0

We can do a similar thing with other combinations.

bind_rows(
  beta_params(mean = 0.3, concentration = 10),
  beta_params(mode = 0.3, concentration = 10),
  beta_params(mean = 0.3, sd = 0.2),
  beta_params(shape1 = 5, shape2 = 10),
)
## # A tibble: 4 x 6
##   shape1 shape2  mean  mode    sd concentration
##    <dbl>  <dbl> <dbl> <dbl> <dbl>         <dbl>
## 1   3      7    0.3   0.25  0.138         10   
## 2   3.4    6.6  0.34  0.3   0.143         10   
## 3   1.27   2.97 0.3   0.122 0.2            4.25
## 4   5     10    0.333 0.308 0.118         15

Automating Bayesian updates for a proportion (beta prior)

Since we have formulas for this case, we can write a function to handle any beta prior and any data set very simply. (This is much simpler than running the grid method each time.)

quick_bern_beta <- 
  function(
    x, n,     # data, successes and trials
    ...       # see clever trick below
  ) 
  {
    pars <- beta_params(...)
    a <- pars$shape1
    b <- pars$shape2
    
    theta_hat <- x / n  # value that makes likelihood largest
    posterior_mode <- (a + x - 1) / (a + b + n - 2)  
    
    # scale likelihood to be as tall as the posterior
    likelihood <- function(theta) {
      dbinom(x, n, theta) / dbinom(x, n, theta_hat) *
        dbeta(posterior_mode, a + x, b + n - x)  # posterior height at mode
    }
    
    gf_dist("beta", shape1 = a, shape2 = b, 
            color = ~ "prior", alpha = 0.5, xlim = c(0,1), size = 1.2) %>%
      gf_function(likelihood, 
                  color = ~ "likelihood", alpha = 0.5, size = 1.2) %>%
      gf_dist("beta", shape1 = a + x, shape2 = b + n - x, 
              color = ~ "posterior", alpha = 0.5, size = 1.6) %>%
      gf_labs(
        color = "function",
        title = paste0("posterior: Beta(", a + x, ", ", b + n - x, ")")
      ) %>% 
      gf_refine(
        scale_color_manual(
          values = c("prior" = "gray50", "likelihood" = "forestgreen", 
                     "posterior" = "steelblue")))
  } 

With such a function in hand, we can explore examples very quickly. Here are three examples from DBDA2e (pp. 134-135).

quick_bern_beta(17, 20, mode = 0.5, k = 500)

quick_bern_beta(17, 20, mode = 0.75, k = 25)

quick_bern_beta(17, 20, a = 1, b = 1)

What if the prior isn’t a beta distribution?

Unless the prior is some other distribution for which we can work things out mathematically, we are back to the grid method.

Here’s an example like the one on page 136.

# a two-peaked prior: a mixture of two narrow triangular bumps
# (on [0.2, 0.3] and [0.7, 0.8]) plus a small uniform component
# so the prior is positive everywhere on [0, 1]
dtwopeaks <- function(x) {
  0.48 * triangle::dtriangle(x, 0.2, 0.3) +
  0.48 * triangle::dtriangle(x, 0.7, 0.8) +
  0.04 * dunif(x)
}

BernGrid(data = c(rep(0, 13), rep(1, 14)), prior = dtwopeaks) %>%
  # overlay the likelihood for 14 successes in 27 trials, scaled (by 0.3) to fit the plot
  gf_function(function(theta) 0.3 * dbinom(14, 27, theta), color = "forestgreen")