Welcome to Stat 341

# Welcome to Stat 341
## Day 1
### Randall Pruim

---

# What is Computational Bayesian Statistics?

---

## Computational Bayesian Statistics

1. **Computational** -- It takes advantage of some very clever algorithms.

* We'll start with simpler methods and work up to more complex methods.
  
  * Simpler methods will be chosen to help prepare us, but those methods don't scale.

2. **Bayesian** -- An interesting frame work for thinking about statistical modeling.

* The fundamental logic of Bayesian statistics is different from the frequentist
  approach you may have seen before.
  
  * Interesting for both practical and philosophical reasons.

3. **Statistics** -- Very flexible system for creating statistical models.

* Limited only by our imagination and our computational tools.
  
  * What you learn about modeling here will help even when building non-Bayesian 
  models.
  
---

## Designing Models

Quotes from MacElreath

> What researchers need is a some unified theory of golem [model] engineering,
a set of principles for designing, building, and refining special-purpose
statistical procedures... There are benefits in rethinking statistical inference
as a set of strategies, instead of as a set of pre-made tools.

&nbsp;

> If you don't understand how the golem processes information, then you can't
interpret the golem's output.

---

## Countdown to Bayesian Statistics

---

## 4 Tools

**1.** Bayesian data analysis

> Once we have defined the statistical model, Bayesian data analysis
forces a purely logical way of processing the data to produce inference.

**2.** Model comparison and prediction

> Prefer models that make good predictions.

> Fitting is easy; prediction is hard.

**3.** Multi-level models

> Multilevel regression deserves to be the default form of regression.

**4.** Graphical Causal Models

> Models that are casually incorrect can make better predictions than models that are causally correct.

---

## 3 Challenges

**1.** Understanding the "Bayesian Framework"

* prior, likelihood, posterior

**2.** Understanding modeling

* model = pattern + variability
  
* parameters and distributions

**3.** Understanding the computation
    
* wrangle data, describe model, process model, visulization
    
* understanding how and why algorithms can fail or perform badly

*Be sure to focus on understanding things well early on so we can build
on a solid foundation later.*

---

##  2 Examples

**1.** Mystery Coins

**2.** Modeling Height

<br>

We'll get back to these shortly.

---

## 1 Assignment

**Problem Set 1** is due on Friday

* Most homework will be submitted via [**Gradescope**](https://gradescope.com)

* Submit your work as a PDF
  
  * Tag each problem so it easy to find for grading
  
* PS 1 is a bit different
  
  *  `ggformula` tutorial (may be familiar to some of you)
  
  * Post a plot to a google presentation (challenge yourself a bit 
  if you are already familiar with R and `ggformula`)
  
  * Fill out the pre-course survey if you haven't already done so.
  
  * Detail on the [howework](../../hw) page.
  
  
---

## Some notes on R code

* Practical Bayesian data analysis is not possible without computational tools
    * Our main tools: R, Stan (other options exist)
    
 * Don't focus on coding details today.
    * Sometimes I won't even show the code.
    * But in the future, be sure you understand the computation!
    
 * We'll learn how to do all this (and more) in R as we go along.
 
    * Visualization and data wrangling will be important parts of the process.
 
---

## R packages

We will make use of a number of R packages as we go along. If you try to mimic
the code on your own machine, you will need to use some of these packages.

But for today, just focus on the big ideas, not the code details.

```r
library(ggformula)       # for creating plots
library(patchwork)       # for combining multiple plots
library(dplyr)           # for data wrangling
library(mosaic)          # includes the previous 2 (and some other stuff)
library(pander)          # for pretty tables
library(CalvinBayes)     # includes BernGrid() 
library(coda)            # tools for working with posterior samples
library(bayesplot)       # plots of Bayesian models
library(brms)            # used to fit the model in the second exmample, 
                         # but hidden from view here
theme_set(theme_bw())    # change the default graphics settings
```

---

# Example 1: Which coin is it?

---

## Situation

Here is a toy situation, just to get us started.

* Coin that is known to result in heads in either 0, 20, 40, 60, 80, or 100% of tosses.

* We don't know which probability is correct.

* Plan: Gather data by flipping the coin and recording the results.

* Goal: Quantify what we know about which coin it might be.

---

## Notation

* Let `$\theta$` be the true probability of tossing a head.
 
 * `$\theta$` is the only **parameter** in this simple model.
 
 * 6 possibilities:  `$\theta = 0$`, `$\theta = 0.2$`, `$\theta = 0.4$`, 
   `$\theta = 0.6$`, `$\theta = 0.8$`, and `$\theta = 1$`.

---

## Credibility/Plausibility

Before collecting our data, if have no other information, we will consider
each coin to be equally credible/plausible. We could represent that as follows.

---

## First Toss: H

Now suppose we toss the coin and obtain a head.  What does that do to our 
credibilities?

```r
Coins1 <- tibble(
    theta = (0:5) / 5,
    credibility = 0.2)

gf_col(credibility ~ theta, data = Coins1, fill = "steelblue", alpha = 0.5) %>%
  gf_labs(title = "Credibility before data") %>%
  gf_refine(scale_x_continuous(breaks = (0:5) / 5))
```

---

## First Toss: H

* `$\theta = 0$` is no longer possible, credibility is now 0.  
* Larger `$\theta$` values more credible than smaller ones.
* Sum of all credibilities should remain 1 (100%).
* We'll learn how to compute the new credibilities later.

---

## Prior and Posterior

* Statisticians don't call these distributions of credibility "before" and "after".
* Use the longer words **prior** and **posterior**, which mean the same thing.

---

## Second Toss: Another H

Now suppose we toss the coin again and get another head.

Once again we can update the credibility.

---

## Toss 3: A Tail

Time for a third toss.  This time we obtain a tail.

* Credibility of `$\theta = 1$` drops to 0 
 * Relative credibilities of the smaller values of `$\theta$` will increase 
 * Relative credibilities of the larger values of `$\theta$` will decrease.

---

## Fourth Toss: Another Tail

Finally, we flip one more tail.

As expected, the posterior is now symmetric with the two central values 
of `$\theta$` having the larger credibility.

---

## More and more data

* Each coin toss provides a bit more information with which to update 
the posterior,

* which becomes our new prior for subsequent data.

<br>

Take a look at the *Garden of Forking Paths* in *Statstical Rethinking*
for another way to look at basically this same example.

---

## BernGrid()

The `BernGrid()` function in the `CalvinBayes`
package makes it easy to generate plots similar to the ones above.

```r
BernGrid("HHTTTHTTT",                        # the data
            steps = TRUE,                    # show each step
            p = c(0, 0.2, 0.4, 0.6, 0.8, 1)  # possible probabilities
)
```

---

# Example 1A: Freedom of choice

---

## Possible parameter values?

* Usually not given a small number of possible values for a parameter.
* `$\theta$` could be any value between 0 and 1.
* Bayesian updating happens essentially the same way.

```r
BernGrid("HHTTTHTTT", steps = TRUE)
```

---

# Detour: Density

---

## Probability Density Functions (pdfs)

Density functions have the properties that

* they are never negative, and 
  * the total area under the curve is 1.

Where the density curve is taller, values are more likely/credible.

<img src="Stat341-Day01_files/figure-html/ch02-coins-cred-08a-1.png" style="display: block; margin: auto;" />
  
So in the last
posterior credibility above, we see that values near 1/3 are the most credible
while values below 0.15 or above 0.65 are not very credible.

---

## Where do densities come from?

Many of the densities in this course will arise computationally.

* We won't have formulas or names for these
 * Often all we will have is random samples from the distribution
 
We will also encounter densities with names

* "normal", "beta", "t", etc.
  * `ggformula::gf_dist()` can be used to plot named distributions.
  
---

## Beta distributions

The curves in our coins example above look a lot like beta distributions.
In fact, we will eventually learn that they are beta distributions, and that
each new observed coin toss increases either `shape1` or `shape2` by 1.

```r
gf_dist("beta", shape1 = 1, shape2 = 1, color = "gray50") %>%
gf_dist("beta", shape1 = 2, shape2 = 1, color = "red") %>%
gf_dist("beta", shape1 = 3, shape2 = 1, color = "orange") %>%
gf_dist("beta", shape1 = 3, shape2 = 2, color = "forestgreen") %>%
gf_dist("beta", shape1 = 3, shape2 = 3, color = "navy")
```

---

## Normal distributions

* bell-shaped, symmetric 
* centered at the mean `$\mu$`.
* a second paramter, the standard deviation `$\sigma$` quantifies spread

```r
gf_dist("norm", mean = 10, sd = 1, color = "steelblue") %>%
gf_dist("norm", mean = 10, sd = 2, color = "red")
```

The red curve is "twice as spread out" as the blue one.

---

# Example 2: Height vs Weight

---

## Height vs Weight

---

### Data

* Height and weight for sample of 40-year-old Americans.
* Source: NHANES study (`NHANES::NHANES`)

???

Different data set used in Kruschke

---

## Model for relationship between height and weight

A plausible model is that weight is *roughly* linearly related to height.

$$
\mbox{weight} \approx \beta_0 + \beta_1 \mbox{height}
$$

(More about `$\approx$` in just a moment.)

The simplest model makes two claims about the relationship between weight and height.
---

## Claim 1 is about averages

**Claim 1:** The **average weight** of people with height `$x$` is some linear function
of `$x$`. We can express this as

$$
\mu_{Y \mid x} = E(Y \mid x) =  \beta_0 + \beta_1 x 
$$
or

$$
\mu_{\mbox{weight} \mid \mbox{height}} = E(\mbox{weight} \mid \mbox{height}) =  
    \beta_0 + \beta_1 \cdot \mbox{height}
$$

**Notes**

* The capital `$Y$` indicates that it has a distribution.  
* `$x$` is lower case because we are imagining a specific value there.  
* So for each value of `$x$`, the there is a distribution of `$Y$`'s.
* Prefer longer names for specific examples; use `$x$` and `$Y$` for "generic examples".
* Statisticians like to use `$\beta_0$` and `$\beta_1$` for intercept and slope.

---

## Claim 2 is about distribution

**Claim 2:** But **not everyone is average**:

* weights of people with a given height are symmetrically distributed
around the average weight for that height.

* the distribution is normal (bell-shaped).

We express this as

$$
Y \mid x \sim {\sf Norm}(\mu_{Y|x}, \sigma)
$$

* `$\sim$` = "is distributed as" 
  
---

## All together now

Putting this all together, and being a little bit sloppy we might write it
this way:

`\begin{align*}
Y &\sim {\sf Norm}(\mu, \sigma) 
\\
\mu  &= \beta_0 + \beta_1 x
\end{align*}`

In this style the dependence of `$Y$` on `$x$` is implicit (via `$\mu$`'s dependence on
`$x$`) and we save writing `$\mid x$` in a few places.

This model has 3 **parameters**: `$\beta_0$`, `$\beta_1$`, `$\sigma$`.

* `$\mu$` can be avoided if we write this as

$$
Y \sim {\sf Norm}(\beta_0 + \beta_1 x, \sigma) 
$$

---

## Not done yet -- more distributions

A **prior distribution** describes what is known/believed about the parameters 
before we use the information from our data. For this example, so our full
model looks something like

`\begin{align*}
Y &\sim {\sf Norm}(\mu, \sigma) 
\\
\mu  &= \beta_0 + \beta_1 x
\\
\beta_0 & \sim \ ???
\\
\beta_1 & \sim \ ???
\\
\sigma & \sim \ ???
\end{align*}`

We'll be learning how to select reasonable priors for parameters as we go along.

---

## Posterior

For now we won't worry about how the posterior distribution is computed, but 
we can inspect it visually and numerically. (It is called `Post` in the R code below.)

```r
gf_density( ~ b_height,    data = Post, alpha = 0.5) |
gf_density( ~ b_Intercept, data = Post, alpha = 0.5) |
gf_density( ~ sigma,       data = Post, alpha = 0.5) 
```

```r
df_stats(~ b_height + b_Intercept + sigma, data = Post, mean, sd)
```

```
##      response        mean         sd
## 1    b_height    7.236983  0.9033625
## 2 b_Intercept -295.504391 61.3997226
## 3       sigma   43.324469  2.5384608
```

---

## Posterior

```r
hdi(Post, pars = "b_height")
```

```
##        par       lo       hi     mode prob
## 1 b_height 5.436719 9.021818 7.349414 0.95
```

```r
mcmc_areas(as.mcmc(Post), pars = "b_height", prob = 0.95)
```

---

## More Posterior Distributions

Posterior distributions for parameters provide posterior distributions for 
other things as well.

* Posterior distribution for the lines 
(using `$\beta_0$`, and `$\beta_1$`)

* Posterior distribution for the weight of a person who is 6 feet tall
(using `$\beta_0$`, `$\beta_1$`, `$\sigma$` and normal distributions)

* etc.

Typically we learn about these via **posterior sampling** -- randomly selecting
values from the posterior distribution for the parameters and combining them
to get what we are interested in.

---

## Posterior Lines

* Each line represents a plausible (according to the posterior distribution)
combination of slope and intercept.

* 100 such lines are included in the plot below.

* Note: These lines represent what we know about **average** weights, and many people are not 
 average.
 
---

## Posterior Predictive Check

A posterior predictive check is a way of checking that the data look like
they could have been plausibly generated by our model.

To generate a simulated weight for a given height:

- select a value for `$x$` (height).
 - randomly select `$\beta_0$`, `$\beta_1$`, `$\sigma$` from posterior.
 - use `$\beta_0$` and `$\beta_1$` to generate average weight.
 - use using the normal distribution (ie, `$\sigma$`) to generate a difference between
an individual weight and the average weight.

---

## Posterior Predictive Check -- Computation

For example, here are the first two rows of our posterior distribution:

```r
Post %>% head(2)
```

```
##   b_Intercept b_height    sigma
## 1   -226.2681 6.205661 39.30596
## 2   -239.8352 6.495962 42.08519
```

Simulated weight:

* row 1: select from `${\sf Norm}(-226.2681 + 6.205661 \cdot 65, 39.30596)$` 
 * row 2: select from `${\sf Norm}(-239.8352 + 6.495962 \cdot 65, 42.08519)$`

---

## Posterior Predictive Check -- Computation

Simulated weight:

* row 1: select from `${\sf Norm}(-226.2681 + 6.205661 \cdot 65, 39.30596)$` 
 * row 2: select from `${\sf Norm}(-239.8352 + 6.495962 \cdot 65, 42.08519)$`

```r
Post %>% head(2) %>%
  mutate(pred_y = rnorm(2, mean = b_Intercept + b_height * 65, sd = sigma))
```

```
##   b_Intercept b_height    sigma   pred_y
## 1   -226.2681 6.205661 39.30596 162.9508
## 2   -239.8352 6.495962 42.08519 122.4926
```

---

## Posterior Predictive Check -- Plot

If we do this many times for several height values and plot the central
95% of the weights, we get a plot that looks like this:

By taking many different credible values of the parameters (including `$\sigma$`), 
we are taking into account

* our uncertainty about the parameter values, and
 * the variability the model describes in the population (even for given
parameter values).

---

## Where do we go from here?

Now that we have seen an overview of Bayesian inference at work, you probably have 
lots of questions.  Over time we will improve our answers to each of them.

1. How do we create models?

2. How do we select priors?

3. How do we update the prior based on data to get the posterior?

4. How do we tell whether the algorithm that generated the posterior worked well?

5. How do we tell whether the model worked well?

6. What can we do with the posterior once we have it?