class: center, middle, inverse, title-slide

# Welcome to Stat 341
## Day 1
### Randall Pruim

---
class: inverse, center, middle

# What is Computational Bayesian Statistics?

---

## Computational Bayesian Statistics

1. **Computational** -- It takes advantage of some very clever algorithms.

    * We'll start with simpler methods and work up to more complex methods.
    * Simpler methods will be chosen to help prepare us, but those methods don't scale.

--

2. **Bayesian** -- An interesting framework for thinking about statistical modeling.

    * The fundamental logic of Bayesian statistics is different from the frequentist approach you may have seen before.
    * Interesting for both practical and philosophical reasons.

--

3. **Statistics** -- A very flexible system for creating statistical models.

    * Limited only by our imagination and our computational tools.
    * What you learn about modeling here will help even when building non-Bayesian models.

---

## Designing Models

Quotes from McElreath

> What researchers need is some unified theory of golem [model] engineering, a set of principles for designing, building, and refining special-purpose statistical procedures... There are benefits in rethinking statistical inference as a set of strategies, instead of as a set of pre-made tools.

> If you don't understand how the golem processes information, then you can't interpret the golem's output.

---
class: inverse, center, middle

## Countdown to Bayesian Statistics

---

## 4 Tools

**1.** Bayesian data analysis

> Once we have defined the statistical model, Bayesian data analysis forces a purely logical way of processing the data to produce inference.

--

**2.** Model comparison and prediction

> Prefer models that make good predictions.

--

> Fitting is easy; prediction is hard.

--

**3.** Multi-level models

> Multilevel regression deserves to be the default form of regression.

--

**4.** Graphical causal models

> Models that are causally incorrect can make better predictions than models that are causally correct.

---

## 3 Challenges

**1.** Understanding the "Bayesian framework"

  * prior, likelihood, posterior

--

**2.** Understanding modeling

  * model = pattern + variability
  * parameters and distributions

--

**3.** Understanding the computation

  * wrangle data, describe model, process model, visualization
  * understanding how and why algorithms can fail or perform badly

--

*Be sure to focus on understanding things well early on so we can build on a solid foundation later.*

---

## 2 Examples

**1.** Mystery Coins

**2.** Modeling Height

<br>

We'll get back to these shortly.

---

## 1 Assignment

**Problem Set 1** is due on Friday.

* Most homework will be submitted via [**Gradescope**](https://gradescope.com)
    * Submit your work as a PDF
    * Tag each problem so it is easy to find for grading
* PS 1 is a bit different
    * `ggformula` tutorial (may be familiar to some of you)
    * Post a plot to a Google presentation (challenge yourself a bit if you are already familiar with R and `ggformula`)
    * Fill out the pre-course survey if you haven't already done so.
* Details are on the [homework](../../hw) page.

---

## Some notes on R code

* Practical Bayesian data analysis is not possible without computational tools.
* Our main tools: R and Stan (other options exist).
* Don't focus on coding details today.
    * Sometimes I won't even show the code.
    * But in the future, be sure you understand the computation!
* We'll learn how to do all this (and more) in R as we go along.
* Visualization and data wrangling will be important parts of the process.
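---

## A first taste of ggformula

Since PS 1 includes a `ggformula` tutorial, here is a small sketch of its formula-style plotting interface. (This plot is just an illustration, not part of the problem set; the data come from the same NHANES package we use in Example 2 below.)

```r
library(ggformula)
library(NHANES)  # data package also used in Example 2

# gf_point(y ~ x, data = ...) makes a scatterplot
gf_point(Weight ~ Height, data = NHANES, alpha = 0.2) %>%
  gf_labs(title = "NHANES: weight (kg) vs height (cm)")
```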
---

## R packages

We will make use of a number of R packages as we go along. If you try to mimic the code on your own machine, you will need to use some of these packages. But for today, just focus on the big ideas, not the code details.

```r
library(ggformula)    # for creating plots
library(patchwork)    # for combining multiple plots
library(dplyr)        # for data wrangling
library(mosaic)       # includes the previous 2 (and some other stuff)
library(pander)       # for pretty tables
library(CalvinBayes)  # includes BernGrid()
library(coda)         # tools for working with posterior samples
library(bayesplot)    # plots of Bayesian models
library(brms)         # used to fit the model in the second example,
                      # but hidden from view here

theme_set(theme_bw()) # change the default graphics settings
```

---
class: inverse, middle, center

# Example 1: Which coin is it?

---

## Situation

Here is a toy situation, just to get us started.

* A coin is known to come up heads on either 0, 20, 40, 60, 80, or 100% of tosses.
* We don't know which probability is correct.
* Plan: Gather data by flipping the coin and recording the results.
* Goal: Quantify what we know about which coin it might be.

---

## Notation

* Let `\(\theta\)` be the true probability of tossing a head.
* `\(\theta\)` is the only **parameter** in this simple model.
* 6 possibilities: `\(\theta = 0\)`, `\(\theta = 0.2\)`, `\(\theta = 0.4\)`, `\(\theta = 0.6\)`, `\(\theta = 0.8\)`, and `\(\theta = 1\)`.

---

## Credibility/Plausibility

Before collecting our data, if we have no other information, we will consider each coin to be equally credible/plausible. We could represent that as follows.

<img src="Stat341-Day01_files/figure-html/ch02-coin-cred-01-1.png" style="display: block; margin: auto;" />

---

## First Toss: H

Now suppose we toss the coin and obtain a head. What does that do to our credibilities?

```r
Coins1 <- tibble(theta = (0:5) / 5, credibility = 0.2)
gf_col(credibility ~ theta, data = Coins1, fill = "steelblue", alpha = 0.5) %>%
  gf_labs(title = "Credibility before data") %>%
  gf_refine(scale_x_continuous(breaks = (0:5) / 5))
```

<img src="Stat341-Day01_files/figure-html/ch02-coin-cred-01a-1.png" style="display: block; margin: auto;" />

---

## First Toss: H

* `\(\theta = 0\)` is no longer possible; its credibility is now 0.
* Larger `\(\theta\)` values are more credible than smaller ones.
* The sum of all credibilities should remain 1 (100%).
* We'll learn how to compute the new credibilities later.

<img src="Stat341-Day01_files/figure-html/ch02-coins-cred-02-1.png" style="display: block; margin: auto;" />

---

## Prior and Posterior

* Statisticians don't call these distributions of credibility "before" and "after".
* They use the terms **prior** and **posterior**, which mean the same thing.

<img src="Stat341-Day01_files/figure-html/ch02-coins-cred-03-1.png" style="display: block; margin: auto;" />

---

## Second Toss: Another H

Now suppose we toss the coin again and get another head. Once again we can update the credibility.

<img src="Stat341-Day01_files/figure-html/ch02-coins-cred-04-1.png" style="display: block; margin: auto;" />

---

## Toss 3: A Tail

Time for a third toss. This time we obtain a tail.

* The credibility of `\(\theta = 1\)` drops to 0.
* The relative credibilities of the smaller values of `\(\theta\)` will increase.
* The relative credibilities of the larger values of `\(\theta\)` will decrease.

<img src="Stat341-Day01_files/figure-html/ch02-coins-cred-05-1.png" style="display: block; margin: auto;" />
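---

## How the updating works (a peek ahead)

We'll derive the update rule properly later. For now, here is a minimal sketch of the computation behind the plots so far, for the three tosses H, H, T:

```r
theta <- (0:5) / 5                  # the six candidate coins
prior <- rep(0.2, 6)                # equal credibility before any data
likelihood <- theta^2 * (1 - theta) # probability of observing H, H, T for each theta
posterior <- prior * likelihood / sum(prior * likelihood)
round(posterior, 3)
#> [1] 0.00 0.08 0.24 0.36 0.32 0.00
```

Larger values of `\(\theta\)` are more credible, but `\(\theta = 0\)` and `\(\theta = 1\)` are ruled out -- matching the plot on the previous slide.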
<img src="Stat341-Day01_files/figure-html/ch02-coins-cred-06-1.png" style="display: block; margin: auto;" /> As expected, the posterior is now symmetric with the two central values of `\(\theta\)` having the larger credibility. --- ## More and more data * Each coin toss provides a bit more information with which to update the posterior, * which becomes our new prior for subsequent data. <br> Take a look at the *Garden of Forking Paths* in *Statstical Rethinking* for another way to look at basically this same example. --- ## BernGrid() The `BernGrid()` function in the `CalvinBayes` package makes it easy to generate plots similar to the ones above. ```r BernGrid("HHTTTHTTT", # the data steps = TRUE, # show each step p = c(0, 0.2, 0.4, 0.6, 0.8, 1) # possible probabilities ) ``` <img src="Stat341-Day01_files/figure-html/ch02-coins-cred-07-1.png" style="display: block; margin: auto;" /> --- class: inverse, middle, center # Example 1A: Freedom of choice --- ## Possible parameter values? * Usually not given a small number of possible values for a parameter. * `\(\theta\)` could be any value between 0 and 1. * Bayesian updating happens essentially the same way. ```r BernGrid("HHTTTHTTT", steps = TRUE) ``` <img src="Stat341-Day01_files/figure-html/ch02-coins-cred-08-1.png" style="display: block; margin: auto;" /> --- class: center, middle, inverse # Detour: Density --- ## Probability Density Functions (pdfs) <!-- The (prior and posterior) distributions in the previous plots were --> <!-- calculated numerically using a Bayesian update rule that we will soon learn. --> Density functions have the properties that * they are never negative, and * the total area under the curve is 1. Where the density curve is taller, values are more likely/credible. <img src="Stat341-Day01_files/figure-html/ch02-coins-cred-08a-1.png" style="display: block; margin: auto;" /> So in the last posterior credibility above, we see that values near 1/3 are the most credible while values below 0.15 or above 0.65 are not very credible. --- ## Where do densities come from? Many of the densities in this course will arise computationally. * We won't have formulas or names for these * Often all we will have is random samples from the distribution We will also encounter densities with names * "normal", "beta", "t", etc. * `ggformula::gf_dist()` can be used to plot named distributions. --- ## Beta distributions The curves in our coins example above look a lot like beta distributions. In fact, we will eventually learn that they are beta distributions, and that each new observed coin toss increases either `shape1` or `shape2` by 1. ```r gf_dist("beta", shape1 = 1, shape2 = 1, color = "gray50") %>% gf_dist("beta", shape1 = 2, shape2 = 1, color = "red") %>% gf_dist("beta", shape1 = 3, shape2 = 1, color = "orange") %>% gf_dist("beta", shape1 = 3, shape2 = 2, color = "forestgreen") %>% gf_dist("beta", shape1 = 3, shape2 = 3, color = "navy") ``` <img src="Stat341-Day01_files/figure-html/ch02-beta-01-1.png" style="display: block; margin: auto;" /> --- ## Normal distributions * bell-shaped, symmetric * centered at the mean `\(\mu\)`. * a second paramter, the standard deviation `\(\sigma\)` quantifies spread ```r gf_dist("norm", mean = 10, sd = 1, color = "steelblue") %>% gf_dist("norm", mean = 10, sd = 2, color = "red") ``` <img src="Stat341-Day01_files/figure-html/ch02-norm-01-1.png" style="display: block; margin: auto;" /> The red curve is "twice as spread out" as the blue one. 
---
class: center, middle, inverse

# Example 2: Height vs Weight

---

## Height vs Weight

### Data

* Height and weight for a sample of 40-year-old Americans.
* Source: NHANES study (`NHANES::NHANES`)

<img src="Stat341-Day01_files/figure-html/ch02-nhanes-02-1.png" style="display: block; margin: auto;" />

???

A different data set is used in Kruschke.

---

## Model for relationship between height and weight

A plausible model is that weight is *roughly* linearly related to height:

$$
\mbox{weight} \approx \beta_0 + \beta_1 \mbox{height}
$$

(More about `\(\approx\)` in just a moment.)

The simplest model makes two claims about the relationship between weight and height.

---

## Claim 1 is about averages

**Claim 1:** The **average weight** of people with height `\(x\)` is some linear function of `\(x\)`.

We can express this as

$$
\mu_{Y \mid x} = E(Y \mid x) = \beta_0 + \beta_1 x
$$

or

$$
\mu_{\mbox{weight} \mid \mbox{height}} = E(\mbox{weight} \mid \mbox{height}) = \beta_0 + \beta_1 \cdot \mbox{height}
$$

--

**Notes**

* The capital `\(Y\)` indicates that it has a distribution.
* `\(x\)` is lower case because we are imagining a specific value there.
* So for each value of `\(x\)`, there is a distribution of `\(Y\)`'s.
* Prefer longer names for specific examples; use `\(x\)` and `\(Y\)` for "generic examples".
* Statisticians like to use `\(\beta_0\)` and `\(\beta_1\)` for intercept and slope.

---

## Claim 2 is about distribution

**Claim 2:** But **not everyone is average**:

* weights of people with a given height are symmetrically distributed around the average weight for that height.
* the distribution is normal (bell-shaped).

We express this as

$$
Y \mid x \sim {\sf Norm}(\mu_{Y \mid x}, \sigma)
$$

* `\(\sim\)` = "is distributed as"

---

## All together now

Putting this all together, and being a little bit sloppy, we might write it this way:

`\begin{align*}
Y &\sim {\sf Norm}(\mu, \sigma) \\
\mu &= \beta_0 + \beta_1 x
\end{align*}`

In this style the dependence of `\(Y\)` on `\(x\)` is implicit (via `\(\mu\)`'s dependence on `\(x\)`), and we save writing `\(\mid x\)` in a few places.

This model has 3 **parameters**: `\(\beta_0\)`, `\(\beta_1\)`, `\(\sigma\)`.

* `\(\mu\)` can be avoided if we write this as

$$
Y \sim {\sf Norm}(\beta_0 + \beta_1 x, \sigma)
$$

---

## Not done yet -- more distributions

A **prior distribution** describes what is known/believed about the parameters before we use the information from our data.

For this example, our full model looks something like

`\begin{align*}
Y &\sim {\sf Norm}(\mu, \sigma) \\
\mu &= \beta_0 + \beta_1 x \\
\beta_0 & \sim \ ??? \\
\beta_1 & \sim \ ??? \\
\sigma & \sim \ ???
\end{align*}`

We'll be learning how to select reasonable priors for parameters as we go along.

<!-- - we use very flat broad priors -->
<!-- - centered at 0 for the `\(\beta\)`'s -->
<!-- - extending from 0 to a very large number for `\(\sigma\)` -->
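---

## Aside: what fitting this might look like

Just for orientation -- this is *not* the exact code used for these slides. A minimal sketch of fitting such a model with `brms`, with placeholder priors and a hypothetical data frame name (`Adults`):

```r
library(brms)

model <- brm(
  weight ~ height, data = Adults,  # Adults: hypothetical data frame
  family = gaussian(),
  prior = c(
    prior(normal(0, 500), class = Intercept),  # beta_0: placeholder prior
    prior(normal(0, 100), class = b),          # beta_1: placeholder prior
    prior(exponential(0.1), class = sigma)     # sigma > 0: placeholder prior
  )
)
```

We'll return to `brms`, and to choosing priors carefully, later in the course.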
---

## Posterior

For now we won't worry about how the posterior distribution is computed, but we can inspect it visually and numerically. (It is called `Post` in the R code below.)

<!-- For example, if we are primarily interested in -->
<!-- the slope (how much heavier are people on average for each inch they are taller?), -->
<!-- we can plot the posterior distribution of `\(\beta_1\)` or calculate its mean, -->
<!-- or the region containing the central 95% of the distribution. -->
<!-- Such a region is called a **highest density interval** (HDI) -->
<!-- (sometimes called the highest posterior density interval (HPDI), to emphasize -->
<!-- that we are looking at a posterior distribution, but an HDI can be computed for -->
<!-- other distributions as well). -->

```r
gf_density( ~ b_height, data = Post, alpha = 0.5) |
  gf_density( ~ b_Intercept, data = Post, alpha = 0.5) |
  gf_density( ~ sigma, data = Post, alpha = 0.5)
```

<img src="Stat341-Day01_files/figure-html/ch02-nhanes-post-01a-1.png" style="display: block; margin: auto;" />

```r
df_stats(~ b_height + b_Intercept + sigma, data = Post, mean, sd)
```

```
##      response        mean         sd
## 1    b_height    7.236983  0.9033625
## 2 b_Intercept -295.504391 61.3997226
## 3       sigma   43.324469  2.5384608
```

---

## Posterior

```r
hdi(Post, pars = "b_height")
```

```
##        par       lo       hi     mode prob
## 1 b_height 5.436719 9.021818 7.349414 0.95
```

```r
mcmc_areas(as.mcmc(Post), pars = "b_height", prob = 0.95)
```

<img src="Stat341-Day01_files/figure-html/ch02-nhanes-post-01c-1.png" style="display: block; margin: auto;" />

<!-- Although we don't get a very precise estimate of `\(\beta_1\)` from this model/data -->
<!-- combination, we can be quite confident that taller people are indeed -->
<!-- heavier (on average), somewhere between 5 and 10 pounds heavier per inch taller. -->

---

## More Posterior Distributions

Posterior distributions for parameters provide posterior distributions for other things as well.

* Posterior distribution for the lines (using `\(\beta_0\)` and `\(\beta_1\)`)
* Posterior distribution for the weight of a person who is 6 feet tall (using `\(\beta_0\)`, `\(\beta_1\)`, `\(\sigma\)`, and normal distributions)
* etc.

Typically we learn about these via **posterior sampling** -- randomly selecting values from the posterior distribution for the parameters and combining them to get what we are interested in.

---

## Posterior Lines

* Each line represents a plausible (according to the posterior distribution) combination of slope and intercept.
* 100 such lines are included in the plot below.

<img src="Stat341-Day01_files/figure-html/ch02-nhanes-lines-01-1.png" style="display: block; margin: auto;" />

* Note: These lines represent what we know about **average** weights, and many people are not average.
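---

## Posterior Lines -- a sketch

A minimal sketch of how such a plot could be made (not necessarily the code behind the figure above; `Adults` is again a hypothetical name for the plotted data):

```r
# Scatterplot of the data plus the lines implied by 100 posterior draws.
gf_point(weight ~ height, data = Adults, alpha = 0.3) %>%
  gf_abline(
    slope = ~ b_height, intercept = ~ b_Intercept,  # one line per posterior row
    data = Post %>% head(100),
    alpha = 0.2, color = "steelblue"
  )
```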
---

## Posterior Predictive Check

A posterior predictive check is a way of checking that the data look like they could plausibly have been generated by our model.

To generate a simulated weight for a given height:

* select a value for `\(x\)` (height),
* randomly select `\(\beta_0\)`, `\(\beta_1\)`, `\(\sigma\)` from the posterior,
* use `\(\beta_0\)` and `\(\beta_1\)` to generate an average weight,
* use the normal distribution (i.e., `\(\sigma\)`) to generate a difference between an individual weight and the average weight.

---

## Posterior Predictive Check -- Computation

For example, here are the first two rows of our posterior distribution:

```r
Post %>% head(2)
```

```
##   b_Intercept b_height    sigma
## 1   -226.2681 6.205661 39.30596
## 2   -239.8352 6.495962 42.08519
```

Simulated weight (for a height of 65 inches):

* row 1: select from `\({\sf Norm}(-226.2681 + 6.205661 \cdot 65, 39.30596)\)`
* row 2: select from `\({\sf Norm}(-239.8352 + 6.495962 \cdot 65, 42.08519)\)`

---

## Posterior Predictive Check -- Computation

Simulated weight:

* row 1: select from `\({\sf Norm}(-226.2681 + 6.205661 \cdot 65, 39.30596)\)`
* row 2: select from `\({\sf Norm}(-239.8352 + 6.495962 \cdot 65, 42.08519)\)`

```r
Post %>% head(2) %>%
  mutate(pred_y = rnorm(2, mean = b_Intercept + b_height * 65, sd = sigma))
```

```
##   b_Intercept b_height    sigma   pred_y
## 1   -226.2681 6.205661 39.30596 162.9508
## 2   -239.8352 6.495962 42.08519 122.4926
```

---

## Posterior Predictive Check -- Plot

If we do this many times for several height values and plot the central 95% of the weights, we get a plot that looks like this:

<img src="Stat341-Day01_files/figure-html/ch02-nhanes-post-06-1.png" style="display: block; margin: auto;" />

<!-- Now we see that indeed, most (but not all) of the data points fall -->
<!-- within a range that the model believes is credible. If this were not the -->
<!-- case, it would be evidence that our model is not well aligned with the data -->
<!-- and might lead us to explore other models. -->

By taking many different credible values of the parameters (including `\(\sigma\)`), we are taking into account

* our uncertainty about the parameter values, and
* the variability the model describes in the population (even for given parameter values).

---

## Where do we go from here?

Now that we have seen an overview of Bayesian inference at work, you probably have lots of questions. Over time we will improve our answers to each of them.

--

1. How do we create models?
2. How do we select priors?
3. How do we update the prior based on data to get the posterior?
4. How do we tell whether the algorithm that generated the posterior worked well?
5. How do we tell whether the model worked well?
6. What can we do with the posterior once we have it?
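---

## Appendix: Simulating the band (sketch)

For reference, a hedged sketch of how the 95% band in the posterior predictive plot could be simulated. The height grid and quantiles here are illustrative, not the exact values used for the figure:

```r
library(dplyr)

heights <- seq(58, 78, by = 2)  # an illustrative grid of heights (inches)

# One simulated weight per (posterior row, height) pair.
Sim <-
  expand.grid(row = 1:nrow(Post), height = heights) %>%
  mutate(
    pred_weight = rnorm(
      n(),
      mean = Post$b_Intercept[row] + Post$b_height[row] * height,
      sd   = Post$sigma[row]
    )
  )

# Central 95% of simulated weights at each height.
Sim %>%
  group_by(height) %>%
  summarise(
    lo = quantile(pred_weight, 0.025),
    hi = quantile(pred_weight, 0.975)
  )
```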