Goals for today

  1. Design and fit a simple linear model.

  2. Use it to create this plot.

  3. Start thinking about why multiple regression will be important.

Interestingly, this model is quite confident that there is a positive association between the number of waffle houses and divorce rates. Does building a waffle house cause people to get divorced? That doesn’t sound right. So what explains this association? We’ll come back to this in a bit, but first, let’s see how the model depicted above was designed and fit.

Waffles and Divorce: Designing a simple linear model

The details of this model are not presented in the text, so we get to design it ourselves. Here is the outline of our model:

Outline of a model

\[\begin{align*} D &\sim {\sf Norm}(\mu, \sigma) \\ \mu &= \beta_0 + \beta_1 W \\ \beta_0 &\sim ?? \\ \beta_1 &\sim ?? \\ \sigma &\sim ?? \\ \end{align*}\]

where \(D\) and \(W\) are the standardized divorce rate and number of waffle houses per million people.

Priors

Now we need some priors.

  • \(\beta_0\) is the average value of \(D\) when \(W\) is average. Since \(D\) is standardized, this is likely to be pretty close to 0. So

\[\beta_0 \sim {\sf Norm}(0, 0.2)\]

  • You may recall from another course that the slope of the least squares regression line is \(r \frac{s_y}{s_x}\). Since \(-1 \le r \le 1\) and both variables have been standardized to have standard deviation of 1, the slope should be between -1 and 1.

\[\beta_1 \sim {\sf Norm}(0, 0.3)\]

  • The standard deviation of the standardized divorce rate is 1. The standard deviation of divorce rates among states with a given waffle house density (the residual standard deviation \(\sigma\)) is likely at least somewhat smaller. (It would be a lot smaller if waffle houses were a strong predictor of divorce rate.) An exponential distribution with a mean of 1/2 might be a reasonable prior.

\[\sigma \sim {\sf Exp}(2)\]

Notice how standardization helped us design each of these priors.

\[\begin{align*} D &\sim {\sf Norm}(\mu, \sigma) \\ \mu &= \beta_0 + \beta_1 W \\ \beta_0 &\sim {\sf Norm}(0, 0.2) \\ \beta_1 &\sim {\sf Norm}(0, 0.3) \\ \sigma &\sim {\sf Exp}(2) \\ \end{align*}\]
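
Before simulating whole lines, a quick way to see what ranges these priors allow is to look at their middle 95%. Here is a small check (our choice of quantiles) using base R quantile functions:

# Middle 95% of each prior (all on the standardized scale)
qnorm(c(0.025, 0.975), mean = 0, sd = 0.2)  # beta_0: roughly -0.4 to 0.4
qnorm(c(0.025, 0.975), mean = 0, sd = 0.3)  # beta_1: roughly -0.6 to 0.6
qexp(c(0.025, 0.975), rate = 2)             # sigma: roughly 0.01 to 1.8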

Checking priors with prior predictive modeling

We can do some prior predictive sampling to make sure that these seem reasonable (and make any adjustments that might be necessary).

# 200 draws from each prior
PriorSamples <-
  tibble(
    beta_0 = rnorm(200, 0, 0.2),
    beta_1 = rnorm(200, 0, 0.3),
    sigma  = rexp(200, 2)
  )

# one line per prior sample; dotted red lines mark slopes of -1 and 1
gf_abline(slope = ~ beta_1, intercept = ~ beta_0, 
            data = PriorSamples, alpha = 0.2) %>%
  gf_lims(x = c(-3, 3), y = c(-3, 3)) %>%
  gf_abline(slope = ~ c(-1, 1), intercept = ~ c(0, 0), 
            data = NA,
            inherit = FALSE,
            color = "red", linetype = "dotted")

The steepest lines shown exhibit nearly perfect correlation between divorce rates and waffle houses, but most of them have modest positive or negative slopes. So our prior appears to be giving a reasonable range of plausible lines.

As it turns out, we can also extract prior samples from a quap model, just as easily as posterior samples.

data(WaffleDivorce)
Waffles <-
  WaffleDivorce %>%
  mutate(
    # Population is in millions, so this is Waffle Houses per million people
    WaffleHousesPerCap = WaffleHouses / Population,
    # standardized versions of the variables used in our models
    D = standardize(Divorce),
    W = standardize(WaffleHousesPerCap),
    A = standardize(MedianAgeMarriage),
    M = standardize(Marriage)
  )
   
m5.0 <- quap(
  data = Waffles,
  alist(
    D ~ dnorm(mu, sigma),
    mu <- beta_0 + beta_1 * W,
    beta_0 ~ dnorm(0, 0.2),
    beta_1 ~ dnorm(0, 0.3),
    sigma ~ dexp(2)
  )
)
PriorSamples2 <- m5.0 %>% extract.prior(n = 200) %>% as.data.frame()

gf_abline(slope = ~ beta_1, intercept = ~beta_0, 
            data = PriorSamples2, alpha = 0.2) %>%
  gf_lims(x = c(-3, 3), y = c(-3, 3)) %>%
  gf_abline(slope = ~ c(-1, 1), intercept = ~c(0,0), 
            data = NA,
            inherit = FALSE,
            color = "red", linetype = "dotted")

Dealing with standardization

It would be easier to think about those lines on the natural scale. We can convert the intercept \(\beta_0\) just like any other value of the response; since \(W\) is centered, this gives the height of the line at the average waffle house density, which is fine for a rough prior check. (Below we will make this conversion simpler with unstandardize().) The slope \(\beta_1\) needs to be multiplied by the ratio of the natural-scale standard deviations.
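
To see where that ratio comes from, write the standardized model in terms of the natural-scale variables (bars denote means, \(s\) denotes a standard deviation, and \(\mathrm{Waffle}\) stands for the waffle house density):

\[\begin{align*} \frac{\mathrm{Divorce} - \overline{\mathrm{Divorce}}}{s_{\mathrm{Divorce}}} &= \beta_0 + \beta_1 \, \frac{\mathrm{Waffle} - \overline{\mathrm{Waffle}}}{s_{\mathrm{Waffle}}} \\ \mathrm{Divorce} &= \overline{\mathrm{Divorce}} + s_{\mathrm{Divorce}} \, \beta_0 + \beta_1 \, \frac{s_{\mathrm{Divorce}}}{s_{\mathrm{Waffle}}} \left( \mathrm{Waffle} - \overline{\mathrm{Waffle}} \right) \end{align*}\]

So the slope on the natural scale is \(\beta_1 \, s_{\mathrm{Divorce}} / s_{\mathrm{Waffle}}\), and \(\overline{\mathrm{Divorce}} + s_{\mathrm{Divorce}} \, \beta_0\) is the height of the line at the average waffle house density.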

gf_abline(
  slope = ~ beta_1 * sd(Waffles$Divorce) / sd(Waffles$WaffleHousesPerCap), 
  intercept = ~ mean(Waffles$Divorce) + beta_0 * sd(Waffles$Divorce), 
  data = PriorSamples, alpha = 0.2) %>%
  gf_lims(y = c(0, 20), x = c(0, 50))

Looks like all of the divorce rates stay in a plausible range (none are negative, for example) over the range of waffle house densities seen in our data. So our priors seem reasonable enough.

Looking at and improving standardize() and unstandardize()

rethinking::standardize
## function (x) 
## {
##     x <- scale(x)
##     z <- as.numeric(x)
##     attr(z, "scaled:center") <- attr(x, "scaled:center")
##     attr(z, "scaled:scale") <- attr(x, "scaled:scale")
##     return(z)
## }
## <bytecode: 0x7fd4b6697a20>
## <environment: namespace:rethinking>
rethinking::unstandardize
## function (x) 
## {
##     scale <- attr(x, "scaled:scale")
##     center <- attr(x, "scaled:center")
##     z <- x * scale + center
##     return(as.numeric(z))
## }
## <bytecode: 0x7fd4b659fe80>
## <environment: namespace:rethinking>

That’s slick: the mean and standard deviation get stored (as “scaled:center” and “scaled:scale”), so we can access them later when unstandardizing.
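
For example, the standardized divorce rate we created above still carries the original mean and standard deviation as attributes:

attr(Waffles$D, "scaled:center")  # mean of the original divorce rates
attr(Waffles$D, "scaled:scale")   # sd of the original divorce rates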

But what if we want to standardize or unstandardize new data (like a sequence of points along the range of our data)? With a little bit of adjustment, we can make that possible.

standardize <- 
  function (
    x, ref = x, center = mean(ref), scale = sd(ref)
  ) 
  {
    x <- scale(x, center = center, scale = scale)
    z <- as.numeric(x)
    attr(z, "scaled:center") <- attr(x, "scaled:center")
    attr(z, "scaled:scale") <- attr(x, "scaled:scale")
    z
  }

unstandardize <- 
  function (
    x,  ref = x, 
    scale = attr(ref, "scaled:scale"), 
    center = attr(ref, "scaled:center")
  ) 
  {
    res <- x * scale + center
    as.numeric(res)
  }
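
As a quick illustration (the grid values below are made up), the new ref argument lets us standardize new waffle house densities using the center and scale of our data, and the stored attributes let us undo it again:

w_grid <- seq(0, 50, by = 10)                             # hypothetical new values
w_std  <- standardize(w_grid, ref = Waffles$WaffleHousesPerCap)
unstandardize(w_std)                                      # recovers w_grid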

This improvement only helps with our intercept here, but it will be useful for other things shortly.

gf_abline(
  slope = ~ beta_1 * sd(Waffles$Divorce) / sd(Waffles$WaffleHousesPerCap), 
  intercept = ~ unstandardize(beta_0, Waffles$D),
  data = PriorSamples, alpha = 0.2) %>%
  gf_lims(y = c(0, 20), x = c(0, 50))

Inspecting the Posterior

Posterior distribution of slope

WafflePost <- m5.0 %>% extract.samples(n = 1e4)
gf_dens( ~ beta_1, data = WafflePost)

prop( ~ (beta_1 < 0), data = WafflePost)
## prop_TRUE 
##    0.0041
tally( ~ (beta_1 < 0), data = WafflePost, format = "prop")
## (beta_1 < 0)
##   TRUE  FALSE 
## 0.0041 0.9959
precis(m5.0)
##                 mean         sd       5.5%     94.5%
## beta_0 -6.136869e-07 0.10794351 -0.1725152 0.1725140
## beta_1  3.086196e-01 0.11929114  0.1179693 0.4992698
## sigma   9.066690e-01 0.08857643  0.7651068 1.0482313
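
If we want the posterior slope on the natural scale (additional divorces per 1000 marriages for each additional Waffle House per million people), we can rescale the posterior samples the same way we rescaled the prior lines. A small sketch:

NaturalSlope <- WafflePost$beta_1 * sd(Waffles$Divorce) / sd(Waffles$WaffleHousesPerCap)
quantile(NaturalSlope, c(0.055, 0.945))   # 89% interval, to match precis()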


Creating the plot

# posterior samples of mu (the average standardized divorce rate) for each state
Link5.0 <- link(m5.0)
# summarize each column with its mean and a 90% HDI, then attach the data
Avg5.0 <- apply(Link5.0, 2, mean_hdi, .width = 0.9) %>%
  bind_rows() %>%
  bind_cols(Waffles)
  
gf_point(Divorce ~ WaffleHousesPerCap, data = Waffles) %>%
  gf_ribbon(unstandardize(ymin, D) + unstandardize(ymax, D) ~ WaffleHousesPerCap, 
            data = Avg5.0, inherit = FALSE, fill = "steelblue") %>%
  gf_labs(
    caption = "blue band is 90% credible interval for average response",
    y = "Divorces per 1000 Marriages", 
    x = "Waffle Houses per Million People")


Multiple Regression: What? Why? How?

Simply put, multiple regression is regression with multiple explanatory variables. These go by other names as well, such as predictors, regressors, etc. One name I do not like is “independent variables”. Please do not use that term. Why? Because independence is an important statistical term that means something completely different, and predictors are typically not independent (in the statistical sense) of each other or of the response variable.

Our goal over the next several sessions will be to learn how to design, fit, and interpret multiple regression models. Much of this work would be the same even in a non-Bayesian class. So if you have seen multiple regression before, you will be adding to what you already know. And if you see multiple regression later, even in a non-Bayesian context, you will be able to apply things you are learning here. Of course, we will use the Bayesian framework for the regression models in this course.

It is the design and interpretation that will be the challenging part. These models are not technically much more challenging to fit than the models we have been using, and eventually we will learn some tools that make describing the model even easier (but come with a different cost – learning about how MCMC algorithms work and might fail).

Quote of the day (from Chapter 5 of McElreath):

Multiple regression can be worse than useless.

That sounds a bit scary, but all it is saying is what I just said: the hard part is not getting the computer to fit these models, it is designing and interpreting the models. And used without care, multiple regression models can lead us in the wrong direction.

What’s the goal?

There are at least two goals we could have for our model:

  1. Make predictions.

  2. Understand causal relationships.

A good model for one of these goals might not be the best model for the other goal. This is one of the little paradoxes of regression.