Goals for today

Today we will begin learning about multiple regression models. In particular, we will learn how to

  1. Design multiple regression models.

  2. Fit multiple regression models.

  3. Interpret multiple regression models.

We won’t learn everything about these three topics in one day, but we will make a start.

Of these three items, 2 is the most straightforward and least interesting. Fitting a multiple regression model is very similar to fitting a polynomial regression model. Instead of having terms like \(x\), \(x^2\), \(x^3\), etc., we will have different predictors. But the model engine can't really tell the difference between \(x^2\) and some new predictor.
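For example, to the model fitting engine these two forms for \(\mu\) look exactly the same; only our interpretation of the columns differs:

\[\begin{align*} \mu &= \beta_0 + \beta_1 x + \beta_2 x^2 & \text{(polynomial regression)}\\ \mu &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 & \text{(two different predictors)} \end{align*}\]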

So most of our time will be spent focusing on items 1 and 3.

The Waffle House data

For now, we are going to leave the Waffle House density and look at two other potential predictors of divorce rate: median age at marriage and marriage rate.

Simple scatter plots show that both are associated with divorce rate. They are also associated with each other.

# load the packages used in this section: rethinking (data, standardize, quap,
# link, precis, coeftab), ggformula (gf_ plots), dplyr (data wrangling),
# patchwork (combining plots), and tidybayes (mean_hdi)
library(rethinking)
library(ggformula)
library(dplyr)
library(patchwork)
library(tidybayes)

data(WaffleDivorce)
Waffles <-
  WaffleDivorce %>%
  mutate(
    WaffleHousesPerCap = WaffleHouses / Population,
    D = standardize(Divorce),
    W = standardize(WaffleHousesPerCap),
    A = standardize(MedianAgeMarriage),
    M = standardize(Marriage)
  )
gf_point(Divorce ~ MedianAgeMarriage, data = Waffles) |
gf_point(Divorce ~ Marriage, data = Waffles) |
gf_point(Marriage ~ MedianAgeMarriage, data = Waffles)

This means that knowing any one (or two) of these is useful for predicting another. The predictions won’t be perfect, but they will be better than predictions we could make without that information.

But how do we figure out which things may be directly causing which others? If our goal were to reduce the divorce rate, what sorts of policies could we enact? Should we try to increase the median age of marriage? Should we try to decrease the marriage rate? Is there something else entirely that we should target?

DAGs and Causes

DAG = Directed Acyclic Graph

  • Used for many things.

  • Here: vertices will be variables and arrows will indicate potential direct causal relationships.

Here are two potential DAGs for our situation.

In both, we could see an association between A and D. In the second there is a direct link between A and D; in the first there are both direct and indirect links.
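One way to draw and inspect these DAGs in R is with the dagitty package. This is just a sketch, and it assumes the two DAGs are as described above: both include A → D and A → M, and only the first also includes M → D.

library(dagitty)

# DAG 1: A influences D directly and also indirectly through M
dag1 <- dagitty("dag{ A -> D; A -> M; M -> D }")
# DAG 2: A influences both M and D, but M has no direct effect on D
dag2 <- dagitty("dag{ A -> D; A -> M }")

# give both DAGs the same simple layout so the plots are comparable
coordinates(dag1) <- list(x = c(A = 0, M = 1, D = 2), y = c(A = 0, M = -1, D = 0))
coordinates(dag2) <- coordinates(dag1)

plot(dag1)
plot(dag2)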

So which of these might be correct? Can we rule one out? Can we say whether the impact of A on D is primarily direct or primarily indirect?

Some important things to notice.

  1. Fitting a model (m5.1) with

\[\begin{align*} D &\sim {\sf Norm}(\mu, \sigma)\\ \mu &= \beta_0 + \beta_A A \end{align*}\]

won’t give us enough information to tell. Such a model only picks up on the total influence of A on D; it doesn’t distinguish between direct and indirect influence.

  2. Additionally fitting a model (m5.2) with

\[\begin{align*} D &\sim {\sf Norm}(\mu, \sigma)\\ \mu &= \beta_0 + \beta_M M \end{align*}\]

also won’t help us. This will pick up on the total association between M and D, including the part that arises because A influences both M and D. In other words, even the second DAG predicts an association between M and D, and m5.2 will find it. The lack of an arrow in that DAG says there is no direct causal link; it does not say there is no association.

So what is the difference between the two DAGs?

  • The first says that if we could change M without changing A, D could change because of the direct influence of M on D.

  • The second says that if we could change M without changing A, D would not change.

    The fancy language for this is D is independent of M, conditional on A, which we can denote as \[D \perp \!\!\! \perp M \mid A\] To see whether this is a good match to our data, we need to fit a model that includes both M and A as predictors and then figure out what that model has to say.
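dagitty can also list the conditional independencies that each DAG implies. A quick check (using the dag1 and dag2 objects from the sketch above) confirms that only the second DAG implies this independence:

# the first DAG implies no conditional independencies at all
impliedConditionalIndependencies(dag1)

# the second DAG implies D _||_ M | A
impliedConditionalIndependencies(dag2)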

Fitting Multiple Regression Models

# m5.1: divorce rate (D) modeled using median age at marriage (A) only
m5.1 <- quap(
  data = Waffles,
  alist(
    D ~ dnorm(mu, sigma),
    mu <- b_0 + b_A * A,
    b_0 ~ dnorm(0, 0.2),
    b_A ~ dnorm(0, 0.5),
    sigma ~  dexp(1)
  )
)

# m5.2: divorce rate (D) modeled using marriage rate (M) only
m5.2 <- quap(
  data = Waffles,
  alist(
    D ~ dnorm(mu, sigma),
    mu <- b_0 + b_M * M,
    b_0 ~ dnorm(0, 0.2),
    b_M ~ dnorm(0, 0.5),
    sigma ~  dexp(1)
  )
)

# m5.3: divorce rate (D) modeled using both A and M
m5.3 <- quap(
  data = Waffles,
  alist(
    D ~ dnorm(mu, sigma),
    mu <- b_0 + b_A * A + b_M * M,
    b_0 ~ dnorm(0, 0.2),
    b_A ~ dnorm(0, 0.5),
    b_M ~ dnorm(0, 0.5),
    sigma ~  dexp(1)
  )
)

Comparing Coefficients

What do \(\beta_A\) and \(\beta_M\) tell us?

Here are some 89% HDIs for these coefficients in three models.

plot(coeftab(m5.1, m5.2, m5.3))

plot(coeftab(m5.1, m5.2, m5.3), par = c("b_A", "b_M"))

Residuals

\[ \mbox{residual} = \mbox{observed} - \mbox{predicted} \]

Since our prediction depends on the parameters, which have a posterior distribution, residuals also have a posterior distribution, which we can summarize with a mean, or median, or HDI. The residuals represent what the model missed, the “left-over”.

We can investigate residuals from any model simply by comparing predictions to observations. One set of residuals that can be useful for understanding our model is the residuals we get when we predict one predictor from another. We can then ask how well these residuals predict the original response. That is, how much does information in one predictor that is not captured by the other help us predict the response.

# m5.4: marriage rate (M) predicted from median age at marriage (A)
m5.4 <- quap(
  data = Waffles,
  alist(
    M ~ dnorm(mu, sigma),
    mu <- b_0 + b_A * A, 
    b_0 ~ dnorm(0, 0.2),
    b_A ~ dnorm(0, 0.5),
    sigma ~  dexp(1)
  )
)

# m5.4a: median age at marriage (A) predicted from marriage rate (M)
m5.4a <- quap(
  data = Waffles,
  alist(
    A ~ dnorm(mu, sigma),
    mu <- b_0 + b_M * M, 
    b_0 ~ dnorm(0, 0.2),
    b_M ~ dnorm(0, 0.5),
    sigma ~  dexp(1)
  )
)
# posterior samples of mu (the model's predicted M) for each state
Link5.4 <- link(m5.4)
Post5.4 <-
  Link5.4 %>% apply(2, mean_hdi) %>%   # posterior mean and HDI of mu, one row per state
  bind_rows() %>%
  bind_cols(Waffles) %>%
  mutate(resid = M - y)                # residual: observed M minus posterior mean prediction

p1 <- Post5.4 %>%
  # gf_pointrange(y + ymin + ymax ~ A, color = "steelblue", size = 1.2, alpha = 0.6) %>%
  gf_segment(y + M ~ A + A, color = "steelblue", alpha = 0.6) %>%
  gf_point(M ~ A) %>%
  gf_line(y ~ A)

p2 <- Post5.4 %>%
  gf_point(D ~ resid) %>%
  gf_text(D ~ resid, label = ~Loc, data = Post5.4 %>% filter(abs(resid) > 1.1),
          alpha = 0.8, size = 5, color = "steelblue", vjust = 0) %>%
  gf_labs(
    x = "age at marriage residuals"
  )

Link5.4a <- link(m5.4a)
Post5.4a <-
  Link5.4a %>% apply(2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Waffles) %>%
  mutate(resid = A - y)

p3 <- Post5.4a %>%
  # gf_pointrange(y + ymin + ymax ~ A, color = "steelblue", size = 1.2, alpha = 0.6) %>%
  gf_segment(y + A ~ M + M, color = "steelblue", alpha = 0.6) %>%
  gf_point(A ~ M) %>%
  gf_line(y ~ M)

p4 <- Post5.4a %>%
  gf_point(D ~ resid) %>%
  gf_text(D ~ resid, label = ~Loc, data = Post5.4a %>% filter(abs(resid) > 1.1),
          alpha = 0.8, size = 5, color = "steelblue", vjust = 0) %>%
  gf_labs(
    x = "marriage rate residuals"
  )
(p1 / p2) | (p3 / p4)

Posterior Predictive Plots

# posterior mean and HDI of the predicted (standardized) divorce rate for each state
Avg5.3 <-
  link(m5.3) %>%
  apply(2, mean_hdi) %>%
  bind_rows() %>%
  bind_cols(Waffles) 

Avg5.3 %>%
  gf_pointrange(y + ymin + ymax ~ D) %>%
  gf_abline(slope = ~1, intercept = ~0, inherit = FALSE,
            linetype = "dotted", color = "red", alpha = 0.8) %>%
  gf_text(y ~ D, label = ~ Loc, size = 8,
          data = Avg5.3 %>% filter(abs(y - D) > 1),
          color = "steelblue", alpha = 0.8) %>%
  gf_labs(
    x = "divorce rate (std)",
    y = "predicted divorce rate (std)"
  )

Categorical Variables

We can use categorical variables as well as continuous variables in multiple regression. We have seen one example of this before:

library(rethinking)
data(Howell1)

HowellAdults <-
  Howell1 %>%
  filter(age >= 18) %>%
  mutate(weight_s = standardize(weight))

m4.4a <- 
  quap(
    data = HowellAdults,
    alist(
      height ~ dnorm(mu, sigma),
      mu    <- a_0 + b_1 * weight_s + b_2 * male,
      a_0    ~ dnorm(178, 20),
      b_1    ~ dlnorm(0, 1),
      b_2    ~ dnorm(0, 10),
      sigma  ~ dunif(0, 50)
    )
  )

precis(m4.4a)
##             mean        sd       5.5%      94.5%
## a_0   151.545752 0.3374574 151.006430 152.085075
## b_1     4.105255 0.2673816   3.677928   4.532583
## b_2     6.516782 0.5332875   5.664486   7.369079
## sigma   4.254026 0.1603384   3.997774   4.510277

But there is another way we could do this.

HowellAdults <-
  HowellAdults %>% 
  mutate( sex = 1 + male)    # convert from 0/1 to 1/2

m4.4b <- 
  quap(
    data = HowellAdults,
    alist(
      height ~ dnorm(mu, sigma),
      mu    <- a[sex] + b * weight_s,
      a[sex] ~ dnorm(178, 20),
      b      ~ dlnorm(0, 1),
      sigma  ~ dunif(0, 50)
    )
  )

precis(m4.4b, depth = 2)
##             mean        sd       5.5%      94.5%
## a[1]  151.536006 0.3377269 150.996253 152.075758
## a[2]  158.078967 0.3631539 157.498577 158.659358
## b       4.098354 0.2674908   3.670853   4.525856
## sigma   4.254090 0.1603452   3.997828   4.510353

So we have two ways:

  1. indicator variable approach – use a 0/1 variable to indicate whether an individual belongs to a group.

  2. index variable approach – use a variable that provides an index to the group number.

precis(m4.4a)
##             mean        sd       5.5%      94.5%
## a_0   151.545752 0.3374574 151.006430 152.085075
## b_1     4.105255 0.2673816   3.677928   4.532583
## b_2     6.516782 0.5332875   5.664486   7.369079
## sigma   4.254026 0.1603384   3.997774   4.510277
precis(m4.4b)
## 2 vector or matrix parameters hidden. Use depth=2 to show them.
##           mean        sd     5.5%    94.5%
## b     4.098354 0.2674908 3.670853 4.525856
## sigma 4.254090 0.1603452 3.997828 4.510353
# posterior contrast between the two groups: a[1] - a[2]
extract.samples(m4.4b) %>% 
  as_tibble() %>% 
  mutate(b2 = a[, 1] - a[, 2]) %>%
  mean_hdi(b2)
## # A tibble: 1 x 6
##      b2 .lower .upper .width .point .interval
##   <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
## 1 -6.55  -7.59  -5.49   0.95 mean   hdi

Each of these can be generalized to categorical variables with more than two levels. It is easy to see how this would work for index variables. For the indicator variable approach, we would need to introduce \(k-1\) indicator variables, where \(k\) is the number of levels. The indicator variable approach is commonly used in frequentist statistics, but the index variable approach often works better in our setting, and it is easier to generalize to hierarchical models later. You should be familiar with both approaches.
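As a purely illustrative sketch, here is one way to create both codings for a hypothetical three-level grouping variable; the variable region and its levels are made up and are not part of our data sets.

library(dplyr)

# hypothetical three-level grouping variable, for illustration only
Fake <- tibble(region = c("north", "south", "west", "north"))

Fake %>%
  mutate(
    # index approach: a single variable holding the group number (1, 2, or 3);
    # in a model this would be used as  mu <- a[region_idx] + ...
    region_idx = as.integer(factor(region)),
    # indicator approach: k - 1 = 2 zero/one variables, with "north" as the baseline;
    # in a model these would be used as  mu <- a_0 + b_s * south + b_w * west + ...
    south = as.integer(region == "south"),
    west  = as.integer(region == "west")
  )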