Goals for today

  1. Fit models illustrating collider bias

  2. Some things along the way

    • tribble() for row-wise descriptions of a tibble.
    • CalvinBayes::gg_dag() for pretty plots of DAGs
    • Creating simulations

This material is based on Rethinking 6.3.

DAG of the day

  • The highlighted node C is appropriately called a collider (a minimal sketch is drawn below).
  • Colliders can be embedded in larger DAGs.
  • Colliders have consequences for how we interpret our models.
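
A minimal sketch of such a DAG, drawn with the same tools we use below (X and Y are generic stand-ins for the collider's two causes):

library(ggdag)
# X and Y both cause C, so C is a collider; highlight the collider node
dagify(C ~ X + Y,
       coords = tribble(~name, ~x, ~y,
                        "X", 1, 1,
                        "C", 2, 1,
                        "Y", 3, 1)) %>%
  gg_dag(highlight = "C")   # gg_dag() is the CalvinBayes helper used throughout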

Example: trustworthiness/newsworthiness

  • T: trustworthiness of a proposal
  • N: newsworthiness of a proposal
  • S: whether the proposal is selected

A reasonable DAG:

library(ggdag)
dag_coords <-
  tribble(~name, ~x, ~ y,
          "T", 1, 1,
          "S", 2, 1,
          "N", 3, 1 )

dagify(S ~ T + N, coords = dag_coords) %>%
  gg_dag()

  • Higher trustworthiness \(\to\) higher chance of selection
  • Higher newsworthiness \(\to\) higher chance of selection

This makes S a collider.

  • This induces a (non-causal) negative association between trustworthiness and newsworthiness. Do you see why?

Here’s how McElreath explains it:

The fact that two arrows enter S means it is a collider. The core concept is easy to understand: When you condition on a collider, it creates statistical–but not necessarily causal–associations among its causes. In this case, once you learn that a proposal has been selected (\(S\)), then learning its trustworthiness (\(T\)) also provides information about its newsworthiness (\(N\)). Why? Because if, for example, a selected proposal has low trustworthiness, then it must have high newsworthiness. Otherwise it wouldn’t have been funded. The same works in reverse: If a proposal has low newsworthiness, we’d infer that it must have higher than average trustworthiness. Otherwise it would not have been selected for funding. (p. 176, emphasis in the original)
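
To see this numerically, here is a quick simulation in the spirit of McElreath's selection-distortion example (a sketch, not his exact code): trustworthiness and newsworthiness are generated independently, but among selected proposals they end up negatively correlated.

set.seed(1914)
Proposals <-
  tibble(
    T = rnorm(200),    # trustworthiness
    N = rnorm(200),    # newsworthiness
    total = T + N      # proposals are scored by the sum
  ) %>%
  mutate(S = total >= quantile(total, 0.9))   # top 10% are selected

# correlation among selected proposals -- negative, despite independence
Proposals %>% filter(S) %>% summarise(cor_TN = cor(T, N))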

Example: Collider of false sorrow

Let’s consider another example where happiness and age both directly influence whether someone gets married, but getting married doesn’t change your age or your happiness.

  • H: happiness
  • M: marriage
  • A: age

Here’s the new DAG – basically the same as in the previous example.

# instead of building the coordinates from scratch, we can
# just relabel our previous coordinates using mutate()

dagify(M ~ H + A,
       coords = dag_coords %>%
         mutate(name = c("H", "M", "A"))) %>%
  gg_dag()

In this made-up example,

happiness (\(H\)) and age (\(A\)) both cause marriage (\(M\)). Marriage is therefore a collider. Even though there is no causal association between happiness and age, if we condition on marriage–which means here, if we include it as a predictor in a regression–then it will induce a statistical association between age and happiness. And this can mislead us to think that happiness changes with age, when in fact it is constant.

To convince you of this, let’s do another simulation. (pp. 176–177)

We could use McElreath’s rethinking::sim_happiness(), or we could redo that computation using code that looks more familiar to us. Either way, the simulation works as follows: Each year

  1. Everyone gets one year older.
  2. Some adults marry; happier people are more likely to marry.
  3. Old people die or move away.
  4. New people are born with a variety of happiness ratings.

# Return a data frame with n 1-year-olds of varying happiness
new_borns <- function(n = 20) {
  tibble(
    A = 1,   # 1 year old
    M = 0,   # not married
    H = seq(-2, 2, length.out = n)   # range of happiness scores
  )
}

# Everyone ages;
# Some people get married;
# Old people die;
# New people are born.
update_population <- function(
  pop, N_births = 20, aom = 18, max_age = 65) {
  
  pop %>%
    mutate(
      A = A + 1, # everyone gets one year older
      # some people get married
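      # marriage becomes possible at age aom; the yearly chance rises with happiness H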
      M = ifelse(M >= 1, 1, (A >= aom) * rbinom(nrow(pop), 1, inv_logit(H - 4)))
    ) %>%
    filter(A <= max_age) %>%    # old people die
    bind_rows(                  # new people are born
      new_borns(N_births)
    )
}

# Run the population simulation for 1000 years.
set.seed(1977)
Happy <- new_borns(20)

# one of the few situations where a for loop in R makes sense
for(i in 2:1000) {
  Happy <- update_population(Happy, aom = 18, max_age = 65, N_births = 20)
}
Happy <- Happy %>% rename(age = A, married = M, happiness = H)
Happy %>% reactable::reactable() 

Summarize the variables.

df_stats( ~ married  + age + happiness, data = Happy) %>% pander()

| response  | min | Q1 | median    | Q3 | max | mean   | sd     | n    | missing |
|-----------|-----|----|-----------|----|-----|--------|--------|------|---------|
| married   | 0   | 0  | 0         | 1  | 1   | 0.2977 | 0.4574 | 1300 | 0       |
| age       | 1   | 17 | 33        | 49 | 65  | 33     | 18.77  | 1300 | 0       |
| happiness | -2  | -1 | -1.11e-16 | 1  | 2   | -1e-16 | 1.214  | 1300 | 0       |

Here’s our version of Figure 6.4.

Happy %>% 
  mutate(
    married = factor(married, labels = c("unmarried", "married"))
  ) %>% 
  gf_point(happiness ~ age, color = ~married, size = 1.75) %>%
  gf_refine(
    scale_color_manual(NULL, values = c("grey85", "forestgreen")),
    scale_x_continuous(expand = c(.015, .015)) 
  ) %>%
  gf_theme(panel.grid = element_blank())

Here’s the likelihood for the simple Gaussian multivariable model predicting happiness from age and marriage status:

\[ \begin{align*} \text{happiness}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha_{\text{married}[i]} + \beta_1 \text{age}_i, \end{align*} \]

where \(\text{married}[i]\) is the marriage status of individual \(i\). Here we make Happy18, the subset of Happy containing only those 18 and up. We then make a new age variable, A, which is rescaled so that age 18 becomes 0 and age 65 becomes 1.

Happy18 <-
  Happy %>% 
  filter(age > 17) %>% 
  mutate(
    A = (age - 18) / (65 - 18),
    midx = married + 1    #  0/1 -> 1/2
  )
df_stats(~ age + A + married + midx + happiness, data = Happy18) %>%
  pander::pander()

| response  | min | Q1    | median    | Q3    | max | mean   | sd     | n   | missing |
|-----------|-----|-------|-----------|-------|-----|--------|--------|-----|---------|
| age       | 18  | 29.75 | 41.5      | 53.25 | 65  | 41.5   | 13.86  | 960 | 0       |
| A         | 0   | 0.25  | 0.5       | 0.75  | 1   | 0.5    | 0.2949 | 960 | 0       |
| married   | 0   | 0     | 0         | 1     | 1   | 0.4031 | 0.4908 | 960 | 0       |
| midx      | 1   | 1     | 1         | 2     | 2   | 1.403  | 0.4908 | 960 | 0       |
| happiness | -2  | -1    | -1.11e-16 | 1     | 2   | -1e-16 | 1.215  | 960 | 0       |

Now let’s fit two models: one with both predictors and one without marriage status.

With respect to priors,

> happiness is on an arbitrary scale, in these data, from $-2$ to $+2$. So our imaginary strongest relationship, taking happiness from maximum to minimum, has a slope with rise over run of $(2 - (-2))/1 = 4$. Remember that 95% of the mass of a normal distribution is contained within 2 standard deviations. So if we set the standard deviation of the prior to half of 4, we are saying that we expect 95% of plausible slopes to be less than maximally strong. That isn't a very strong prior, but again, it at least helps bound inference to realistic ranges. Now for the intercepts. Each $\alpha$ is the value of $\mu_i$ when $A_i = 0$. In this case, that means at age 18. So we need to allow $\alpha$ to cover the full range of happiness scores. $\operatorname{Normal}(0, 1)$ will put 95% of the mass in the $-2$ to $+2$ interval. (p. 178)
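
As a quick numeric check on that reasoning (a small sketch of our own, not part of the text's code), we can ask how much prior mass the \(\operatorname{Normal}(0, 2)\) slope prior puts beyond the "maximally strong" slope of \(\pm 4\):

# prior probability that |ba| exceeds 4 under Normal(0, 2) -- roughly 5%
2 * pnorm(-4, mean = 0, sd = 2)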
u6.9 <- 
  ulam(
    data = Happy18,
    alist(
      happiness ~ dnorm(mu, sigma),
      mu <- a[midx] + ba * A,
      a[midx] ~ dnorm(0, 1),
      ba ~ dnorm(0, 2),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.9")
stanfit(u6.9) %>% 
  mcmc_areas(pars = vars(-lp__))

precis(u6.9, depth = 2) %>% pander()

| parameter | mean    | sd      | 5.5%    | 94.5%    | n_eff | Rhat4 |
|-----------|---------|---------|---------|----------|-------|-------|
| a[1]      | -0.1843 | 0.06751 | -0.2914 | -0.07692 | 1633  | 1.004 |
| a[2]      | 1.247   | 0.08877 | 1.107   | 1.389    | 1685  | 1.003 |
| ba        | -0.769  | 0.1178  | -0.9605 | -0.5857  | 1581  | 1.004 |
| sigma     | 1.018   | 0.02311 | 0.9815  | 1.055    | 2025  | 1.001 |

Now drop marriage status, midx.

u6.10 <- 
  ulam(
    data = Happy18,
    alist(
      happiness ~ dnorm(mu, sigma),
      mu <- a + ba * A,
      a ~ dnorm(0, 1),
      ba ~ dnorm(0, 2),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.10")
stanfit(u6.10) %>%
  mcmc_areas(pars = vars(-lp__))

precis(u6.10) %>% pander()

| parameter | mean      | sd      | 5.5%    | 94.5%  | n_eff | Rhat4 |
|-----------|-----------|---------|---------|--------|-------|-------|
| a         | -0.002135 | 0.07448 | -0.1236 | 0.1164 | 1501  | 1.004 |
| ba        | 0.004196  | 0.1283  | -0.1997 | 0.2091 | 1480  | 1.005 |
| sigma     | 1.216     | 0.02808 | 1.173   | 1.262  | 2049  | 1.001 |

Wow. So when we take out marriage status, the coefficient for age drops to zero.

The pattern above is exactly what we should expect when we condition on a collider. The collider is marriage status. It is a common consequence of age and happiness. As a result, when we condition on it, we induce a spurious association between the two causes. So it looks like, to model [u6.9], that age is negatively associated with happiness. But this is just a statistical association, not a causal association. Once we know whether someone is married or not, then their age does provide information about how happy they are. (p. 179)
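
One quick way to see this induced association in our simulated data (a check of our own, not from the text): compare the age–happiness correlation overall with the correlation within each marriage status.

# overall: age and happiness are essentially uncorrelated
Happy18 %>% summarise(cor_age_happiness = cor(age, happiness))

# conditional on marriage status: a (spurious) negative association appears
Happy18 %>%
  group_by(married) %>%
  summarise(cor_age_happiness = cor(age, happiness))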

A little further down, McElreath gets rough:

It’s easy to plead with this example. Shouldn’t marriage also influence happiness? What if happiness does change with age? But this misses the point. If you don’t have a causal model, you can’t make inferences from a multiple regression. And the regression itself does not provide the evidence you need to justify a causal model. Instead, you need some science. (pp. 179–180, emphasis added)

Example: The haunted DAG

It gets worse.

"Unmeasured causes can still induce collider bias. So I’m sorry to say that we also have to consider the possibility that our DAG may be haunted" (p. 180, emphasis added).

Example: Let’s consider the education levels in three generations of a family: grandparents, parents, and children.

Here’s the unhaunted DAG.

dag_coords <-
  tribble(~name, ~x, ~ y,
          "G", 1, 2,
          "P", 2, 2,
          "C", 2, 1)
dagify(P ~ G,
       C ~ P + G,
       coords = dag_coords) %>%
  gg_dag()

This says

  • Grandparents’ education affects parents’ education.
  • Grandparents’ education affects children’s education.
  • Parents’ education affects children’s education.

These all seem reasonable – parents with high education levels probably encourage and support their children and grandchildren to also have high levels of education.

Now we add the haunting variable, U (U for unmeasured).

dag_coords <-
  tribble( ~name, ~x, ~y,
           "G", 1, 2,
           "P", 2, 2,
           "C", 2, 1,
           "U", 2.5, 1.5)

dagify(P ~ G + U,
       C ~ P + G + U,
       coords = dag_coords) %>%
  gg_dag(highlight = "U")

This says that there is something else (perhaps the neighborhood where someone lives) that affects education level of both parents and children.

Let’s simulate some data where we assign values to the size of these various effects. Note that we are setting the direct effect of grandparents on children to 0.

family_sim <- function(
  n,
  b_gp = 1,  # direct effect of G on P
  b_gc = 0,  # direct effect of G on C
  b_pc = 1,  # direct effect of P on C
  b_up = 2,  # direct effect of U on P 
  b_uc = 2,  # direct effect of U on C
  seed = 1
) {
  set.seed(seed)
  tibble(
    u = 2 * rbinom(n, size = 1, prob = .5) - 1,  # 1 or -1
    g = rnorm(n, mean = 0, sd = 1)) %>% 
    mutate(p = rnorm(n, mean = b_gp * g + b_up * u, sd = 1)) %>% 
    mutate(c = rnorm(n, mean = b_pc * p + b_gc * g + b_uc * u, sd = 1))
}

Families <- family_sim(200)
u6.11 <-
  ulam(
    data = Families,
    alist(
      c ~ dnorm(mu, sigma),
      mu <- a + b_pc * p + b_gc * g,
      a ~ dnorm(0, 1),
      c(b_pc, b_gc) ~ dnorm(0,1),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.11")
stanfit(u6.11) %>%
  mcmc_areas(pars = vars(-lp__))

precis(u6.11) %>% pander()

| parameter | mean    | sd      | 5.5%    | 94.5%   | n_eff | Rhat4  |
|-----------|---------|---------|---------|---------|-------|--------|
| a         | -0.118  | 0.1002  | -0.2777 | 0.03772 | 5075  | 1      |
| b_gc      | -0.8383 | 0.1077  | -1.009  | -0.6643 | 4119  | 1.001  |
| b_pc      | 1.786   | 0.04458 | 1.715   | 1.859   | 4070  | 0.9995 |
| sigma     | 1.428   | 0.07248 | 1.32    | 1.547   | 4463  | 0.9997 |

Even though we set the true direct effect of grandparents on children to 0, the model estimates b_gc to be strongly negative. That’s collider bias again: parents’ education is a collider of G and the unmeasured U, so conditioning on it opens a non-causal path between G and C. Here’s our version of Figure 6.5.

Families2 <-
  Families %>% 
  mutate(
    group = ifelse(p >= quantile(p, prob = .45) & p <= quantile(p, prob = .60),
                     "middle", "tails"),
    u = factor(u, labels = c("bad", "good")) 
  )

gf_point(c ~ g, data = Families2,
           color = ~u, shape = ~ group, size = 2.5, stroke = 1/4) %>%
  gf_lm(c ~ g, data = Families2 %>% filter(group == "middle"), 
        group = 1, inherit = FALSE, color = "gray50", fullrange = TRUE) %>%
  gf_text(5.5 ~ -2, label = "good neighborhoods", inherit = FALSE,
          color = "red", data = NA) %>%
  gf_text(-8 ~ 1.5, label = "bad neighborhoods", inherit = FALSE,
          color = "black", data = NA) %>%
  gf_refine(
    scale_shape_manual(values = c(19, 1)), 
    scale_color_manual(values = c("black", "red"))
  ) %>%
  gf_labs(
    shape = "Parents' education",
    color = "Neighborhood",
    title = "Solid dots for parents in 45th to 60th percentile for education"
  )
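
To put a number on what the plot shows (a quick check of our own, using the same middle band of parental education): within that band, g and c are negatively correlated even though the simulated direct effect of G on C is zero.

# correlation of grandparent and child education, conditional on parents
# falling in the 45th to 60th percentile band for education
Families %>%
  filter(p >= quantile(p, 0.45), p <= quantile(p, 0.60)) %>%
  summarise(n = n(), cor_gc = cor(g, c))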
Now refit the model, this time also conditioning on the unmeasured variable u (which we happen to have in the simulated data).

u6.12 <-
  ulam(
    data = Families,
    alist(
      c ~ dnorm(mu, sigma),
      mu <- a + b_pc * p + b_gc * g + b_uc * u,
      a ~ dnorm(0, 1),
      c(b_pc, b_gc, b_uc) ~ dnorm(0,1),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.12")
stanfit(u6.12) %>%
  mcmc_areas(pars = vars(-lp__))

precis(u6.12) %>% pander()

| parameter | mean     | sd      | 5.5%    | 94.5%     | n_eff | Rhat4  |
|-----------|----------|---------|---------|-----------|-------|--------|
| a         | -0.1214  | 0.07131 | -0.2368 | -0.008475 | 3282  | 1      |
| b_uc      | 1.992    | 0.1556  | 1.738   | 2.247     | 1809  | 1.002  |
| b_gc      | -0.04266 | 0.1026  | -0.2067 | 0.118     | 1967  | 1.001  |
| b_pc      | 1.013    | 0.06927 | 0.9036  | 1.127     | 1753  | 1.002  |
| sigma     | 1.035    | 0.05181 | 0.9563  | 1.121     | 2994  | 0.9994 |

Now the posterior for b_gc is hovering around 0, where it belongs.