Fit models illustrating collider bias
Some things along the way:

- tribble() for row-wise descriptions of a tibble.
- CalvinBayes::gg_dag() for pretty plots of DAGs.

This material is based on Rethinking 6.3.
A reasonable DAG:
library(ggdag)
dag_coords <-
  tribble(~name, ~x, ~y,
          "T",   1,  1,
          "S",   2,  1,
          "N",   3,  1)

dagify(S ~ T + N, coords = dag_coords) %>%
  gg_dag()
This makes S a collider.
Here’s how McElreath explains it:
The fact that two arrows enter S means it is a collider. The core concept is easy to understand: When you condition on a collider, it creates statistical–but not necessarily causal–associations among its causes. In this case, once you learn that a proposal has been selected (\(S\)), then learning its trustworthiness (\(T\)) also provides information about its newsworthiness (\(N\)). Why? Because if, for example, a selected proposal has low trustworthiness, then it must have high newsworthiness. Otherwise it wouldn’t have been funded. The same works in reverse: If a proposal has low newsworthiness, we’d infer that it must have higher than average trustworthiness. Otherwise it would not have been selected for funding. (p. 176, emphasis in the original)
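To see the effect numerically, here is a quick sketch of our own (similar in spirit to the simulation that opens Rethinking Chapter 6; the names Proposals, T, N, and S are ours): simulate proposals with independent trustworthiness and newsworthiness, select the top 10% by their sum, and compare the correlation before and after selection.

# Sketch: selecting on a collider (S) induces a negative association
# between T and N among selected proposals, even though the two are
# independent in the full population.
set.seed(1914)
Proposals <-
  tibble(
    T = rnorm(200),   # trustworthiness
    N = rnorm(200)    # newsworthiness
  ) %>%
  mutate(S = (T + N) >= quantile(T + N, 0.9))   # top 10% get selected

Proposals %>% summarise(cor_all = cor(T, N))                     # near 0
Proposals %>% filter(S) %>% summarise(cor_selected = cor(T, N))  # clearly negative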
Let’s consider another example where happiness and age both directly influence whether someone gets married, but getting married doesn’t change your age or your happiness.
Here’s the new DAG – basically the same as in the previous example.
# instead of building the coordinates from scratch, we can
# just relabel our previous coordinates using mutate()
dagify(M ~ H + A,
       coords = dag_coords %>%
         mutate(name = c("H", "M", "A"))) %>%
  gg_dag()
In this made-up example,
happiness (\(H\)) and age (\(A\)) both cause marriage (\(M\)). Marriage is therefore a collider. Even though there is no causal association between happiness and age, if we condition on marriage–which means here, if we include it as a predictor in a regression–then it will induce a statistical association between age and happiness. And this can mislead us to think that happiness changes with age, when in fact it is constant.
To convince you of this, let’s do another simulation. (pp. 176–177)
We could use McElreath’s rethinking::sim_happiness(), or we could redo that computation using code that looks more familiar to us. Either way, the simulation works as follows: each year everyone ages, some people get married, old people die, and new people are born.
# Return a data frame with n 1-year-olds of varying happiness
new_borns <- function(n = 20) {
  tibble(
    A = 1,                           # 1 year old
    M = 0,                           # not married
    H = seq(-2, 2, length.out = n)   # range of happiness scores
  )
}
# Everyone ages;
# Some people get married;
# Old people die;
# New people are born.
update_population <- function(
    pop, N_births = 20, aom = 18, max_age = 65) {   # aom = age of marriage
  pop %>%
    mutate(
      A = A + 1,   # everyone gets one year older
      # unmarried people who are at least aom years old may marry;
      # happier people are more likely to marry (H shifts the log odds)
      M = ifelse(M >= 1, 1,
                 (A >= aom) * rbinom(nrow(pop), 1, inv_logit(H - 4)))
    ) %>%
    filter(A <= max_age) %>%   # old people die
    bind_rows(                 # new people are born
      new_borns(N_births)
    )
}
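As a quick check that these functions behave as intended, we can advance a small cohort one simulated year. (This assumes the rethinking package, which provides inv_logit(), is loaded.)

# 5 newborns age to 2; none can marry yet (A < aom); 5 new 1-year-olds appear
new_borns(5) %>% update_population(N_births = 5)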
# Run the population simulation for 1000 years.
set.seed(1977)
Happy <- new_borns(20)   # year 1
# one of the few situations where a for loop in R makes sense
for (i in 2:1000) {      # years 2 through 1000
  Happy <- update_population(Happy, aom = 18, max_age = 65, N_births = 20)
}
Happy <- Happy %>% rename(age = A, married = M, happiness = H)
Happy %>% reactable::reactable()
Summarize the variables.
df_stats(~ married + age + happiness, data = Happy) %>% pander()
response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
married | 0 | 0 | 0 | 1 | 1 | 0.2977 | 0.4574 | 1300 | 0 |
age | 1 | 17 | 33 | 49 | 65 | 33 | 18.77 | 1300 | 0 |
happiness | -2 | -1 | -1.11e-16 | 1 | 2 | -1e-16 | 1.214 | 1300 | 0 |
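As a sanity check, in steady state the population should hold max_age * N_births = 65 * 20 = 1300 people, matching the n column above.

nrow(Happy)   # 65 * 20 = 1300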
Here’s our version of Figure 6.4.
Happy %>%
  mutate(
    married = factor(married, labels = c("unmarried", "married"))
  ) %>%
  gf_point(happiness ~ age, color = ~married, size = 1.75) %>%
  gf_refine(
    scale_color_manual(NULL, values = c("grey85", "forestgreen")),
    scale_x_continuous(expand = c(.015, .015))
  ) %>%
  gf_theme(panel.grid = element_blank())
Here’s the likelihood for the simple Gaussian multivariable model predicting happiness from age and marriage status:
\[ \begin{align*} \text{happiness}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha_{\text{married}[i]} + \beta_1 \text{age}_i \end{align*} \]
where \(\text{married}[i]\) is the marriage status of individual \(i\). Here we make Happy18, the subset of Happy containing only those 18 and up. We then make a new age variable, A, which is rescaled so that age 18 becomes 0, age 65 becomes 1, and so on.
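A one-line check of the rescaling formula used below:

(c(18, 65) - 18) / (65 - 18)   # ages 18 and 65 map to 0 and 1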
Happy18 <-
  Happy %>%
  filter(age > 17) %>%
  mutate(
    A = (age - 18) / (65 - 18),
    midx = married + 1   # 0/1 -> 1/2
  )

df_stats(~ age + A + married + midx + happiness, data = Happy18) %>%
  pander::pander()
response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
age | 18 | 29.75 | 41.5 | 53.25 | 65 | 41.5 | 13.86 | 960 | 0 |
A | 0 | 0.25 | 0.5 | 0.75 | 1 | 0.5 | 0.2949 | 960 | 0 |
married | 0 | 0 | 0 | 1 | 1 | 0.4031 | 0.4908 | 960 | 0 |
midx | 1 | 1 | 1 | 2 | 2 | 1.403 | 0.4908 | 960 | 0 |
happiness | -2 | -1 | -1.11e-16 | 1 | 2 | -1e-16 | 1.215 | 960 | 0 |
Now let’s fit two models: one with both predictors and one without marriage status.

With respect to priors, note that happiness runs from \(-2\) to \(2\) and the rescaled age \(A\) runs from \(0\) to \(1\), so the steepest conceivable slope is \((2 - (-2))/1 = 4\); the \(\operatorname{Normal}(0, 2)\) prior on ba puts about 95% of its mass within \(\pm 4\). Likewise, the \(\operatorname{Normal}(0, 1)\) prior on each intercept keeps most of the prior mass inside the observable happiness range.
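A quick look at how much slope that prior allows:

# middle 95% of the Normal(0, 2) prior on the slope ba
qnorm(c(0.025, 0.975), mean = 0, sd = 2)   # roughly -3.9 to 3.9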
u6.9 <-
  ulam(
    data = Happy18,
    alist(
      happiness ~ dnorm(mu, sigma),
      mu <- a[midx] + ba * A,
      a[midx] ~ dnorm(0, 1),
      ba ~ dnorm(0, 2),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.9")
stanfit(u6.9) %>%
  mcmc_areas(pars = vars(-lp__))
precis(u6.9, depth = 2) %>% pander()
param | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
a[1] | -0.1843 | 0.06751 | -0.2914 | -0.07692 | 1633 | 1.004 |
a[2] | 1.247 | 0.08877 | 1.107 | 1.389 | 1685 | 1.003 |
ba | -0.769 | 0.1178 | -0.9605 | -0.5857 | 1581 | 1.004 |
sigma | 1.018 | 0.02311 | 0.9815 | 1.055 | 2025 | 1.001 |
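The married (a[2]) and unmarried (a[1]) intercepts are clearly different. If you want that contrast directly, one way is to work with posterior samples (a sketch using rethinking::extract.samples()):

post <- extract.samples(u6.9)
# posterior for the married-minus-unmarried difference in mean happiness
quantile(post$a[, 2] - post$a[, 1], probs = c(0.055, 0.5, 0.945))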
Now drop marriage status, midx.
u6.10 <-
  ulam(
    data = Happy18,
    alist(
      happiness ~ dnorm(mu, sigma),
      mu <- a + ba * A,
      a ~ dnorm(0, 1),
      ba ~ dnorm(0, 2),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.10")
stanfit(u6.10) %>%
  mcmc_areas(pars = vars(-lp__))
precis(u6.10) %>% pander()
param | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
a | -0.002135 | 0.07448 | -0.1236 | 0.1164 | 1501 | 1.004 |
ba | 0.004196 | 0.1283 | -0.1997 | 0.2091 | 1480 | 1.005 |
sigma | 1.216 | 0.02808 | 1.173 | 1.262 | 2049 | 1.001 |
Wow. So when we take out marriage status, the coefficient for age drops to zero.
The pattern above is exactly what we should expect when we condition on a collider. The collider is marriage status. It is a common consequence of age and happiness. As a result, when we condition on it, we induce a spurious association between the two causes. So it looks like, to model [u6.9], that age is negatively associated with happiness. But this is just a statistical association, not a causal association. Once we know whether someone is married or not, then their age does provide information about how happy they are. (p. 179)
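As an aside, predictive model comparison cannot rescue us here. A sketch using rethinking::compare(): it ranks the models by estimated predictive accuracy (WAIC), and the confounded model can easily come out on top, because predictive accuracy is not a test of causal correctness.

compare(u6.9, u6.10)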
A little further down, McElreath gets rough:
It’s easy to plead with this example. Shouldn’t marriage also influence happiness? What if happiness does change with age? But this misses the point. If you don’t have a causal model, you can’t make inferences from a multiple regression. And the regression itself does not provide the evidence you need to justify a causal model. Instead, you need some science. (pp. 179–180, emphasis added)
It gets worse.
“Unmeasured causes can still induce collider bias. So I’m sorry to say that we also have to consider the possibility that our DAG may be haunted” (p. 180, emphasis added).
Example: Let’s consider the education levels in three generations of a family: grandparents, parents, and children.
Here’s the unhaunted DAG.
dag_coords <-
  tribble(~name, ~x, ~y,
          "G",   1,  2,
          "P",   2,  2,
          "C",   2,  1)

dagify(P ~ G,
       C ~ P + G,
       coords = dag_coords) %>%
  gg_dag()
This says that grandparents’ education influences both their children’s (\(P\)) and their grandchildren’s (\(C\)) education, and that parents’ education influences their children’s. These all seem reasonable – parents with high education levels probably encourage and support their children and grandchildren to also have high levels of education.
Now we add the haunting variable, U (U for unmeasured).
dag_coords <-
  tribble(~name, ~x,  ~y,
          "G",   1,   2,
          "P",   2,   2,
          "C",   2,   1,
          "U",   2.5, 1.5)

dagify(P ~ G + U,
       C ~ P + G + U,
       coords = dag_coords) %>%
  gg_dag(highlight = "U")
This says that there is something else (perhaps the neighborhood where someone lives) that affects the education levels of both parents and children.
Let’s simulate some data where we assign values to the sizes of these various effects. Note that we are setting the direct effect of grandparents on children to 0.
family_sim <- function(
    n,
    b_gp = 1,   # direct effect of G on P
    b_gc = 0,   # direct effect of G on C
    b_pc = 1,   # direct effect of P on C
    b_up = 2,   # direct effect of U on P
    b_uc = 2,   # direct effect of U on C
    seed = 1
) {
  set.seed(seed)
  tibble(
    u = 2 * rbinom(n, size = 1, prob = .5) - 1,   # 1 or -1
    g = rnorm(n, mean = 0, sd = 1)
  ) %>%
    mutate(p = rnorm(n, mean = b_gp * g + b_up * u, sd = 1)) %>%
    mutate(c = rnorm(n, mean = b_pc * p + b_gc * g + b_uc * u, sd = 1))
}
Families <- family_sim(200)
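Since neither g nor u causes the other, they should be essentially uncorrelated in the simulated data; a quick check:

Families %>% summarise(cor_gu = cor(g, u))   # should be near 0

Now we regress children’s education on parents’ and grandparents’ education, leaving u out of the model, just as we would be forced to do if it were unmeasured.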
u6.11 <-
  ulam(
    data = Families,
    alist(
      c ~ dnorm(mu, sigma),
      mu <- a + b_pc * p + b_gc * g,
      a ~ dnorm(0, 1),
      c(b_pc, b_gc) ~ dnorm(0, 1),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.11")
stanfit(u6.11) %>%
  mcmc_areas(pars = vars(-lp__))
precis(u6.11) %>% pander()
param | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
a | -0.118 | 0.1002 | -0.2777 | 0.03772 | 5075 | 1 |
b_gc | -0.8383 | 0.1077 | -1.009 | -0.6643 | 4119 | 1.001 |
b_pc | 1.786 | 0.04458 | 1.715 | 1.859 | 4070 | 0.9995 |
sigma | 1.428 | 0.07248 | 1.32 | 1.547 | 4463 | 0.9997 |
Notice that the posterior for b_gc is solidly negative, even though the true direct effect of grandparents on children is 0 in our simulation: conditioning on p, a collider of g and u, induces a spurious association. Here’s our version of Figure 6.5, which shows what’s going on.
Families2 <-
  Families %>%
  mutate(
    group = ifelse(p >= quantile(p, prob = .45) & p <= quantile(p, prob = .60),
                   "middle", "tails"),
    u = factor(u, labels = c("bad", "good"))
  )

gf_point(c ~ g, data = Families2,
         color = ~u, shape = ~group, size = 2.5, stroke = 1/4) %>%
  gf_lm(c ~ g, data = Families2 %>% filter(group == "middle"),
        group = 1, inherit = FALSE, color = "gray50", fullrange = TRUE) %>%
  gf_text(5.5 ~ -2, label = "good neighborhoods", inherit = FALSE,
          color = "red", data = NA) %>%
  gf_text(-8 ~ 1.5, label = "bad neighborhoods", inherit = FALSE,
          color = "black", data = NA) %>%
  gf_refine(
    scale_shape_manual(values = c(19, 1)),
    scale_color_manual(values = c("black", "red"))
  ) %>%
  gf_labs(
    shape = "Parents' education",
    color = "Neighborhood",
    title = "Solid dots for parents in 45th to 60th percentile for education"
  )
Now refit the model, this time conditioning on the (normally unmeasured) variable u as well.

u6.12 <-
  ulam(
    data = Families,
    alist(
      c ~ dnorm(mu, sigma),
      mu <- a + b_pc * p + b_gc * g + b_uc * u,
      a ~ dnorm(0, 1),
      c(b_pc, b_gc, b_uc) ~ dnorm(0, 1),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.12")
stanfit(u6.12) %>%
  mcmc_areas(pars = vars(-lp__))
precis(u6.12) %>% pander()
param | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
a | -0.1214 | 0.07131 | -0.2368 | -0.008475 | 3282 | 1 |
b_uc | 1.992 | 0.1556 | 1.738 | 2.247 | 1809 | 1.002 |
b_gc | -0.04266 | 0.1026 | -0.2067 | 0.118 | 1967 | 1.001 |
b_pc | 1.013 | 0.06927 | 0.9036 | 1.127 | 1753 | 1.002 |
sigma | 1.035 | 0.05181 | 0.9563 | 1.121 | 2994 | 0.9994 |
Now the posterior for b_gc is hovering around 0, where it belongs: with u in the model, conditioning on p no longer opens a biasing path between g and c.
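In real data, of course, u is unmeasured and we could not fit u6.12 at all. With simulated data we can run one more hypothetical check: turn the neighborhood effects off entirely (b_up = b_uc = 0), so that p is no longer a collider of g and u; then even a model without u should recover b_gc near 0. (Here we use a quick lm() fit rather than refitting with ulam(); Families_no_u is our own name.)

# simulate families with no neighborhood effect at all
Families_no_u <- family_sim(200, b_up = 0, b_uc = 0)
coef(lm(c ~ p + g, data = Families_no_u))   # coefficient on g should be near 0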