Fit models illustrating collider bias
Some things along the way:

- tribble() for row-wise descriptions of a tibble.
- CalvinBayes::gg_dag() for pretty plots of DAGs.

This material is based on Rethinking 6.3.
A reasonable DAG:
library(ggdag)
dag_coords <-
  tribble(~name, ~x, ~y,
          "T",   1,  1,
          "S",   2,  1,
          "N",   3,  1)

dagify(S ~ T + N, coords = dag_coords) %>%
  gg_dag()
This makes S a collider.
Here’s how McElreath explains it:
The fact that two arrows enter S means it is a collider. The core concept is easy to understand: When you condition on a collider, it creates statistical–but not necessarily causal–associations among its causes. In this case, once you learn that a proposal has been selected (\(S\)), then learning its trustworthiness (\(T\)) also provides information about its newsworthiness (\(N\)). Why? Because if, for example, a selected proposal has low trustworthiness, then it must have high newsworthiness. Otherwise it wouldn’t have been funded. The same works in reverse: If a proposal has low newsworthiness, we’d infer that it must have higher than average trustworthiness. Otherwise it would not have been selected for funding. (p. 176, emphasis in the original)
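To see the effect numerically, here is a quick sketch of our own (similar in spirit to the simulation that opens Rethinking Chapter 6; the names Proposals, T, N, and S are ours): simulate proposals with independent trustworthiness and newsworthiness, select the top 10% by their sum, and compare the correlation before and after selection.

# Sketch: selecting on a collider (S) induces a negative association
# between T and N among selected proposals, even though the two are
# independent in the full population.
set.seed(1914)
Proposals <-
  tibble(
    T = rnorm(200),   # trustworthiness
    N = rnorm(200)    # newsworthiness
  ) %>%
  mutate(S = (T + N) >= quantile(T + N, 0.9))   # top 10% get selected

Proposals %>% summarise(cor_all = cor(T, N))                     # near 0
Proposals %>% filter(S) %>% summarise(cor_selected = cor(T, N))  # clearly negative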
Let’s consider another example where happiness and age both directly influence whether someone gets married, but getting married doesn’t change your age or your happiness.
Here’s the new DAG – basically the same as in the previous example.
# instead of building the coordinates from scratch, we can
# just relabel our previous coordinates using mutate()
dagify(M ~ H + A,
       coords = dag_coords %>%
         mutate(name = c("H", "M", "A"))) %>%
  gg_dag()
In this made-up example,
happiness (\(H\)) and age (\(A\)) both cause marriage (\(M\)). Marriage is therefore a collider. Even though there is no causal association between happiness and age, if we condition on marriage–which means here, if we include it as a predictor in a regression–then it will induce a statistical association between age and happiness. And this can mislead us to think that happiness changes with age, when in fact it is constant.
To convince you of this, let’s do another simulation. (pp. 176–177)
We could use McElreath’s rethinking::sim_happiness(), or we could redo that computation using code that looks more familiar to us. Either way, the simulation works as follows: each year everyone ages, some people get married, old people die, and new people are born.
# Return a data frame with n 1-year-olds of varying happiness
new_borns <- function(n = 20) {
  tibble(
    A = 1,                           # 1 year old
    M = 0,                           # not married
    H = seq(-2, 2, length.out = n)   # range of happiness scores
  )
}
# Everyone ages;
# Some people get married;
# Old people die;
# New people are born.
update_population <- function(
    pop, N_births = 20, aom = 18, max_age = 65) {   # aom = age of marriage
  pop %>%
    mutate(
      A = A + 1,   # everyone gets one year older
      # unmarried people who are at least aom years old may marry;
      # happier people are more likely to marry (H shifts the log odds)
      M = ifelse(M >= 1, 1,
                 (A >= aom) * rbinom(nrow(pop), 1, inv_logit(H - 4)))
    ) %>%
    filter(A <= max_age) %>%   # old people die
    bind_rows(                 # new people are born
      new_borns(N_births)
    )
}
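As a quick check that these functions behave as intended, we can advance a small cohort one simulated year. (This assumes the rethinking package, which provides inv_logit(), is loaded.)

# 5 newborns age to 2; none can marry yet (A < aom); 5 new 1-year-olds appear
new_borns(5) %>% update_population(N_births = 5)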
# Run the population simulation for 1000 years.
set.seed(1977)
Happy <- new_borns(20)   # year 1
# one of the few situations where a for loop in R makes sense
for (i in 2:1000) {      # years 2 through 1000
  Happy <- update_population(Happy, aom = 18, max_age = 65, N_births = 20)
}
Happy <- Happy %>% rename(age = A, married = M, happiness = H)
Happy %>% reactable::reactable()
Summarize the variables.
df_stats(~ married + age + happiness, data = Happy) %>% pander()
response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
married | 0 | 0 | 0 | 1 | 1 | 0.2977 | 0.4574 | 1300 | 0 |
age | 1 | 17 | 33 | 49 | 65 | 33 | 18.77 | 1300 | 0 |
happiness | -2 | -1 | -1.11e-16 | 1 | 2 | -1e-16 | 1.214 | 1300 | 0 |
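As a sanity check, in steady state the population should hold max_age * N_births = 65 * 20 = 1300 people, matching the n column above.

nrow(Happy)   # 65 * 20 = 1300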
Here’s our version of Figure 6.4.
Happy %>%
  mutate(
    married = factor(married, labels = c("unmarried", "married"))
  ) %>%
  gf_point(happiness ~ age, color = ~married, size = 1.75) %>%
  gf_refine(
    scale_color_manual(NULL, values = c("grey85", "forestgreen")),
    scale_x_continuous(expand = c(.015, .015))
  ) %>%
  gf_theme(panel.grid = element_blank())
Here’s the likelihood for the simple Gaussian multivariable model predicting happiness from age and marriage status:
\[ \begin{align*} \text{happiness}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha_{\text{married}[i]} + \beta_1 \text{age}_i \end{align*} \]
where \(\text{married}[i]\) is the marriage status of individual \(i\). Here we make Happy18, the subset of Happy containing only those 18 and up. We then make a new age variable, A, which is rescaled so that age 18 becomes 0, age 65 becomes 1, and so on.
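A one-line check of the rescaling formula used below:

(c(18, 65) - 18) / (65 - 18)   # ages 18 and 65 map to 0 and 1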
Happy18 <-
  Happy %>%
  filter(age > 17) %>%
  mutate(
    A = (age - 18) / (65 - 18),
    midx = married + 1   # 0/1 -> 1/2
  )

df_stats(~ age + A + married + midx + happiness, data = Happy18) %>%
  pander::pander()
response | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
age | 18 | 29.75 | 41.5 | 53.25 | 65 | 41.5 | 13.86 | 960 | 0 |
A | 0 | 0.25 | 0.5 | 0.75 | 1 | 0.5 | 0.2949 | 960 | 0 |
married | 0 | 0 | 0 | 1 | 1 | 0.4031 | 0.4908 | 960 | 0 |
midx | 1 | 1 | 1 | 2 | 2 | 1.403 | 0.4908 | 960 | 0 |
happiness | -2 | -1 | -1.11e-16 | 1 | 2 | -1e-16 | 1.215 | 960 | 0 |
Now let’s fit two models: one with both predictors and one without marriage status.

With respect to priors, note that happiness runs from \(-2\) to \(2\) and the rescaled age \(A\) runs from \(0\) to \(1\), so the steepest conceivable slope is \((2 - (-2))/1 = 4\); the \(\operatorname{Normal}(0, 2)\) prior on ba puts about 95% of its mass within \(\pm 4\). Likewise, the \(\operatorname{Normal}(0, 1)\) prior on each intercept keeps most of the prior mass inside the observable happiness range.
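A quick look at how much slope that prior allows:

# middle 95% of the Normal(0, 2) prior on the slope ba
qnorm(c(0.025, 0.975), mean = 0, sd = 2)   # roughly -3.9 to 3.9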
u6.9 <-
  ulam(
    data = Happy18,
    alist(
      happiness ~ dnorm(mu, sigma),
      mu <- a[midx] + ba * A,
      a[midx] ~ dnorm(0, 1),
      ba ~ dnorm(0, 2),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.9")
stanfit(u6.9) %>%
  mcmc_areas(pars = vars(-lp__))
precis(u6.9, depth = 2) %>% pander()
param | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
a[1] | -0.1843 | 0.06751 | -0.2914 | -0.07692 | 1633 | 1.004 |
a[2] | 1.247 | 0.08877 | 1.107 | 1.389 | 1685 | 1.003 |
ba | -0.769 | 0.1178 | -0.9605 | -0.5857 | 1581 | 1.004 |
sigma | 1.018 | 0.02311 | 0.9815 | 1.055 | 2025 | 1.001 |
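The married (a[2]) and unmarried (a[1]) intercepts are clearly different. If you want that contrast directly, one way is to work with posterior samples (a sketch using rethinking::extract.samples()):

post <- extract.samples(u6.9)
# posterior for the married-minus-unmarried difference in mean happiness
quantile(post$a[, 2] - post$a[, 1], probs = c(0.055, 0.5, 0.945))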
Now drop marriage status, midx.
u6.10 <-
  ulam(
    data = Happy18,
    alist(
      happiness ~ dnorm(mu, sigma),
      mu <- a + ba * A,
      a ~ dnorm(0, 1),
      ba ~ dnorm(0, 2),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.10")
stanfit(u6.10) %>%
  mcmc_areas(pars = vars(-lp__))
precis(u6.10) %>% pander()
param | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
a | -0.002135 | 0.07448 | -0.1236 | 0.1164 | 1501 | 1.004 |
ba | 0.004196 | 0.1283 | -0.1997 | 0.2091 | 1480 | 1.005 |
sigma | 1.216 | 0.02808 | 1.173 | 1.262 | 2049 | 1.001 |
Wow. So when we take out marriage status, the coefficient for age drops to zero.
The pattern above is exactly what we should expect when we condition on a collider. The collider is marriage status. It is a common consequence of age and happiness. As a result, when we condition on it, we induce a spurious association between the two causes. So it looks like, to model [u6.9], that age is negatively associated with happiness. But this is just a statistical association, not a causal association. Once we know whether someone is married or not, then their age does provide information about how happy they are. (p. 179)
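As an aside, predictive model comparison cannot rescue us here. A sketch using rethinking::compare(): it ranks the models by estimated predictive accuracy (WAIC), and the confounded model can easily come out on top, because predictive accuracy is not a test of causal correctness.

compare(u6.9, u6.10)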
A little further down, McElreath gets rough:
It’s easy to plead with this example. Shouldn’t marriage also influence happiness? What if happiness does change with age? But this misses the point. If you don’t have a causal model, you can’t make inferences from a multiple regression. And the regression itself does not provide the evidence you need to justify a causal model. Instead, you need some science. (pp. 179–180, emphasis added)
It gets worse.
“Unmeasured causes can still induce collider bias. So I’m sorry to say that we also have to consider the possibility that our DAG may be haunted” (p. 180, emphasis added).
Example: Let’s consider the education levels in three generations of a family: grandparents, parents, and children.
Here’s the unhaunted DAG.
dag_coords <-
  tribble(~name, ~x, ~y,
          "G",   1,  2,
          "P",   2,  2,
          "C",   2,  1)

dagify(P ~ G,
       C ~ P + G,
       coords = dag_coords) %>%
  gg_dag()
This says that grandparents’ education influences both their children’s (\(P\)) and their grandchildren’s (\(C\)) education, and that parents’ education influences their children’s. These all seem reasonable – parents with high education levels probably encourage and support their children and grandchildren to also have high levels of education.
Now we add the haunting variable, U (U for unmeasured).
dag_coords <-
  tribble(~name, ~x,  ~y,
          "G",   1,   2,
          "P",   2,   2,
          "C",   2,   1,
          "U",   2.5, 1.5)

dagify(P ~ G + U,
       C ~ P + G + U,
       coords = dag_coords) %>%
  gg_dag(highlight = "U")
This says that there is something else (perhaps the neighborhood where someone lives) that affects the education levels of both parents and children.
Let’s simulate some data where we assign values to the sizes of these various effects. Note that we are setting the direct effect of grandparents on children to 0.
family_sim <- function(
    n,
    b_gp = 1,   # direct effect of G on P
    b_gc = 0,   # direct effect of G on C
    b_pc = 1,   # direct effect of P on C
    b_up = 2,   # direct effect of U on P
    b_uc = 2,   # direct effect of U on C
    seed = 1
) {
  set.seed(seed)
  tibble(
    u = 2 * rbinom(n, size = 1, prob = .5) - 1,   # 1 or -1
    g = rnorm(n, mean = 0, sd = 1)
  ) %>%
    mutate(p = rnorm(n, mean = b_gp * g + b_up * u, sd = 1)) %>%
    mutate(c = rnorm(n, mean = b_pc * p + b_gc * g + b_uc * u, sd = 1))
}
Families <- family_sim(200)
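Since neither g nor u causes the other, they should be essentially uncorrelated in the simulated data; a quick check:

Families %>% summarise(cor_gu = cor(g, u))   # should be near 0

Now we regress children’s education on parents’ and grandparents’ education, leaving u out of the model, just as we would be forced to do if it were unmeasured.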
u6.11 <-
  ulam(
    data = Families,
    alist(
      c ~ dnorm(mu, sigma),
      mu <- a + b_pc * p + b_gc * g,
      a ~ dnorm(0, 1),
      c(b_pc, b_gc) ~ dnorm(0, 1),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.11")
stanfit(u6.11) %>%
  mcmc_areas(pars = vars(-lp__))
precis(u6.11) %>% pander()
param | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
a | -0.118 | 0.1002 | -0.2777 | 0.03772 | 5075 | 1 |
b_gc | -0.8383 | 0.1077 | -1.009 | -0.6643 | 4119 | 1.001 |
b_pc | 1.786 | 0.04458 | 1.715 | 1.859 | 4070 | 0.9995 |
sigma | 1.428 | 0.07248 | 1.32 | 1.547 | 4463 | 0.9997 |
Notice that the posterior for b_gc is solidly negative, even though the true direct effect of grandparents on children is 0 in our simulation: conditioning on p, a collider of g and u, induces a spurious association. Here’s our version of Figure 6.5, which shows what’s going on.
Families2 <-
  Families %>%
  mutate(
    group = ifelse(p >= quantile(p, prob = .45) & p <= quantile(p, prob = .60),
                   "middle", "tails"),
    u = factor(u, labels = c("bad", "good"))
  )

gf_point(c ~ g, data = Families2,
         color = ~u, shape = ~group, size = 2.5, stroke = 1/4) %>%
  gf_lm(c ~ g, data = Families2 %>% filter(group == "middle"),
        group = 1, inherit = FALSE, color = "gray50", fullrange = TRUE) %>%
  gf_text(5.5 ~ -2, label = "good neighborhoods", inherit = FALSE,
          color = "red", data = NA) %>%
  gf_text(-8 ~ 1.5, label = "bad neighborhoods", inherit = FALSE,
          color = "black", data = NA) %>%
  gf_refine(
    scale_shape_manual(values = c(19, 1)),
    scale_color_manual(values = c("black", "red"))
  ) %>%
  gf_labs(
    shape = "Parents' education",
    color = "Neighborhood",
    title = "Solid dots for parents in 45th to 60th percentile for education"
  )
Now refit the model, this time conditioning on the (normally unmeasured) variable u as well.

u6.12 <-
  ulam(
    data = Families,
    alist(
      c ~ dnorm(mu, sigma),
      mu <- a + b_pc * p + b_gc * g + b_uc * u,
      a ~ dnorm(0, 1),
      c(b_pc, b_gc, b_uc) ~ dnorm(0, 1),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.12")
stanfit(u6.12) %>%
  mcmc_areas(pars = vars(-lp__))
precis(u6.12) %>% pander()
param | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
a | -0.1214 | 0.07131 | -0.2368 | -0.008475 | 3282 | 1 |
b_uc | 1.992 | 0.1556 | 1.738 | 2.247 | 1809 | 1.002 |
b_gc | -0.04266 | 0.1026 | -0.2067 | 0.118 | 1967 | 1.001 |
b_pc | 1.013 | 0.06927 | 0.9036 | 1.127 | 1753 | 1.002 |
sigma | 1.035 | 0.05181 | 0.9563 | 1.121 | 2994 | 0.9994 |
Now the posterior for b_gc is hovering around 0, where it belongs: with u in the model, conditioning on p no longer opens a biasing path between g and c.
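In real data, of course, u is unmeasured and we could not fit u6.12 at all. With simulated data we can run one more hypothetical check: turn the neighborhood effects off entirely (b_up = b_uc = 0), so that p is no longer a collider of g and u; then even a model without u should recover b_gc near 0. (Here we use a quick lm() fit rather than refitting with ulam(); Families_no_u is our own name.)

# simulate families with no neighborhood effect at all
Families_no_u <- family_sim(200, b_up = 0, b_uc = 0)
coef(lm(c ~ p + g, data = Families_no_u))   # coefficient on g should be near 0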