Goals for today

  1. Fit models illustrating post-treatment bias.

  2. Begin using DAGs to think through modeling choices.

  3. See how we can start building more interesting models by combining elements we already know about in new ways.

This material is based on Rethinking 6.2.

An experiment with plants

A note on the word “experiment”

The word experiment has a specific meaning in statistics. A study is an experiment if one or more of the variables has its values determined by the design of the study. Typically these values are assigned in some random way.

One well-known type of experiment is the randomized controlled trial (RCT). In an RCT, subjects are randomly assigned to two or more groups. Each group is given a different treatment, and one of the treatments is a “control” treatment in which either nothing is done or the current standard treatment is used. This is how the efficacy of the COVID vaccines was established: subjects were randomly assigned to receive either a vaccine or an injection of something inert.

In addition to the response variable (e.g., getting COVID) and the primary explanatory variable (e.g., whether one received a vaccine), an RCT can include many other variables, sometimes called covariates (examples: age, sex, general health, pre-existing health conditions, etc.).

Studies that are not experiments are called observational studies.

Experiment Design

Here are the steps as outlined in the text:

  1. Seed and sprout plants
  2. Measure heights
  3. Apply different antifungal soil treatments (i.e., the treatment)
  4. Measure (a) the heights and (b) the presence of fungus

Simulating data

Let’s simulate some data assuming that

  • applying the treatment reduces the chance of fungus, and
  • plants grow less well when there is fungus.

That still leaves some things to be decided, like

  • What is the distribution of plant heights before we apply the treatment?
  • What is the probability of fungus with/without the antifungal treatment?
  • How much do plants grow with/without fungus?

All of these need to be decided in order to create a simulation. To make that clear, let’s create a function that has arguments for all of these decisions.

plant_sim <- 
  function(
    n = 100, 
    seed = 71, 
    mean0 = 10, sd0 = 2,
    p_fungus = c(0.5, 0.1),
    b0 = 5, b1 = -3,
    sd1 = 1
  ) {
    
    set.seed(seed)
    tibble(
      plant     = 1:n,
      h0        = rnorm(n, mean0, sd0) %>% round(1),
      treatment = sample(0:1, size = n, replace = TRUE),
      # 50% or 10% chance of having fungus, depending on treatment
      fungus    = rbinom(n, size = 1, prob = p_fungus[1 + treatment]),
      # growth is stunted by fungus (on average)
      h1        = h0 + rnorm(n, mean = b0 + b1 * fungus, sd = sd1) %>% round(1)
    )
  }
Plants <- plant_sim(100)
Plants %>%
  glimpse()
## Rows: 100
## Columns: 5
## $ plant     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ h0        <dbl> 9.1, 9.1, 9.0, 10.8, 9.2, 7.6, 7.9, 12.0, 7.8, 12.5, 10.6, 1…
## $ treatment <int> 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, …
## $ fungus    <int> 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ h1        <dbl> 14.7, 12.9, 11.5, 17.5, 11.9, 8.0, 13.5, 15.2, 15.0, 16.6, 1…

And here’s a quick summary:

Plants %>% df_stats(~ h0 + h1 + fungus | treatment) %>% reactable::reactable()

A model for plant growth – no predictors

Let’s see if we can figure out this model:

u6.6 <- 
  ulam(
    data = Plants,
    alist(
      h1 ~ dnorm(mu, sigma),
      mu <- h0 * p,
      p ~ dlnorm(0, 0.25),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.6")
  • p: proportional height change. p = 1 means no change in height.

    • McElreath claims this will make it easier to set a prior, since we only need to think about relative heights and not about the absolute amount of growth.

    • We know \(p > 0\), and we should allow for \(p < 1\) (in case something bad happens and the plants shrivel), but we expect that \(p > 1\) if things are going well with our plants.

  • p ~ dlnorm(0, 0.25)

    gf_dist("lnorm", meanlog = 0, sdlog = 0.25)

    df_stats( ~ rlnorm(1e4, meanlog = 0, sdlog = 0.25), data.frame(), mean_hdi) %>% pander::pander()
    response y ymin ymax .width .point .interval
    rlnorm(10000, meanlog = 0, sdlog = 0.25) 1.037 0.5797 1.571 0.95 mean hdi

    “This prior expects anything from 40% shrinkage up to 50% growth” (p. 172). [Looks like 60% growth might be a fairer statement.]
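We can check that claim directly from the Log-Normal quantiles (base R only, no extra packages needed):

```r
# Central 95% of a Log-Normal(0, 0.25) prior: since log(p) ~ Normal(0, 0.25),
# the 2.5% and 97.5% quantiles of p are exp(+/- 1.96 * 0.25).
qlnorm(c(0.025, 0.975), meanlog = 0, sdlog = 0.25)
# roughly 0.61 to 1.63: about 40% shrinkage up to about 63% growth
```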

As a reminder, here is what the model looks like in mathematical notation:

\[\begin{align*} h_{1i} & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = h_{0i} \cdot p \\ p & \sim \operatorname{Log-Normal}(0, 0.25) \\ \sigma & \sim \operatorname{Exponential}(1). \end{align*}\]

So what does our model say about average growth?

precis(u6.6)
##           mean         sd     5.5%    94.5%    n_eff    Rhat4
## p     1.441864 0.01882442 1.411759 1.472282 2601.160 1.001675
## sigma 1.915346 0.13859657 1.708467 2.147816 2712.361 1.000392
stanfit(u6.6) %>%
  mcmc_areas(pars = vars(-lp__))

So then, the expectation is an increase of about 44 percent relative to \(h_0\). (HPDI: between 41 and 47 percent relative to \(h_0\).)
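As a sanity check, the posterior mean of p should be near the naive estimate: the average ratio \(h_1 / h_0\). Here is a base-R sketch that re-creates the data with plant_sim()'s default settings (same seed, same order of random draws):

```r
# Re-simulate the data with plant_sim()'s default settings (base R version)
set.seed(71)
n <- 100
h0        <- round(rnorm(n, mean = 10, sd = 2), 1)
treatment <- sample(0:1, size = n, replace = TRUE)
fungus    <- rbinom(n, size = 1, prob = c(0.5, 0.1)[1 + treatment])
h1        <- h0 + round(rnorm(n, mean = 5 - 3 * fungus, sd = 1), 1)

# Naive estimate of p: average proportional height change
mean(h1 / h0)
```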

But this isn’t the best model. We’re leaving important predictors on the table.

Modeling the treatment

Let’s model how the treatment and fungus change the proportion of growth.

\[\begin{align*} h_{1i} & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = h_{0i} \cdot p_i \\ p_i & = \alpha + \beta_t \text{treatment}_i + \beta_f \text{fungus}_i \\ \alpha & \sim \operatorname{Log-Normal}(0, 0.2) \\ \beta_t & \sim \operatorname{Normal}(0, 0.5) \\ \beta_f & \sim \operatorname{Normal}(0, 0.5) \\ \sigma & \sim \operatorname{Exponential}(1). \end{align*}\]

u6.7 <- 
  ulam(
    data = Plants,
    alist(
      h1 ~ dnorm(mu, sigma),
      mu <- h0 * p,
      p <- a + bt * treatment + bf * fungus,
      a ~ dlnorm(0, 0.2),
      bt ~ dnorm(0, 0.5),
      bf ~ dnorm(0, 0.5),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.7")
precis(u6.7)
##              mean         sd        5.5%       94.5%    n_eff    Rhat4
## a      1.48868370 0.02445000  1.44897463  1.52788108 1573.736 1.001172
## bt     0.02037458 0.02968505 -0.02790833  0.06716938 1848.148 1.000571
## bf    -0.30301507 0.03625179 -0.36131268 -0.24685010 2254.700 1.001817
## sigma  1.45411556 0.10446599  1.29718604  1.63454811 2190.612 1.000073
stanfit(u6.7) %>%
  mcmc_areas(pars = vars(-lp__))

Since we know that the treatment had a causal effect on growth (we simulated it that way), the results of precis() are perhaps unexpected. What happened?

Leaving out fungus

u6.8 <- 
  ulam(
    data = Plants,
    alist(
      h1 ~ dnorm(mu, sigma),
      mu <- h0 * p,
      p <- a + bt * treatment,
      a ~ dlnorm(0, 0.2),
      bt ~ dnorm(0, 0.5),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.8")
precis(u6.8)
##            mean         sd       5.5%     94.5%    n_eff     Rhat4
## a     1.3901830 0.02747524 1.34790638 1.4349837 1764.302 1.0021633
## bt    0.0892491 0.03719606 0.02964513 0.1496096 1833.643 1.0005896
## sigma 1.8756798 0.13322999 1.67316769 2.1027480 2605.915 0.9995328
stanfit(u6.8) %>%
  mcmc_areas(pars = vars(-lp__))

Modeling fungus

This is a bit of an aside, but it shows how flexible our modeling system is. Here we are modeling a binomial (Bernoulli, actually) response rather than a normal response.

u6.8a <- 
  ulam(
    data = Plants,
    alist(
      fungus ~ dbinom(1, p),        # fungus is 1 with proportion p, else 0
      p <- b0 + bt * treatment,     # 2 different values of p, one with treatment, one without
      b0 ~ dlnorm(0, 0.2),
      bt ~ dnorm(0, 0.2)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 6,
    file = "fits/u6.8a")
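Before trusting the fit, it is worth tabulating the empirical fungus rates directly; b0 should land near the control-group rate and bt near the (negative) difference. A base-R sketch using plant_sim()'s default settings:

```r
# Re-simulate with plant_sim()'s defaults and tabulate fungus by treatment
set.seed(71)
n <- 100
h0        <- round(rnorm(n, mean = 10, sd = 2), 1)  # drawn first to keep the RNG stream aligned
treatment <- sample(0:1, size = n, replace = TRUE)
fungus    <- rbinom(n, size = 1, prob = c(0.5, 0.1)[1 + treatment])

# observed proportion with fungus in each treatment group
tapply(fungus, treatment, mean)
```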

Thinking with DAGs

The DAG we had in mind

Let’s make a DAG.

library(ggdag)  # for fancier plotting of dags and dagify()

# define our coordinates
dag_coords <-
  tibble(name = c("H0", "T", "F", "H1"),
         x    = c(1, 5, 4, 3),
         y    = c(2, 2, 1.5, 1))

dag <-
  dagify(F ~ T,    
         H1 ~ H0 + F,
         coords = dag_coords)

CalvinBayes includes a little function for creating a ggplot from a dag.

library(CalvinBayes)
gg_dag(dag)

The DAG clarifies

that learning the treatment tells us nothing about the outcome, once we know the fungus status.

An even more DAG way to say this is that conditioning on \(F\) induces d-separation. The “d” stands for directional. D-separation means that some variables on a directed graph are independent of others. There is no path connecting them. In this case, \(H_1\) is d-separated from \(T\), but only when we condition on \(F\). Conditioning on \(F\) effectively blocks the directed path \(T \rightarrow F \rightarrow H_1\), making \(T\) and \(H_1\) independent (d-separated). (p. 174, emphasis in the original)
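We can see d-separation in the simulation itself. This base-R sketch uses the same generative settings as plant_sim(), but with a larger n for stable estimates: the marginal association between treatment and \(h_1\) is clearly nonzero, but within each fungus stratum it vanishes.

```r
set.seed(6)
n <- 1e4
h0        <- rnorm(n, mean = 10, sd = 2)
treatment <- sample(0:1, size = n, replace = TRUE)
fungus    <- rbinom(n, size = 1, prob = c(0.5, 0.1)[1 + treatment])
h1        <- h0 + rnorm(n, mean = 5 - 3 * fungus, sd = 1)

cor(treatment, h1)                                   # marginal: clearly nonzero
c(cor(treatment[fungus == 0], h1[fungus == 0]),      # conditional on F = 0
  cor(treatment[fungus == 1], h1[fungus == 1]))      # conditional on F = 1: both near 0
```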

Note that our ggdag object, dag, will also work with the dagitty::dseparated() function.

library(dagitty)

dag %>% 
  dseparated("T", "H1")
## [1] FALSE
dag %>% 
  dseparated("T", "H1", "F")   
## [1] TRUE

The descriptively named dagitty::impliedConditionalIndependencies() function will work, too.

impliedConditionalIndependencies(dag)
## F _||_ H0
## H0 _||_ T
## H1 _||_ T | F

A different DAG

Now consider a DAG of a different kind of causal structure.

# define our coordinates
dag_coords <-
  tibble(name = c("H0", "H1", "M", "F", "T"),
         x    = c(1, 2, 2.5, 3, 4),
         y    = c(2, 2, 1, 2, 2))

# save our DAG
dag2 <-
  dagify(F ~ M + T,
         H1 ~ H0 + M,
         coords = dag_coords)
dag2 %>% 
  gg_dag(highlight = "M", size = 10)

Based on McElreath’s R code 6.20, here we simulate some data based on the new DAG.

plant_sim2 <- 
  function(
    n = 100, 
    seed = 71, 
    mean0 = 10,
    sd0 = 2,
    prob_m = 0.5
  ) {
    
    set.seed(seed)
    tibble(plant     = 1:n,
           h0        = rnorm(n, mean = mean0, sd = sd0) %>% round(1),
           treatment = rep(0:1, each = n / 2),
           m         = rbinom(n, size = 1, prob = prob_m),
           fungus    = rbinom(n, size = 1, prob = .5 - treatment * 0.4 + 0.4 * m),
           h1        = h0 + rnorm(n, 5 + 3 * m) %>% round(1))
  }

Plants2 <- 
  plant_sim2()

glimpse(Plants2)
## Rows: 100
## Columns: 6
## $ plant     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ h0        <dbl> 9.1, 9.1, 9.0, 10.8, 9.2, 7.6, 7.9, 12.0, 7.8, 12.5, 10.6, 1…
## $ treatment <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ m         <int> 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, …
## $ fungus    <int> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, …
## $ h1        <dbl> 14.7, 12.9, 14.5, 17.5, 17.9, 11.0, 16.5, 15.2, 18.0, 16.6, …
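A quick base-R check of what the new DAG implies (using plant_sim2()'s settings with a larger n): moisture M drives both fungus and growth, so fungus is positively associated with growth even though treatment has no causal effect on growth at all.

```r
set.seed(6)
n <- 1e4
h0        <- rnorm(n, mean = 10, sd = 2)
treatment <- rep(0:1, each = n / 2)
m         <- rbinom(n, size = 1, prob = 0.5)
fungus    <- rbinom(n, size = 1, prob = 0.5 - treatment * 0.4 + 0.4 * m)
h1        <- h0 + rnorm(n, mean = 5 + 3 * m)

cor(h1 - h0, fungus)      # positive: moisture raises both fungus and growth
cor(h1 - h0, treatment)   # near zero: treatment has no effect on growth
```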

Refitting without recompiling

s6.7b <-
  stan(fit = stanfit(u6.7), 
       refresh = 0,  # don't show progress
       data = Plants2, seed = 6, file = "fits/s6.7b")

s6.8b <-
  stan(fit = stanfit(u6.8), 
       refresh = 0,  # don't show progress
       data = Plants2, seed = 6, file = "fits/s6.8b")

Check the results.

precis(s6.7b)
##             mean         sd        5.5%      94.5%    n_eff     Rhat4
## a     1.55063412 0.03922663  1.48915528 1.61319144 1755.689 0.9996112
## bt    0.01635947 0.04531657 -0.05523696 0.08825102 1881.729 1.0000025
## bf    0.16871736 0.04523753  0.09772354 0.24203731 2101.876 1.0007290
## sigma 2.21756447 0.16173121  1.98445551 2.49293046 2462.666 1.0010589
precis(s6.8b)
##              mean         sd       5.5%      94.5%    n_eff    Rhat4
## a      1.63974979 0.03299560  1.5879065 1.69249943 2366.552 1.000585
## bt    -0.02972402 0.04538268 -0.1007802 0.04294617 2311.526 0.999913
## sigma  2.33165449 0.16630195  2.0791447 2.61343838 2405.022 1.000814
s6.7b %>%
  mcmc_areas(pars = vars(-lp__))

s6.8b %>%
  mcmc_areas(pars = vars(-lp__))

Including fungus again confounds inference about the treatment, this time by making it seem like it helped the plants, even though it had no effect (p. 175).

Rethinking: Model selection doesn’t help.

In the next chapter, you’ll learn about model selection using information criteria. Like other model comparison and selection schemes, these criteria help in contrasting and choosing model structure. But such approaches are no help in the example presented just above, since the model that includes fungus both fits the sample better and would make better out-of-sample predictions. Model [b6.7] misleads because it asks the wrong question, not because it would make poor predictions. As argued in Chapter 1, prediction and causal inference are just not the same task. No statistical procedure can substitute for scientific knowledge and attention to it. We need multiple models because they help us understand causal paths, not just so we can choose one or another for prediction. (p. 175)