SAT coaching at schools

This example can be found in several places, including this blog post entitled “Everything I need to know about Bayesian Statistics, I learned in eight Schools.”

Here is a summary of the study from Gelman et al. (1997), Sec. 5.5:

"A study was performed for the Educational Testing Service to analyze the effects of special coaching programs for SAT-V (Scholastic Aptitude Test-Verbal) in each of eight high schools. The outcome variable in each study was the score on a special administration of the SAT-V, a standardized multiple choice test administered by the Educational Testing Service and used to help colleges make admissions decisions; the scores can vary between 200 and 800, with mean about 500 and standard deviation about 100. The SAT examinations are designed to be resistant to short-term efforts directed specifically toward improving performance on the test; instead they are designed to reflect knowledge acquired and abilities developed over many years of education. Nevertheless, each of the eight schools in this study considered its short-term coaching program to be very successful at increasing SAT scores. Also, there was no prior reason to believe that any of the eight programs was more effective than any other or that some were more similar in effect to each other than to any other."

And here are the data – summarised at the school level.

school   improvement   stderr
A              28.39     14.9
B               7.94     10.2
C              -2.75     16.3
D               6.82     11
E              -0.64      9.4
F               0.63     11.4
G              18.01     10.4
H              12.16     17.6

Two questions

  • Does the coaching help?

    Are scores reliably higher – on average – after the coaching?

  • Is there a school effect?

    Do some schools do a better job of providing the coaching than others? Or is it that the students differ, and the students at some schools are helped more by the coaching?

Loading the Data

Let’s load the data into R and create an index variable for the school.

# tribble() and mutate() come from the tidyverse (tibble / dplyr)
Schools <- tribble(
  ~school, ~improvement, ~stderr,
  "A", 28.39, 14.9,
  "B",  7.94, 10.2,
  "C", -2.75, 16.3,
  "D",  6.82, 11.0,
  "E", -0.64,  9.4,
  "F",  0.63, 11.4,
  "G", 18.01, 10.4,
  "H", 12.16, 17.6
)

Schools <-
  Schools %>%
  mutate(
    school_idx = as.integer(factor(school))   # integer index (1-8) for each school
  )
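
Before fitting anything, it helps to see how large the standard errors are relative to the differences among the schools. Here is a quick sketch (not part of the original analysis) using gf_pointrange(); lo and hi are names introduced just for this plot.

# Sketch: observed improvements with rough +/- 2 standard error bars.
Schools %>%
  mutate(lo = improvement - 2 * stderr, hi = improvement + 2 * stderr) %>%
  gf_pointrange(improvement + lo + hi ~ school) %>%
  gf_labs(y = "improvement (SAT-V points)")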

Goldilocks and the 3 estimates for school A

  • no pooling: School A improves by ~ 28 points

  • complete pooling: School A is like any other school, so improvement is ~ 8 points

    mean( ~improvement, data = Schools)
    ## [1] 8.82

    We should take the uncertainties into account when computing this mean, but a crude estimate suffices for the moment (an uncertainty-weighted version is sketched just after this list).

  • partial pooling: School A is similar to the other schools, but not identical to them, so the estimated improvement should be somewhere between 8 and 28 points.
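
Here is the uncertainty-weighted version mentioned in the complete pooling item above: a sketch of an inverse-variance-weighted mean (w is a hypothetical helper column, not part of the original code).

# Sketch: weight each school by 1 / stderr^2 so noisier estimates count for less.
Schools %>%
  mutate(w = 1 / stderr^2) %>%
  summarise(weighted_mean = sum(w * improvement) / sum(w))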

We’d also like to know something about the uncertainty of each of these estimates. We have the standard error for school A to help with the first estimate. If we fit models to make these estimates, then we get the uncertainties from our models.

A model with measurement error

One way to think of these data is in terms of measurement error. We don’t know the mean improvement for all students at each school; we only know the mean for the students in the study. The standard error tells us how reliable that estimate is (based on the sample size and the student-to-student variability).

So we can think of the improvement score as being sampled from some distribution, centered at the true improvement for a school, but with some uncertainty quantified by the standard error.

\[ \mbox{observed improvement} \sim {\sf Dist}(\mbox{true improvement}, \mbox{standard error}) \]

Let’s try modeling this with a normal distribution centered at the true improvement with standard deviation given by the standard error. (This is a pretty traditional model for this sort of uncertainty, but it isn’t the only way we could do it.)

u8schools <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),           # observed improvement
      mu <- b[school_idx],                       # each school has a true improvement
      b[school_idx] ~ dnorm(mu_all, sigma_all),  # distribution over schools
      mu_all ~ dnorm(0, 10),                     # prior for average improvement over schools
      sigma_all ~ dexp(0.1)                      # prior for variability from school to school
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools", log_lik = TRUE
  )

What does the model say?

precis(u8schools, depth = 2) %>% pander()
parameter     mean     sd      5.5%       94.5%   n_eff   Rhat4
b[1]         8.702   6.316   -0.004693    19.69   505.8   1.002
b[2]         6.684   5.142   -1.369       14.8    652.9   1.004
b[3]         5.648   6.057   -4.421       14.49   616.7   1.006
b[4]         6.51    5.304   -1.808       14.75   734.3   1.006
b[5]         5.183   5.441   -4.192       13.12   433.6   1.007
b[6]         5.622   5.611   -4.07        14.05   548.7   1.008
b[7]         8.353   5.534    0.1978      17.48   423.3   1.006
b[8]         6.974   5.934   -1.855       16.73   667.2   1.008
mu_all       6.467   4.005   -0.04489     12.58   351.2   1.011
sigma_all    4.259   3.687    0.5898      11.14   169     1.01

u8schools %>% stanfit() %>% mcmc_areas(pars = vars(-lp__))

u8schools %>% stanfit() %>% 
  mcmc_intervals_data(prob_outer = 0.95, prob = 0.50) %>% 
  select(-matches("width")) %>%
  filter(! grepl("log_lik|lp", parameter)) %>%
  pander()
parameter   point_est       ll        l        m        h       hh
b[1]        median      -1.901    4.682    8.112   11.81    23.95
b[2]        median      -3.414    3.399    6.692    9.782   17.02
b[3]        median      -8.423    2.301    6.047    9.48    16.85
b[4]        median      -4.342    3.365    6.547    9.771   16.69
b[5]        median      -7.457    2.146    5.691    8.74    14.72
b[6]        median      -7.163    2.473    6.117    9.286   15.95
b[7]        median      -1.34     4.858    7.932   11.35    20.93
b[8]        median      -4.582    3.516    6.822    9.998   20
mu_all      median      -1.72     3.926    6.576    9.092   14.32
sigma_all   median       0.4905   1.516    3.189    5.925   14.07

(Here m is the posterior median, l and h bound the central 50% interval, and ll and hh bound the central 95% interval.)
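
Notice how much closer together the posterior means are than the raw improvements – this is the shrinkage that partial pooling produces. A quick sketch that puts the two side by side (post_mean is a hypothetical name introduced here):

# Sketch: raw school-level improvements vs. partially pooled posterior means.
post <- extract.samples(u8schools)
Schools %>%
  mutate(post_mean = apply(post$b, 2, mean)) %>%
  select(school, improvement, post_mean)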

Sensitivity Analysis

One way to see how much your choice of prior matters is to refit the model with a different prior and check how much the posterior changes. Let’s fit a model with less concentrated priors.

u8schools2 <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),
      mu <- b[school_idx],
      b[school_idx] ~ dnorm(mu_all, sigma_all),
      mu_all ~ dnorm(0, 40),     # wider prior on the overall mean
      sigma_all ~ dexp(0.025)    # flatter prior on school-to-school variability
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools2", log_lik = TRUE
  )
coeftab(u8schools, u8schools2) %>% plot()

precis(u8schools2, depth = 2) %>% pander()
parameter     mean     sd      5.5%       94.5%   n_eff   Rhat4
b[1]        11.58    7.821    0.7382      25.34   1030    1.004
b[2]         8.102   6.577   -2.412       18.39   1205    1.002
b[3]         6.5     7.507   -6.419       17.33   1059    1.003
b[4]         7.885   6.708   -2.786       18.31   1235    1.002
b[5]         5.583   6.396   -5.237       15.29   764.4   1.006
b[6]         6.39    6.946   -5.236       16.65   979.1   1.003
b[7]        10.82    6.661    1.067       22.4    949.9   1.004
b[8]         8.692   7.819   -3.373       21.14   1504    1.003
mu_all       8.147   5.085   -0.02195     16.16   715.3   1.008
sigma_all    6.489   4.954    1.294       15.59   433     1.004

u8schools2 %>% stanfit() %>% mcmc_areas(pars = vars(-lp__), prob = 0.9)

u8schools2 %>% stanfit() %>% 
  mcmc_intervals_data(prob_outer = 0.95, prob = 0.50) %>%
  select(-matches("width")) %>%
  filter(! grepl("log_lik|lp", parameter)) %>%
  pander()
parameter   point_est       ll        l        m        h       hh
b[1]        median      -1.708    6.413   10.7     15.66    30.54
b[2]        median      -5.229    4.015    8.18    12.31    21.24
b[3]        median     -10.87     2.481    7.152   11.14    20.24
b[4]        median      -6.093    3.786    8.021   12.05    20.94
b[5]        median      -8.201    1.625    5.973    9.925   17.18
b[6]        median      -8.641    2.399    6.974   10.8     19.28
b[7]        median      -1.234    6.36    10.27    14.84    25.26
b[8]        median      -7.266    4.148    8.565   13.23    24.88
mu_all      median      -2.223    4.923    8.213   11.41    18.3
sigma_all   median       0.9852   2.863    5.212    8.704   19.04

Areas <-
  bind_rows(
    u8schools %>% stanfit() %>% mcmc_areas_data(pars = vars(-lp__)) %>%
      mutate(model = "first model"),
    u8schools2 %>% stanfit() %>% mcmc_areas_data(pars = vars(-lp__)) %>%
      mutate(model = "second model")
  ) %>%
  filter(! grepl("log_lik|lp", parameter)) 

head(Areas, 2)
## # A tibble: 2 x 8
##   parameter interval interval_width     x density scaled_density
##   <fct>     <chr>             <dbl> <dbl>   <dbl>          <dbl>
## 1 b[1]      inner               0.5  4.68  0.0611          0.792
## 2 b[1]      inner               0.5  4.69  0.0612          0.792
## # … with 2 more variables: plotting_density <dbl>, model <chr>
Areas %>%
  gf_line(density ~ x, color = ~ model, size = 0.8, alpha = 0.8) %>%
  gf_facet_grid(parameter ~ .) 

# posterior samples of the difference between schools A and C (b[1] - b[3]) in each model
P1 <- u8schools %>% extract.samples() %>% as.data.frame() %>%
  mutate(bdiff = b.1 - b.3, model = "first model")
P2 <- u8schools2 %>% extract.samples() %>% as.data.frame() %>%
  mutate(bdiff = b.1 - b.3, model = "second model")

gf_density( ~ bdiff, fill = ~ model, data = P1) %>%
  gf_vline( xintercept = ~ mean(bdiff), color = ~ model, data = P1) %>%
  gf_density( ~ bdiff, fill = ~ model, data = P2) %>%
  gf_vline( xintercept = ~ mean(bdiff), color = ~ model, data = P2) %>%
  gf_labs(fill = "model", color = "model") %>%
  gf_facet_grid(model ~ .)

Which prior should we use?

  1. Qualitatively, there is not much difference (and that’s a good thing).

  2. One could argue that we are over-regularizing with the first model. Let’s see what WAIC thinks about the quality of our predictions.

    compare(u8schools, u8schools2) 
    ##                WAIC       SE     dWAIC       dSE     pWAIC    weight
    ## u8schools  61.24938 2.241604 0.0000000        NA 0.8780687 0.5879162
    ## u8schools2 61.96009 1.724926 0.7107158 0.6982955 1.2578931 0.4120838
    compare(u8schools, u8schools2) %>% plot()

These are quite similar, but our first model, the one with the more concentrated priors, actually fares just a touch better (in terms of estimated out-of-sample predictions).

There isn’t strong evidence to prefer one model over the other.
And that’s probably what we want here.
If these models were different in some substantial way, then we would have to think harder about which prior is the correct one to use. But since this isn’t our area of expertise, that might be hard for us to do.

The point here is that getting the order of magnitude right is often good enough when selecting priors – at least in terms of the qualitative answers we get from our models. Also, small changes in posterior means or medians (small relative to the variability) should not distract us.

From the blog post:

One can play around with a few different prior distributions for the standard deviation of true effects and find that it doesn’t really matter, any reasonable choice that doesn’t force the standard deviation to be very small or very large turns out to give about the same statistical distribution. School A ends up with an estimated true effect of 10 points, with about 50% of the probability between 7 and 16 points.
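
We can check this against our own fits. Here is a quick sketch that pulls the median and central 50% interval for school A’s true effect from each of our two models:

# Sketch: 25th, 50th, and 75th percentiles for school A under each prior.
post1 <- extract.samples(u8schools)
post2 <- extract.samples(u8schools2)
quantile(post1$b[, 1], c(0.25, 0.50, 0.75))
quantile(post2$b[, 1], c(0.25, 0.50, 0.75))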

Measurement error in other models

Including measurement error is largely independent of the type of pooling we include in the model. Here are complete pooling and no pooling versions of our 8 schools model (the last two fits both use no pooling and differ only in the width of the prior on the school effects).

# complete pooling
u8schools3 <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),
      mu ~ dnorm(0, 20)
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools3", log_lik = TRUE
  )

# no pooling
u8schools4 <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),
      mu <- b[school_idx],
      b[school_idx] ~ dnorm(0, 20)
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools4", log_lik = TRUE
  )

# no pooling, wider prior
u8schools5 <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),
      mu <- b[school_idx],
      b[school_idx] ~ dnorm(0, 40)
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools5", log_lik = TRUE
  )

Note that this allows us to estimate (with uncertainty) the effect for school A (or any other school, or the difference between two schools), as sketched below. Of course, in the complete pooling model, all the schools are the same, and the differences are 0.
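
For instance, here is a sketch (not part of the original code) of the posterior difference between schools A and C under the no pooling model u8schools4:

# Sketch: posterior of the A - C difference in the no pooling model.
post4 <- extract.samples(u8schools4)
quantile(post4$b[, 1] - post4$b[, 3], c(0.055, 0.5, 0.945))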

precis(u8schools3)
##        mean       sd     5.5%    94.5%    n_eff    Rhat4
## mu 7.556394 4.065702 1.113199 13.96752 1329.224 1.001466
precis(u8schools4, depth = 2)
##            mean        sd       5.5%    94.5%    n_eff     Rhat4
## b[1] 18.2865915 12.446227  -1.904896 38.38266 5705.538 0.9995742
## b[2]  6.3255030  9.267125  -8.518043 21.05874 4979.516 0.9994022
## b[3] -1.6997436 12.432682 -21.761708 17.67470 5898.058 0.9995118
## b[4]  5.0682235  9.764103 -10.873938 20.53662 4109.800 0.9999537
## b[5] -0.3545132  8.574015 -13.974949 13.63637 5494.002 0.9994695
## b[6]  0.5466503  9.812642 -14.974969 16.44502 5028.536 1.0001294
## b[7] 14.0953876  9.249342  -0.809342 28.64410 4724.927 1.0007872
## b[8]  7.0081804 13.239239 -13.823474 28.79059 4942.320 0.9997140
precis(u8schools5, depth = 2)
##            mean        sd        5.5%    94.5%    n_eff     Rhat4
## b[1] 25.0473247 13.617041   3.2200260 46.81730 7045.893 0.9998858
## b[2]  7.2705522  9.984556  -8.6329755 23.33464 5622.836 0.9999941
## b[3] -2.1537682 15.420761 -26.5947207 22.86916 6281.495 0.9993766
## b[4]  6.4498125 10.519864 -10.0365753 23.01685 6884.271 0.9991850
## b[5] -0.6802965  9.193157 -15.0230689 14.26606 6492.049 0.9995829
## b[6]  0.6758210 10.642925 -16.3105821 17.40443 5918.376 0.9993260
## b[7] 16.6929190 10.003549   0.6073231 32.55180 5969.454 0.9994803
## b[8] 10.3439306 16.117343 -15.2647584 35.47851 6050.848 0.9993484

Another Example

See Rethinking 15.1 for a similar example using our Waffle House data.

Measurement Errors and DAGs

We can represent measurement error in a DAG by distinguishing the true quantities from their observed versions. Using the divorce example from Rethinking 15.1 (A = median age at marriage, M = marriage rate, D = divorce rate), the observed values Dobs and Mobs depend on the true values and on measurement errors e_D and e_M; in the second DAG below, a common influence E affects both errors.

dag2 <-
  dagify(
  M ~ A,
  D ~ M + A,
  Dobs ~ D + e_D,
  Mobs ~ M + e_M,
  coords = coords)

gg_dag(dag2, highlight = c("D", "M", "e_D", "e_M"), size = 15)

dag3 <-
  dagify(
  M ~ A,
  D ~ M + A,
  Dobs ~ D + e_D + E,
  Mobs ~ M + e_M + E,
  coords = coords)

gg_dag(dag3, highlight = c("D", "M", "E", "e_D", "e_M"), size = 15)
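
The first of these DAGs corresponds to the measurement error model in Rethinking 15.1, where the divorce rate is observed with a known standard error. Here is a sketch of that model, following the book’s code (WaffleDivorce, standardize(), and the parameter names come from rethinking and the text, not from this section):

# Sketch (Rethinking 15.1): the observed divorce rate D_obs is modeled as a
# noisy measurement of a latent true divorce rate D_true.
data(WaffleDivorce)
d <- WaffleDivorce

dlist <- list(
  D_obs = standardize(d$Divorce),
  D_sd  = d$Divorce.SE / sd(d$Divorce),
  M     = standardize(d$Marriage),
  A     = standardize(d$MedianAgeMarriage),
  N     = nrow(d)
)

m15.1 <-
  ulam(
    alist(
      D_obs ~ dnorm(D_true, D_sd),           # measurement model for D
      vector[N]:D_true ~ dnorm(mu, sigma),   # regression for the true divorce rate
      mu <- a + bA * A + bM * M,
      a ~ dnorm(0, 0.2),
      bA ~ dnorm(0, 0.5),
      bM ~ dnorm(0, 0.5),
      sigma ~ dexp(1)
    ),
    data = dlist, chains = 4, cores = 4
  )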