This example can be found in several locations, including in this blog post entitled “Everything I need to know about Bayesian Statistics, I learned in eight Schools.”
Here is a summary of the study from Gelman et al. (1997), Sec. 5.5:
A study was performed for the Educational Testing Service to analyze the effects of special coaching programs for SAT-V (Scholastic Aptitude Test-Verbal) in each of eight high schools. The outcome variable in each study was the score on a special administration of the SAT-V, a standardized multiple choice test administered by the Educational Testing Service and used to help colleges make admissions decisions; the scores can vary between 200 and 800, with mean about 500 and standard deviation about 100. The SAT examinations are designed to be resistant to short-term efforts directed specifically toward improving performance on the test; instead they are designed to reflect knowledge acquired and abilities developed over many years of education. Nevertheless, each of the eight schools in this study considered its short-term coaching program to be very successful at increasing SAT scores. Also, there was no prior reason to believe that any of the eight programs was more effective than any other or that some were more similar in effect to each other than to any other.
And here are the data – summarised at the school level.
school | improvement | stderr |
---|---|---|
A | 28.39 | 14.9 |
B | 7.94 | 10.2 |
C | -2.75 | 16.3 |
D | 6.82 | 11 |
E | -0.64 | 9.4 |
F | 0.63 | 11.4 |
G | 18.01 | 10.4 |
H | 12.16 | 17.6 |
Does the coaching help?
Are scores reliably higher – on average – after the coaching?
Is there a school effect?
Do some schools do a better job of providing the coaching than others? Or is it perhaps that the students differ, and the students at some schools are helped more by the coaching than students at others?
Let’s load the data into R and create an index variable for the school.
Schools <- tribble(
  ~school, ~improvement, ~stderr,
  "A", 28.39, 14.9,
  "B", 7.94, 10.2,
  "C", -2.75, 16.3,
  "D", 6.82, 11.0,
  "E", -0.64, 9.4,
  "F", 0.63, 11.4,
  "G", 18.01, 10.4,
  "H", 12.16, 17.6
)
Schools <-
  Schools %>%
  mutate(
    school_idx = as.integer(factor(school))
  )
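Before fitting anything, a quick plot helps show how much the school-level estimates overlap once we account for their standard errors. This is just a sketch for orientation; it was not part of the original analysis.
# a quick look at the data: estimates with +/- 2 standard error bars
Schools %>%
  mutate(lo = improvement - 2 * stderr, hi = improvement + 2 * stderr) %>%
  gf_pointrange(improvement + lo + hi ~ school) %>%
  gf_labs(y = "estimated improvement (SAT-V points)")
The individual intervals are wide and overlap quite a bit, which is part of why the amount of pooling matters here.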
no pooling: School A improves by ~ 28 points
complete pooling: School A is like any other school, so improvement is ~ 8 points
mean( ~improvement, data = Schools)
## [1] 8.82
We should take the uncertainties into account when computing this mean (a precision-weighted version is sketched just after the next item), but a crude estimate suffices for the moment.
partial pooling: School A is similar to other schools, but not completely so; the estimated improvement should be somewhere between 8 and 28 points.
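As an aside, here is a sketch (not part of the original analysis) of a precision-weighted mean that does take the differing standard errors into account; it weights each school by the inverse of its squared standard error.
Schools %>%
  summarise(
    crude_mean = mean(improvement),
    weighted_mean = sum(improvement / stderr^2) / sum(1 / stderr^2)
  )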
We’d also like to know something about the uncertainty of each of these estimates. We have the standard error for school A to help with the first estimate. If we fit models to make these estimates, then we get the uncertainties from our models.
One way to think of these data is in terms of measurement error. We don’t know the mean improvement for all students at each school; we only know the mean for the students in the study. The standard error tells us how reliable that estimate is (based on the sample size and the student-to-student variability).
So we can think of the improvement score as being sampled from some distribution, centered at the true improvement for a school, but with some uncertainty quantified by the standard error.
\[ \mbox{observed improvement} \sim {\sf Dist}(\mbox{true improvement}, \mbox{standard error}) \]
Let’s try modeling this with a normal distribution centered at the true improvement with standard deviation given by the standard error. (This is a pretty traditional model for this sort of uncertainty, but it isn’t the only way we could do it.)
u8schools <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),            # observed improvement, with known measurement error
      mu <- b[school_idx],                        # each school has a true improvement
      b[school_idx] ~ dnorm(mu_all, sigma_all),   # distribution over schools
      mu_all ~ dnorm(0, 10),                      # prior for average improvement over schools
      sigma_all ~ dexp(0.1)                       # prior for variability from school to school
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools", log_lik = TRUE
  )
precis(u8schools, depth = 2) %>% pander()
parameter | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
b[1] | 8.702 | 6.316 | -0.004693 | 19.69 | 505.8 | 1.002 |
b[2] | 6.684 | 5.142 | -1.369 | 14.8 | 652.9 | 1.004 |
b[3] | 5.648 | 6.057 | -4.421 | 14.49 | 616.7 | 1.006 |
b[4] | 6.51 | 5.304 | -1.808 | 14.75 | 734.3 | 1.006 |
b[5] | 5.183 | 5.441 | -4.192 | 13.12 | 433.6 | 1.007 |
b[6] | 5.622 | 5.611 | -4.07 | 14.05 | 548.7 | 1.008 |
b[7] | 8.353 | 5.534 | 0.1978 | 17.48 | 423.3 | 1.006 |
b[8] | 6.974 | 5.934 | -1.855 | 16.73 | 667.2 | 1.008 |
mu_all | 6.467 | 4.005 | -0.04489 | 12.58 | 351.2 | 1.011 |
sigma_all | 4.259 | 3.687 | 0.5898 | 11.14 | 169 | 1.01 |
u8schools %>% stanfit() %>% mcmc_areas(pars = vars(-lp__))
u8schools %>% stanfit() %>%
  mcmc_intervals_data(prob_outer = 0.95, prob = 0.50) %>%
  select(-matches("width")) %>%
  filter(! grepl("log_lik|lp", parameter)) %>%
  pander()
parameter | point_est | ll | l | m | h | hh |
---|---|---|---|---|---|---|
b[1] | median | -1.901 | 4.682 | 8.112 | 11.81 | 23.95 |
b[2] | median | -3.414 | 3.399 | 6.692 | 9.782 | 17.02 |
b[3] | median | -8.423 | 2.301 | 6.047 | 9.48 | 16.85 |
b[4] | median | -4.342 | 3.365 | 6.547 | 9.771 | 16.69 |
b[5] | median | -7.457 | 2.146 | 5.691 | 8.74 | 14.72 |
b[6] | median | -7.163 | 2.473 | 6.117 | 9.286 | 15.95 |
b[7] | median | -1.34 | 4.858 | 7.932 | 11.35 | 20.93 |
b[8] | median | -4.582 | 3.516 | 6.822 | 9.998 | 20 |
mu_all | median | -1.72 | 3.926 | 6.576 | 9.092 | 14.32 |
sigma_all | median | 0.4905 | 1.516 | 3.189 | 5.925 | 14.07 |
One way to see how much your choice of prior matters is to refit the model with a different prior and see how much the posterior changes. Let’s fit a model with less concentrated priors.
u8schools2 <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),
      mu <- b[school_idx],
      b[school_idx] ~ dnorm(mu_all, sigma_all),
      mu_all ~ dnorm(0, 40),          # flatter prior for the average improvement
      sigma_all ~ dexp(0.025)         # flatter prior for school-to-school variability
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools2", log_lik = TRUE
  )
coeftab(u8schools, u8schools2) %>% plot()
precis(u8schools2, depth = 2) %>% pander()
parameter | mean | sd | 5.5% | 94.5% | n_eff | Rhat4 |
---|---|---|---|---|---|---|
b[1] | 11.58 | 7.821 | 0.7382 | 25.34 | 1030 | 1.004 |
b[2] | 8.102 | 6.577 | -2.412 | 18.39 | 1205 | 1.002 |
b[3] | 6.5 | 7.507 | -6.419 | 17.33 | 1059 | 1.003 |
b[4] | 7.885 | 6.708 | -2.786 | 18.31 | 1235 | 1.002 |
b[5] | 5.583 | 6.396 | -5.237 | 15.29 | 764.4 | 1.006 |
b[6] | 6.39 | 6.946 | -5.236 | 16.65 | 979.1 | 1.003 |
b[7] | 10.82 | 6.661 | 1.067 | 22.4 | 949.9 | 1.004 |
b[8] | 8.692 | 7.819 | -3.373 | 21.14 | 1504 | 1.003 |
mu_all | 8.147 | 5.085 | -0.02195 | 16.16 | 715.3 | 1.008 |
sigma_all | 6.489 | 4.954 | 1.294 | 15.59 | 433 | 1.004 |
u8schools2 %>% stanfit() %>% mcmc_areas(pars = vars(-lp__), prob = 0.9)
u8schools2 %>% stanfit() %>%
  mcmc_intervals_data(prob_outer = 0.95, prob = 0.50) %>%
  select(-matches("width")) %>%
  filter(! grepl("log_lik|lp", parameter)) %>%
  pander()
parameter | point_est | ll | l | m | h | hh |
---|---|---|---|---|---|---|
b[1] | median | -1.708 | 6.413 | 10.7 | 15.66 | 30.54 |
b[2] | median | -5.229 | 4.015 | 8.18 | 12.31 | 21.24 |
b[3] | median | -10.87 | 2.481 | 7.152 | 11.14 | 20.24 |
b[4] | median | -6.093 | 3.786 | 8.021 | 12.05 | 20.94 |
b[5] | median | -8.201 | 1.625 | 5.973 | 9.925 | 17.18 |
b[6] | median | -8.641 | 2.399 | 6.974 | 10.8 | 19.28 |
b[7] | median | -1.234 | 6.36 | 10.27 | 14.84 | 25.26 |
b[8] | median | -7.266 | 4.148 | 8.565 | 13.23 | 24.88 |
mu_all | median | -2.223 | 4.923 | 8.213 | 11.41 | 18.3 |
sigma_all | median | 0.9852 | 2.863 | 5.212 | 8.704 | 19.04 |
Areas <-
  bind_rows(
    u8schools %>% stanfit() %>% mcmc_areas_data(pars = vars(-lp__)) %>%
      mutate(model = "first model"),
    u8schools2 %>% stanfit() %>% mcmc_areas_data(pars = vars(-lp__)) %>%
      mutate(model = "second model")
  ) %>%
  filter(! grepl("log_lik|lp", parameter))
head(Areas, 2)
## # A tibble: 2 x 8
## parameter interval interval_width x density scaled_density
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 b[1] inner 0.5 4.68 0.0611 0.792
## 2 b[1] inner 0.5 4.69 0.0612 0.792
## # … with 2 more variables: plotting_density <dbl>, model <chr>
Areas %>%
  gf_line(density ~ x, color = ~ model, size = 0.8, alpha = 0.8) %>%
  gf_facet_grid(parameter ~ .)
P1 <- u8schools %>% extract.samples() %>% as.data.frame() %>%
  mutate(bdiff = b.1 - b.3, model = "first model")
P2 <- u8schools2 %>% extract.samples() %>% as.data.frame() %>%
  mutate(bdiff = b.1 - b.3, model = "second model")
gf_density( ~ bdiff, fill = ~ model, data = P1) %>%
  gf_vline(xintercept = ~ mean(bdiff), color = ~ model, data = P1) %>%
  gf_density( ~ bdiff, fill = ~ model, data = P2) %>%
  gf_vline(xintercept = ~ mean(bdiff), color = ~ model, data = P2) %>%
  gf_labs(fill = "model", color = "model") %>%
  gf_facet_grid(model ~ .)
Qualitatively, there is not much difference (and that’s a good thing).
One could argue that we are over-regularizing with the first model. Let’s see what WAIC thinks about the quality of our predictions.
compare(u8schools, u8schools2)
## WAIC SE dWAIC dSE pWAIC weight
## u8schools 61.24938 2.241604 0.0000000 NA 0.8780687 0.5879162
## u8schools2 61.96009 1.724926 0.7107158 0.6982955 1.2578931 0.4120838
compare(u8schools, u8schools2) %>% plot()
These are quite similar, but our first model, the one with the more concentrated priors, actually fares just a touch better (in terms of estimated out-of-sample predictions).
There isn’t strong evidence to prefer one model over the other.
And that’s probably what we want here.
If these models were different in some substantial way, then we would have to think harder about which prior is the correct one to use. But since this isn’t our area of expertise, that might be hard for us to do.
The point here is that getting the order of magnitude right is often good enough when selecting priors – at least in terms of the qualitative answers we get from our models. Also, small changes in posterior means or medians (small relative to the variability) should not distract us.
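To get a feel for what “order of magnitude” means for these priors, we can simulate the two priors we used for sigma_all. This is a quick sketch, not part of the original analysis.
# simulate draws from the two priors for sigma_all and summarise them
set.seed(8)
PriorSigma <- tibble(
  prior = rep(c("dexp(0.1)", "dexp(0.025)"), each = 10000),
  sigma_all = c(rexp(10000, rate = 0.1), rexp(10000, rate = 0.025))
)
PriorSigma %>%
  group_by(prior) %>%
  summarise(mean = mean(sigma_all), q95 = quantile(sigma_all, 0.95))
Both priors allow school-to-school standard deviations on the order of tens of points; they mainly differ in how much mass they place on very large values.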
From the blog post:
One can play around with a few different prior distributions for the standard deviation of true effects and find that it doesn’t really matter, any reasonable choice that doesn’t force the standard deviation to be very small or very large turns out to give about the same statistical distribution. School A ends up with an estimated true effect of 10 points, with about 50% of the probability between 7 and 16 points.
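We can check that claim against our own fits. Here is a minimal sketch using the second model; our numbers won’t match the blog exactly because the priors (and sampler runs) differ.
Post2 <- u8schools2 %>% extract.samples() %>% as.data.frame()
quantile(Post2$b.1, probs = c(0.25, 0.50, 0.75))   # median and central 50% interval for school A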
Including measurement error is largely independent of the type of pooling we include in the model. Here are complete pooling and no pooling versions of our 8 schools model (plus a second no pooling model with a flatter prior).
# complete pooling
u8schools3 <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),
      mu ~ dnorm(0, 20)    # one common improvement shared by all schools
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools3", log_lik = TRUE
  )
# no pooling
u8schools4 <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),
      mu <- b[school_idx],
      b[school_idx] ~ dnorm(0, 20)    # independent improvement for each school
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools4", log_lik = TRUE
  )
# no pooling, with a flatter prior
u8schools5 <-
  ulam(
    data = Schools %>% select(school_idx, improvement, stderr),
    alist(
      improvement ~ dnorm(mu, stderr),
      mu <- b[school_idx],
      b[school_idx] ~ dnorm(0, 40)
    ),
    iter = 2000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u8schools5", log_lik = TRUE
  )
Note that this allows us to estimate (with uncertainty) the effect for school A, or any other school, or a difference between two schools (a sketch of one such difference appears after the summaries below). Of course, in the complete pooling model, all the schools are the same, so the differences are 0.
precis(u8schools3)
## mean sd 5.5% 94.5% n_eff Rhat4
## mu 7.556394 4.065702 1.113199 13.96752 1329.224 1.001466
precis(u8schools4, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## b[1] 18.2865915 12.446227 -1.904896 38.38266 5705.538 0.9995742
## b[2] 6.3255030 9.267125 -8.518043 21.05874 4979.516 0.9994022
## b[3] -1.6997436 12.432682 -21.761708 17.67470 5898.058 0.9995118
## b[4] 5.0682235 9.764103 -10.873938 20.53662 4109.800 0.9999537
## b[5] -0.3545132 8.574015 -13.974949 13.63637 5494.002 0.9994695
## b[6] 0.5466503 9.812642 -14.974969 16.44502 5028.536 1.0001294
## b[7] 14.0953876 9.249342 -0.809342 28.64410 4724.927 1.0007872
## b[8] 7.0081804 13.239239 -13.823474 28.79059 4942.320 0.9997140
precis(u8schools5, depth = 2)
## mean sd 5.5% 94.5% n_eff Rhat4
## b[1] 25.0473247 13.617041 3.2200260 46.81730 7045.893 0.9998858
## b[2] 7.2705522 9.984556 -8.6329755 23.33464 5622.836 0.9999941
## b[3] -2.1537682 15.420761 -26.5947207 22.86916 6281.495 0.9993766
## b[4] 6.4498125 10.519864 -10.0365753 23.01685 6884.271 0.9991850
## b[5] -0.6802965 9.193157 -15.0230689 14.26606 6492.049 0.9995829
## b[6] 0.6758210 10.642925 -16.3105821 17.40443 5918.376 0.9993260
## b[7] 16.6929190 10.003549 0.6073231 32.55180 5969.454 0.9994803
## b[8] 10.3439306 16.117343 -15.2647584 35.47851 6050.848 0.9993484
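As promised, here is a sketch (mirroring the earlier bdiff comparison) of the posterior for the A – C difference under the no pooling model; it is an illustration, not part of the original analysis.
P4 <- u8schools4 %>% extract.samples() %>% as.data.frame() %>%
  mutate(bdiff = b.1 - b.3)
gf_density( ~ bdiff, data = P4) %>%
  gf_labs(x = "difference in improvement: school A - school C (no pooling)")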
See Rethinking 15.1 for a similar example using our Waffle House data.
# measurement error on both the divorce (Dobs) and marriage (Mobs) rates
dag2 <-
  dagify(
    M ~ A,
    D ~ M + A,
    Dobs ~ D + e_D,
    Mobs ~ M + e_M,
    coords = coords
  )
gg_dag(dag2, highlight = c("D", "M", "e_D", "e_M"), size = 15)
# now a common factor E contributes to both measurement errors
dag3 <-
  dagify(
    M ~ A,
    D ~ M + A,
    Dobs ~ D + e_D + E,
    Mobs ~ M + e_M + E,
    coords = coords
  )
gg_dag(dag3, highlight = c("D", "M", "E", "e_D", "e_M"), size = 15)