This example is loosely based on “Multilevel (Hierarchical) Modeling: What It Can and Cannot Do” by Andrew Gelman.
data("radon", package = "rstanarm")
# Stan doesn't allow a variable to be named floor:
Radon <- radon %>%
mutate(basement = 1 - floor) %>%
select(-floor)
Radon %>% slice_sample(n=5) %>% pander()
| county  | log_radon | log_uranium | basement |
|---------|-----------|-------------|----------|
| COOK    | 0.4055    | -0.505      | 1        |
| ANOKA   | 1.099     | -0.8473     | 1        |
| STLOUIS | 0.47      | -0.4747     | 1        |
| STLOUIS | 0.6419    | -0.4747     | 1        |
| DAKOTA  | -0.5108   | -0.02415    | 1        |
In that article, the model is described like this:

\[\begin{align*} y_i & \sim {\sf N}(\alpha_{j[i]} + \beta x_i, \sigma_y^2) \\ \alpha_j &\sim {\sf N}(\gamma_0 + \gamma_1 u_j, \sigma_\alpha^2) \end{align*}\]
We might write that more like this:
\[\begin{align*} y_i & \sim {\sf Norm}(\mu, \sigma_y) \\ \mu &= \alpha_{c_i} + \beta x_i \\ \alpha_{c_i} &\sim {\sf Norm}(\gamma_0 + \gamma_1 u_{c_i}, \sigma_\alpha) \end{align*}\]
Of course, we still need priors for our parameters: \(\sigma_y\), \(\sigma_\alpha\), \(\beta\), \(\gamma_0\), and \(\gamma_1\). (In the paper they use uniform priors, which probably isn’t the best idea; even the author now recommends against that.)
Let’s rewrite this some more:
\[\begin{align*} y_i & \sim {\sf Norm}(\mu, \sigma_y) \\ \mu &= \alpha_{c_i} + \beta x_i \\ \alpha_{c_i} &\sim \gamma_0 + \gamma_1 u_{c_i} + {\sf Norm}(0, \sigma_\alpha) \end{align*}\]
And one more time:
\[\begin{align*} y_i & \sim {\sf Norm}(\mu, \sigma_y) \\ \mu &= \gamma_0 + \gamma_1 u_{c_i} + \tilde{\alpha}_{c_i} + \beta x_i \\ \tilde{\alpha}_{c_i} &\sim {\sf Norm}(0, \sigma_\alpha) \end{align*}\]
## ulam()

We could write this part of our model for `ulam()` as something like this:
```r
alist(
  y ~ dnorm(mu, sigma),
  mu <- a0 + b * x + g * u + a[c],
  a[c] ~ dnorm(0, sigma_county),
  ...
)
```

```r
# or with the real variable names
alist(
  log_radon ~ dnorm(mu, sigma),
  mu <- a0 + b * basement + g * log_uranium + a[county],
  a[county] ~ dnorm(0, sigma_county),
  ...
)
```
assuming that `county` is an index variable, `basement` is an indicator variable, and `u` (`log_uranium`) has the same value for all homes in the same county – as it does in our data set: it’s the county average level.

(Alternatively, we could have kept the data in a list rather than a data frame. Then we would only need to record the value of `u` once for each county, but we would need to do some additional indexing; see the sketch below.)
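Here is a rough sketch of that list-based approach (not used below; `county_u` and `radon_list` are just illustrative names):

```r
county_u <- Radon %>%
  group_by(county) %>%
  summarise(u = first(log_uranium))    # one value of u per county

radon_list <- list(
  log_radon  = Radon$log_radon,
  basement   = Radon$basement,
  county_idx = as.numeric(factor(Radon$county)),
  u          = county_u$u              # length 85, not 919
)
# The extra indexing then shows up in the model formula:
#   mu <- a0 + b * basement + g * u[county_idx] + a[county_idx]
```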
And here’s the model fit with `ulam()`:
```r
u1 <-
  ulam(
    data = Radon %>%
      mutate(county_idx = as.numeric(factor(county))) %>%
      select(log_radon, county_idx, log_uranium, basement),
    alist(
      log_radon ~ dnorm(mu, sigma),
      mu <- a0 + b * basement + c * log_uranium + a[county_idx],
      vector[85]: a ~ dnorm(0, sigma_county),  # prior for a[county_idx]; 85 counties
      a0 ~ dnorm(0, 2),
      b ~ dnorm(0, 3),
      c ~ dnorm(0, 1),
      sigma ~ dexp(1),
      sigma_county ~ dexp(1)
    ),
    iter = 4000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u1-radon"
  )
precis(u1)
```

```
## 85 vector or matrix parameters hidden. Use depth=2 to show them.
##                   mean         sd       5.5%     94.5%     n_eff    Rhat4
## a0           0.8546605 0.06317660 0.75394679 0.9551476 4700.2412 1.001266
## b            0.6391118 0.06637379 0.53356667 0.7452533 4418.3509 1.000485
## c            0.6943569 0.08975047 0.55269916 0.8377057 3233.6100 1.001708
## sigma        0.7296999 0.01766289 0.70168664 0.7583951 3961.6272 1.000905
## sigma_county 0.1458507 0.05005031 0.06588151 0.2272323  219.3592 1.012524
```
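If we want to inspect the 85 county-level intercepts too, we can follow the hint in the output:

```r
precis(u1, depth = 2)  # also shows a[1] through a[85]
```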
Some tuning or reparameterization might help us avoid the warnings Stan gives for this model (notice the low `n_eff` for `sigma_county`); a non-centered version is sketched below. But then let’s move on to something new…
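Here is what a non-centered parameterization might look like for `ulam()` – a sketch only, not fit here. The idea is to sample standardized county effects `z` and scale them by `sigma_county`, which often samples better when many groups have only modest amounts of data:

```r
alist(
  log_radon ~ dnorm(mu, sigma),
  mu <- a0 + b * basement + c * log_uranium + z[county_idx] * sigma_county,
  vector[85]: z ~ dnorm(0, 1),  # standardized county effects
  a0 ~ dnorm(0, 2),
  b ~ dnorm(0, 3),
  c ~ dnorm(0, 1),
  sigma ~ dexp(1),
  sigma_county ~ dexp(1)
)
```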
Let’s fit this model again using brms – Bayesian Regression Modeling with Stan. This package provides a concise way to describe models like this and takes care of some bookkeeping for us (like creating index variables, dealing with factors, removing rows with missing values, etc.).
The model is described with a formula and a prior. The formula puts the response on the left side and the predictors on the right side:
```r
log_radon ~ 1 + basement + log_uranium
# alternatively
log_radon ~ basement + log_uranium  # the 1 + is assumed; use -1 or 0 to remove the intercept
```
This says that there is a coefficient/parameter for each of these variables:
```
log_radon = a * 1 + b * basement + c * log_uranium
```
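One way to see what the formula does: its right-hand side becomes a design matrix with one column per term. We can peek at that matrix with base R’s `model.matrix()` (`brm()` builds something equivalent internally):

```r
model.matrix(log_radon ~ 1 + basement + log_uranium, data = Radon) %>%
  head(3)
```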
This isn’t yet our model from above, but `brm()` is happy to fit it for us, so let’s give that a try:
```r
library(brms)
b1 <-
  brm(log_radon ~ 1 + basement + log_uranium, data = Radon,
      file = "fits/b1-radon")
```
Notice that there are several things we have not specified: the family (and link function), the priors, and the sampling settings (chains, iterations, warmup). brms has provided some default values for those. Let’s see if we can figure out what they are from the output below, and check that we understand what the parameters are.
```r
b1
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: log_radon ~ 1 + basement + log_uranium 
##    Data: Radon (Number of observations: 919) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       0.85      0.06     0.73     0.97 1.00     4466     2890
## basement        0.62      0.07     0.49     0.75 1.00     4208     3001
## log_uranium     0.75      0.07     0.62     0.89 1.00     4188     3078
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.74      0.02     0.71     0.78 1.00     4757     2777
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
- `Family: gaussian` – that’s a normal distribution
- `Links: mu = identity` – using the identity link
- `Population-Level Effects` – the 3 main parameters of our model
- `Family Specific Parameters` – here’s where we find the `sigma` for our normal distribution

bayesplot can work directly with brms objects:
```r
b1 %>% mcmc_areas(pars = vars(-lp__))
b1 %>% mcmc_pairs(pars = vars(-lp__))
```
But what about the priors?
```r
b1 %>% prior_summary()
```

```
##                  prior     class        coef group resp dpar nlpar bound       source
##                 (flat)         b                                              default
##                 (flat)         b    basement                             (vectorized)
##                 (flat)         b log_uranium                             (vectorized)
## student_t(3, 1.3, 2.5) Intercept                                              default
##   student_t(3, 0, 2.5)     sigma                                              default
```
Let’s change the priors:
```r
df_stats(~ log_radon + log_uranium, data = Radon)
```

```
##      response        min         Q1      median        Q3       max       mean        sd   n missing
## 1   log_radon -2.3025851  0.6931472  1.30833282 1.8082888 3.8774316  1.2647792 0.8193550 919       0
## 2 log_uranium -0.8818289 -0.4746737 -0.09652081 0.1832378 0.5280249 -0.1317092 0.3653242 919       0
```
```r
prior1 <-
  prior(normal(0, 3), class = "b", coef = "basement") +
  prior(normal(0, 1), class = "b", coef = "log_uranium")
prior1
```

```
##         prior class        coef group resp dpar nlpar bound source
##  normal(0, 3)     b    basement                               user
##  normal(0, 1)     b log_uranium                               user
```
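If we also wanted to replace the default priors on the intercept and `sigma`, we could add them in the same way. A sketch with illustrative values (`prior1_full` is just a made-up name):

```r
prior1_full <- prior1 +
  prior(normal(1, 2), class = "Intercept") +   # illustrative values
  prior(exponential(1), class = "sigma")
```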
```r
# same model as before, but with a different prior;
# update() saves some typing here, and when no recompile is needed,
# it can save execution time too
b2 <- update(b1, prior = prior1, file = "fits/b2-radon")
```

```
## The desired updates require recompiling the model
```
```r
b2
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: log_radon ~ 1 + basement + log_uranium 
##    Data: Radon (Number of observations: 919) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       0.85      0.06     0.72     0.97 1.00     4110     3115
## basement        0.62      0.07     0.49     0.75 1.00     4105     2954
## log_uranium     0.75      0.07     0.62     0.88 1.00     4153     3278
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.74      0.02     0.71     0.78 1.00     4628     3248
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
Now we add on a bit to tell R about the multi-level part. We want an intercept/adjustment for each county:
```r
ml_formula <- log_radon ~ 1 + basement + log_uranium + (1 | county)
```
The `(1 | county)` term is what asks for a separate intercept for each county.
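The same syntax extends naturally. For example, if we wanted the basement effect to vary by county as well, we could write the formula below; we’ll stick with varying intercepts only.

```r
# hypothetical extension, not fit here: county-specific intercepts AND slopes
log_radon ~ 1 + basement + log_uranium + (1 + basement | county)
```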
```r
get_prior(ml_formula, data = Radon)
```

```
##                  prior     class        coef  group resp dpar nlpar bound       source
##                 (flat)         b                                               default
##                 (flat)         b    basement                              (vectorized)
##                 (flat)         b log_uranium                              (vectorized)
## student_t(3, 1.3, 2.5) Intercept                                               default
##   student_t(3, 0, 2.5)        sd                                               default
##   student_t(3, 0, 2.5)        sd              county                      (vectorized)
##   student_t(3, 0, 2.5)        sd   Intercept  county                      (vectorized)
##   student_t(3, 0, 2.5)     sigma                                               default
```
We can live with those defaults for the multilevel part (class `sd`) for now, but we’ll reuse the same priors as in `prior1` for the population-level parameters.
```r
b3 <-
  brm(ml_formula, data = Radon,
      prior =
        prior(normal(0, 3), class = "b", coef = "basement") +
        prior(normal(0, 1), class = "b", coef = "log_uranium"),
      control = list(adapt_delta = 0.95),
      cores = 4, chains = 4,
      file = "fits/b3-radon")
```
```r
b3 %>% prior_summary()
```

```
##                  prior     class        coef  group resp dpar nlpar bound       source
##                 (flat)         b                                               default
##           normal(0, 3)         b    basement                                      user
##           normal(0, 1)         b log_uranium                                      user
## student_t(3, 1.3, 2.5) Intercept                                               default
##   student_t(3, 0, 2.5)        sd                                               default
##   student_t(3, 0, 2.5)        sd              county                      (vectorized)
##   student_t(3, 0, 2.5)        sd   Intercept  county                      (vectorized)
##   student_t(3, 0, 2.5)     sigma                                               default
```
```r
b3
```

```
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: log_radon ~ 1 + basement + log_uranium + (1 | county) 
##    Data: Radon (Number of observations: 919) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~county (Number of levels: 85) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.15      0.05     0.06     0.25 1.00     1016     1505
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       0.86      0.07     0.73     0.98 1.00     6366     2964
## basement        0.64      0.07     0.50     0.77 1.00     7346     2914
## log_uranium     0.69      0.09     0.51     0.87 1.00     3426     3174
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.73      0.02     0.70     0.76 1.00     5239     3196
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```
Now we have a `Group-Level Effects` section, which is new. Let’s look at something more familiar – posterior samples!
```r
b3 %>% posterior_samples() %>% class()
```

```
## [1] "data.frame"
```

```r
# 91 columns = 3 population-level effects + sigma + sd_county + 85 county effects + lp__
b3 %>% posterior_samples() %>% dim()
```

```
## [1] 4000   91
```
```r
# we don't need to see all the counties
b3 %>% posterior_summary() %>% head(15)
```

```
##                                   Estimate  Est.Error        Q2.5     Q97.5
## b_Intercept                    0.856738040 0.06518934  0.72809885 0.9826046
## b_basement                     0.637987229 0.06752767  0.50368498 0.7717756
## b_log_uranium                  0.691784676 0.08931374  0.51305657 0.8662259
## sd_county__Intercept           0.151060892 0.04779069  0.06275251 0.2482776
## sigma                          0.729380517 0.01744782  0.69654177 0.7649453
## r_county[AITKIN,Intercept]    -0.022035847 0.14170239 -0.31267185 0.2586196
## r_county[ANOKA,Intercept]      0.007649523 0.09846847 -0.19513776 0.2000748
## r_county[BECKER,Intercept]     0.012713290 0.14122572 -0.28019722 0.2990963
## r_county[BELTRAMI,Intercept]   0.107635334 0.14028245 -0.14510062 0.4051824
## r_county[BENTON,Intercept]     0.008644129 0.14513270 -0.28962391 0.2950277
## r_county[BIGSTONE,Intercept]  -0.029953961 0.14749329 -0.33889872 0.2529142
## r_county[BLUEEARTH,Intercept]  0.122622606 0.12451762 -0.09940510 0.3855848
## r_county[BROWN,Intercept]      0.042164195 0.14433075 -0.23048316 0.3448501
## r_county[CARLTON,Intercept]   -0.068473684 0.13112441 -0.34203917 0.1766944
## r_county[CARVER,Intercept]    -0.004774980 0.13645426 -0.27257347 0.2674546
```
Now we can see that we are indeed getting a parameter for each county. Phew! We can also see what all the parameters are named. The names are long, but very clear.
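For example, using the column names shown above, we can look at the posterior for a single county’s adjustment (note the backticks, needed because the name contains brackets):

```r
b3 %>%
  posterior_samples() %>%
  gf_density(~ `r_county[ANOKA,Intercept]`)
```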
The biggest advantage of `ulam()` (for learning) is that it is very clear and explicit. Every part of the model is right there for you to see – and you need to think about every part of the model. That’s why we’ve been using it.

But if you ever do more Bayesian modeling, you might prefer to use `brm()` because:

- It is more widely used, better documented, more thoroughly tested, etc.
- `ulam()` will get harder to use as models become more complex.
- The user base is larger, so you are likely to find help more easily.
- The syntax is similar to the syntax used by `lm()` and `lmer()` to fit non-Bayesian models, so you can fit the same sorts of models in Bayesian and frequentist ways without learning more syntax (see the example just below).
- There is a broader suite of additional tools for brms.
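To illustrate that point about shared syntax, here is what the (non-Bayesian) lme4 version of our multilevel model would look like – a sketch, assuming the lme4 package is installed:

```r
library(lme4)
lmer(log_radon ~ 1 + basement + log_uranium + (1 | county), data = Radon)
```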
methods(class = "brmsfit")
## [1] add_criterion add_ic as.array
## [4] as.data.frame as.matrix as.mcmc
## [7] autocor bayes_factor bayes_R2
## [10] bridge_sampler coef conditional_effects
## [13] conditional_smooths control_params cv_varsel
## [16] expose_functions family fitted_draws
## [19] fitted fixef formula
## [22] get_refmodel getCall hypothesis
## [25] kfold launch_shinystan log_lik
## [28] log_posterior logLik loo_compare
## [31] loo_linpred loo_model_weights loo_moment_match
## [34] loo_predict loo_predictive_interval loo_R2
## [37] loo_subsample loo LOO
## [40] marginal_effects marginal_smooths mcmc_plot
## [43] model_weights model.frame neff_ratio
## [46] ngrps nobs nsamples
## [49] nuts_params pairs parnames
## [52] plot post_prob posterior_average
## [55] posterior_epred posterior_interval posterior_linpred
## [58] posterior_predict posterior_samples posterior_smooths
## [61] posterior_summary posterior pp_average
## [64] pp_check pp_mixture predict
## [67] predicted_draws predictive_error predictive_interval
## [70] prepare_predictions print prior_samples
## [73] prior_summary ranef reloo
## [76] residual_draws residuals rhat
## [79] stancode standata stanfit
## [82] stanplot summary tidy_draws
## [85] update VarCorr varsel
## [88] vcov waic WAIC
## see '?methods' for accessing help and source code
Here are just a few examples of the things we can do with our brmsfit object.
```r
waic(b3)
```

```
## Warning:
## 7 (0.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.
## 
## Computed from 4000 by 919 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1026.9 28.8
## p_waic        26.3  2.3
## waic        2053.9 57.5
## 
## 7 (0.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.
```
```r
loo(b3)
```

```
## 
## Computed from 4000 by 919 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo  -1027.2 28.8
## p_loo        26.5  2.4
## looic      2054.3 57.6
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
```
```r
loo_compare(b2, b3)
```

```
##    elpd_diff se_diff
## b3  0.0       0.0   
## b2 -4.8       3.6
```

```r
# add_criterion() computes loo once and stores it in the fit object
# (and in the saved file), so later comparisons don't recompute it
b3 <- b3 %>% add_criterion("loo")
b2 <- b2 %>% add_criterion("loo")
loo_compare(b2, b3)
```

```
##    elpd_diff se_diff
## b3  0.0       0.0   
## b2 -4.8       3.6
```
```r
tidy_draws(b3) %>%
  gf_density(~ b_basement)
```

```r
mcmc_areas(b3, pars = c("b_basement", "b_log_uranium"))
```
```r
predict(b3) %>% str()
```

```
##  num [1:919, 1:4] 0.359 0.991 0.976 0.995 0.905 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "Estimate" "Est.Error" "Q2.5" "Q97.5"
```
```r
PRadon <- Radon %>% select(-log_radon) %>%
  group_by(county, log_uranium, basement) %>% distinct()
PRadon %>% head(4)
```

```
## # A tibble: 4 x 3
## # Groups:   county, log_uranium, basement [4]
##   county log_uranium basement
##   <fct>        <dbl>    <dbl>
## 1 AITKIN      -0.689        0
## 2 AITKIN      -0.689        1
## 3 ANOKA       -0.847        1
## 4 ANOKA       -0.847        0
```
```r
predict(b3, newdata = PRadon) %>%
  as.data.frame() %>%
  bind_cols(PRadon) %>%
  gf_pointrange(Estimate + Q2.5 + Q97.5 ~ log_uranium) %>%
  gf_facet_grid(basement ~ .)
```
```r
b3$model
```

```
## // generated with brms 2.14.4
## functions {
## }
## data {
##   int<lower=1> N;  // total number of observations
##   vector[N] Y;  // response variable
##   int<lower=1> K;  // number of population-level effects
##   matrix[N, K] X;  // population-level design matrix
##   // data for group-level effects of ID 1
##   int<lower=1> N_1;  // number of grouping levels
##   int<lower=1> M_1;  // number of coefficients per level
##   int<lower=1> J_1[N];  // grouping indicator per observation
##   // group-level predictor values
##   vector[N] Z_1_1;
##   int prior_only;  // should the likelihood be ignored?
## }
## transformed data {
##   int Kc = K - 1;
##   matrix[N, Kc] Xc;  // centered version of X without an intercept
##   vector[Kc] means_X;  // column means of X before centering
##   for (i in 2:K) {
##     means_X[i - 1] = mean(X[, i]);
##     Xc[, i - 1] = X[, i] - means_X[i - 1];
##   }
## }
## parameters {
##   vector[Kc] b;  // population-level effects
##   real Intercept;  // temporary intercept for centered predictors
##   real<lower=0> sigma;  // residual SD
##   vector<lower=0>[M_1] sd_1;  // group-level standard deviations
##   vector[N_1] z_1[M_1];  // standardized group-level effects
## }
## transformed parameters {
##   vector[N_1] r_1_1;  // actual group-level effects
##   r_1_1 = (sd_1[1] * (z_1[1]));
## }
## model {
##   // likelihood including all constants
##   if (!prior_only) {
##     // initialize linear predictor term
##     vector[N] mu = Intercept + rep_vector(0.0, N);
##     for (n in 1:N) {
##       // add more terms to the linear predictor
##       mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
##     }
##     target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
##   }
##   // priors including all constants
##   target += normal_lpdf(b[1] | 0, 3);
##   target += normal_lpdf(b[2] | 0, 1);
##   target += student_t_lpdf(Intercept | 3, 1.3, 2.5);
##   target += student_t_lpdf(sigma | 3, 0, 2.5)
##     - 1 * student_t_lccdf(0 | 3, 0, 2.5);
##   target += student_t_lpdf(sd_1 | 3, 0, 2.5)
##     - 1 * student_t_lccdf(0 | 3, 0, 2.5);
##   target += std_normal_lpdf(z_1[1]);
## }
## generated quantities {
##   // actual population-level intercept
##   real b_Intercept = Intercept - dot_product(means_X, b);
## }
```
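If you prefer accessor functions to reaching into the object, `stancode()` and `standata()` (both in the methods list above) return the generated Stan program and the data that brms passes to Stan:

```r
stancode(b3)                          # the same Stan program as b3$model
standata(b3) %>% str(max.level = 1)   # the data list sent to Stan
```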
If you are curious to learn more, check out Statistical rethinking with brms, ggplot2, and the tidyverse: Second edition by Solomon Kurz. He goes through our entire text using `brm()` instead of `ulam()`. His code style is much more like the style you have seen in this class, but he uses ggplot2 instead of ggformula, and he sticks a little closer to McElreath’s (bad) naming conventions. (Lots of things are called `d`.)
Even if you don’t use brms, you might find the model formula notation handy for thinking about the response and predictors in your model.
We haven’t yet talked about how to use a different family (t, binomial, etc.). Perhaps we can take a look at a logistic regression in brms another day; a tiny preview is sketched below.
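Switching family is mostly a one-argument change. A logistic regression sketch (with a hypothetical data frame `MyData` containing a 0/1 response `y` and a predictor `x`) might look like this:

```r
b_logistic <- brm(y ~ x, family = bernoulli(link = "logit"), data = MyData)
```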