This example is loosely based on “Multilevel (Hierarchical) Modeling: What it Can and Cannot Do” by Andrew Gelman

Radon Data

data("radon", package = "rstanarm")

# Stan doesn't allow a variable to be named floor:

Radon <- radon %>%
  mutate(basement = 1 - floor) %>%
  select(-floor)

Radon %>% slice_sample(n=5) %>% pander()
 county    log_radon   log_uranium   basement
--------- ----------- ------------- ----------
 COOK        0.4055     -0.505           1
 ANOKA       1.099      -0.8473          1
 STLOUIS     0.47       -0.4747          1
 STLOUIS     0.6419     -0.4747          1
 DAKOTA     -0.5108     -0.02415         1

A radon model

In that article, the model is described like this (writing \(j[i]\) for the county of house \(i\), and using variances rather than standard deviations):

\[\begin{align*} y_i & \sim N(\alpha_{j[i]} + \beta x_i, \sigma_y^2) \\ \alpha_j &\sim N(\gamma_0 + \gamma_1 u_j, \sigma_\alpha^2) \end{align*}\]

We might write that more like

\[\begin{align*} y_i & \sim {\sf Norm}(\mu, \sigma_y) \\ \mu &= \alpha_{c_i} + \beta x_i \\ \alpha_{c_i} &\sim {\sf Norm}(\gamma_0 + \gamma_1 u_{c_i}, \sigma_\alpha) \end{align*}\]

Of course, we still need priors for our parameters: \(\sigma_y\), \(\sigma_\alpha\), \(\beta\), \(\gamma_0\), and \(\gamma_1\). (In the paper they use uniform priors, which probably isn’t the best idea; even the author now recommends against that.)

Let’s rewrite this some more:

\[\begin{align*} y_i & \sim {\sf Norm}(\mu, \sigma_y) \\ \mu &= \alpha_{c_i} + \beta x_i \\ \alpha_{c_i} &\sim \gamma_0 + \gamma_1 u_{c_i} + {\sf Norm}(0, \sigma_\alpha) \end{align*}\]

And one more time:

\[\begin{align*} y_i & \sim {\sf Norm}(\mu, \sigma_y) \\ \mu &= \gamma_0 + \gamma_1 u_{c_i} + \tilde{\alpha}_{c_i} + \beta x_i \\ \tilde{\alpha}_{c_i} &\sim {\sf Norm}(0, \sigma_\alpha) \end{align*}\]

Fitting the model with ulam()

We could write this part of our model for ulam() as something like this:

alist(
  y ~ dnorm(mu, sigma),
  mu <- a0 + b * x + g * u + a[c],
  a[c] ~ dnorm(0, sigma_county),
  ...
)

# or with real variable names

alist(
  log_radon ~ dnorm(mu, sigma),
  mu <- a0 + b * basement + g * log_uranium + a[county],
  a[county] ~ dnorm(0, sigma_county),
  ...
)

assuming that

  • county is an index variable,

  • basement is an indicator variable, and

  • u (log_uranium) has the same value for all homes in the same county – as it does in our data set: it’s the county average level.

    (Alternatively, we could have kept the data in a list rather than a data frame. Then we would only need to record the value of u once for each county, but we would need to do some additional indexing; see the sketch below.)
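Here is a rough sketch of that list-format alternative (we won't use it below); in the model, g * u[county_idx] would then replace g * log_uranium:

Radon2 <- Radon %>% mutate(county_idx = as.numeric(factor(county)))
radon_list <- list(
  log_radon  = Radon2$log_radon,
  basement   = Radon2$basement,
  county_idx = Radon2$county_idx,
  # one uranium value per county; u[county_idx] recovers the per-home values
  u = Radon2 %>% distinct(county_idx, log_uranium) %>%
    arrange(county_idx) %>% pull(log_uranium)
)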

And here’s the model fit with ulam():

u1 <- 
  ulam(
    data = Radon %>% 
      mutate(county_idx = as.numeric(factor(county))) %>%
      select(log_radon, county_idx, log_uranium, basement),
    alist(
      log_radon ~ dnorm(mu, sigma),
      mu <- a0 + b * basement + c * log_uranium + a[county_idx],
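      # note: the log_uranium coefficient (g in the sketches above) is named c here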
      vector[85]: a ~ dnorm(0, sigma_county),     # prior for a[county_idx]; 85 counties
      a0 ~ dnorm(0, 2),
      b ~ dnorm(0, 3),
      c ~ dnorm(0, 1),
      sigma ~ dexp(1),
      sigma_county ~ dexp(1)
    ),
    iter = 4000, chains = 4, cores = 4, refresh = 0,
    file = "fits/u1-radon"
  ) 

precis(u1)
## 85 vector or matrix parameters hidden. Use depth=2 to show them.
##                   mean         sd       5.5%     94.5%     n_eff    Rhat4
## a0           0.8546605 0.06317660 0.75394679 0.9551476 4700.2412 1.001266
## b            0.6391118 0.06637379 0.53356667 0.7452533 4418.3509 1.000485
## c            0.6943569 0.08975047 0.55269916 0.8377057 3233.6100 1.001708
## sigma        0.7296999 0.01766289 0.70168664 0.7583951 3961.6272 1.000905
## sigma_county 0.1458507 0.05005031 0.06588151 0.2272323  219.3592 1.012524
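Notice the much smaller n_eff (and larger Rhat) for sigma_county. A trace plot is a quick way to inspect how the chains are exploring that parameter (a minimal check using rethinking's traceplot()):

traceplot(u1, pars = "sigma_county")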

Some tuning or reparameterization might help us avoid the warnings Stan gives for this model, but let’s move on to something new…

Fitting the model with brms

The basics

Let’s fit this model again using brms (Bayesian regression modeling with Stan). This package provides a way to describe models like this one concisely, and it takes care of some bookkeeping for us (like creating index variables, dealing with factors, removing missing data values, etc.).

The model is described with a formula and a prior. The formula puts the response on the left side and the predictors on the right side:

log_radon ~ 1 + basement + log_uranium
# alternatively
log_radon ~ basement + log_uranium     # the 1 + is assumed; use -1 or 0 to get rid of intercept

This says that the mean of log_radon gets a coefficient/parameter for each of these terms:

mu = a * 1 + b * basement + c * log_uranium
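If you want to see exactly which columns this formula generates, base R's model.matrix() performs the same expansion (peeking at just a few rows here):

model.matrix(log_radon ~ 1 + basement + log_uranium, data = Radon) %>% head(3)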

This isn’t yet our model from above, but brm() is happy to fit it for us, so let’s give that a try:

library(brms)
b1 <- 
  brm(log_radon ~ 1 + basement + log_uranium, data = Radon,
       file = "fits/b1-radon")

Notice that there are several things we have not specified:

  • the priors,
  • the distribution family for the response,
  • a link function, and
  • tuning parameters for Stan.

brms has provided some default values for those. Let’s see if we can figure out what they are from the output below, and make sure we understand what the parameters are.

b1
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: log_radon ~ 1 + basement + log_uranium 
##    Data: Radon (Number of observations: 919) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       0.85      0.06     0.73     0.97 1.00     4466     2890
## basement        0.62      0.07     0.49     0.75 1.00     4208     3001
## log_uranium     0.75      0.07     0.62     0.89 1.00     4188     3078
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.74      0.02     0.71     0.78 1.00     4757     2777
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
  • Family: gaussian – that’s a normal distribution
  • Links: mu = identity – using the identity link
  • Population-Level Effects – the 3 main parameters of our model
  • Family Specific Parameters – here’s where we find the sigma for our normal distribution
  • We can also see information about the number of chains, iterations, etc.

bayesplot can work directly with brms objects:

b1 %>% mcmc_areas(pars = vars(-lp__))

b1 %>% mcmc_pairs(pars = vars(-lp__))

But what about the priors?

b1 %>% prior_summary()
##                   prior     class        coef group resp dpar nlpar bound
##                  (flat)         b                                        
##                  (flat)         b    basement                            
##                  (flat)         b log_uranium                            
##  student_t(3, 1.3, 2.5) Intercept                                        
##    student_t(3, 0, 2.5)     sigma                                        
##        source
##       default
##  (vectorized)
##  (vectorized)
##       default
##       default

Let’s change the priors:

df_stats( ~ log_radon + log_uranium, data = Radon)
##      response        min         Q1      median        Q3       max       mean
## 1   log_radon -2.3025851  0.6931472  1.30833282 1.8082888 3.8774316  1.2647792
## 2 log_uranium -0.8818289 -0.4746737 -0.09652081 0.1832378 0.5280249 -0.1317092
##          sd   n missing
## 1 0.8193550 919       0
## 2 0.3653242 919       0
prior1 <-
  prior(normal(0, 3), class = "b", coef = "basement") + 
  prior(normal(0, 1), class = "b", coef = "log_uranium")
prior1
##         prior class        coef group resp dpar nlpar bound source
##  normal(0, 3)     b    basement                               user
##  normal(0, 1)     b log_uranium                               user
# same model as before, but with a different prior
# just saves some typing here, but if a recompile isn't needed, it can save execution time
b2 <- update(b1, prior = prior1, file = "fits/b2-radon")
## The desired updates require recompiling the model
b2
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: log_radon ~ 1 + basement + log_uranium 
##    Data: Radon (Number of observations: 919) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       0.85      0.06     0.72     0.97 1.00     4110     3115
## basement        0.62      0.07     0.49     0.75 1.00     4105     2954
## log_uranium     0.75      0.07     0.62     0.88 1.00     4153     3278
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.74      0.02     0.71     0.78 1.00     4628     3248
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
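brm() can also sample from the priors alone, which is a handy way to see what priors like these imply before the data get involved. A sketch using the sample_prior argument (the file name is just a suggestion):

b2_prior <-
  brm(log_radon ~ 1 + basement + log_uranium, data = Radon,
      prior = prior1, sample_prior = "only",
      file = "fits/b2-radon-prior")
pp_check(b2_prior)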

Adding the multilevel part

Now we add on a bit to tell R about the multi-level part. We want an intercept/adjustment for each county:

ml_formula <- log_radon ~ 1 +  basement + log_uranium + (1 | county)

The (1 | county) part of the formula is what asks for a separate intercept adjustment for each county.
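If we later wanted the effect of basement to vary by county as well, the same notation extends naturally; a hypothetical formula (we won't fit this one) is

log_radon ~ 1 + basement + log_uranium + (1 + basement | county)

Back to the intercepts-only model: get_prior() shows which priors brms plans to use by default.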

get_prior(ml_formula, data = Radon)
##                   prior     class        coef  group resp dpar nlpar bound
##                  (flat)         b                                         
##                  (flat)         b    basement                             
##                  (flat)         b log_uranium                             
##  student_t(3, 1.3, 2.5) Intercept                                         
##    student_t(3, 0, 2.5)        sd                                         
##    student_t(3, 0, 2.5)        sd             county                      
##    student_t(3, 0, 2.5)        sd   Intercept county                      
##    student_t(3, 0, 2.5)     sigma                                         
##        source
##       default
##  (vectorized)
##  (vectorized)
##       default
##       default
##  (vectorized)
##  (vectorized)
##       default

We can live with those defaults for the multilevel part (class sd) for now, but we’ll reuse prior1 for the population-level parameters.
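If we preferred to mimic the dexp(1) prior we gave sigma_county in the ulam() fit, brms lets us set the group-level sd prior explicitly; a sketch (we stick with the default below):

prior2 <- prior1 + prior(exponential(1), class = "sd", group = "county")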

b3 <- 
  brm(ml_formula, data = Radon,
      prior = 
        prior(normal(0, 3), class = "b", coef = "basement") +
        prior(normal(0, 1), class = "b", coef = "log_uranium"),
      control = list(adapt_delta = 0.95),
      cores = 4, chains = 4,
      file = "fits/b3-radon")
b3 %>% prior_summary()
##                   prior     class        coef  group resp dpar nlpar bound
##                  (flat)         b                                         
##            normal(0, 3)         b    basement                             
##            normal(0, 1)         b log_uranium                             
##  student_t(3, 1.3, 2.5) Intercept                                         
##    student_t(3, 0, 2.5)        sd                                         
##    student_t(3, 0, 2.5)        sd             county                      
##    student_t(3, 0, 2.5)        sd   Intercept county                      
##    student_t(3, 0, 2.5)     sigma                                         
##        source
##       default
##          user
##          user
##       default
##       default
##  (vectorized)
##  (vectorized)
##       default
b3
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: log_radon ~ 1 + basement + log_uranium + (1 | county) 
##    Data: Radon (Number of observations: 919) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~county (Number of levels: 85) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.15      0.05     0.06     0.25 1.00     1016     1505
## 
## Population-Level Effects: 
##             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept       0.86      0.07     0.73     0.98 1.00     6366     2964
## basement        0.64      0.07     0.50     0.77 1.00     7346     2914
## log_uranium     0.69      0.09     0.51     0.87 1.00     3426     3174
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.73      0.02     0.70     0.76 1.00     5239     3196
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Now we have a Group-Level Effects section, which is new. Let’s look at something more familiar – posterior samples!

b3 %>% posterior_samples() %>% class() 
## [1] "data.frame"
b3 %>% posterior_samples() %>% dim() 
## [1] 4000   91
# we don't need to see all the counties
b3 %>% posterior_summary() %>% head(15)   
##                                   Estimate  Est.Error        Q2.5     Q97.5
## b_Intercept                    0.856738040 0.06518934  0.72809885 0.9826046
## b_basement                     0.637987229 0.06752767  0.50368498 0.7717756
## b_log_uranium                  0.691784676 0.08931374  0.51305657 0.8662259
## sd_county__Intercept           0.151060892 0.04779069  0.06275251 0.2482776
## sigma                          0.729380517 0.01744782  0.69654177 0.7649453
## r_county[AITKIN,Intercept]    -0.022035847 0.14170239 -0.31267185 0.2586196
## r_county[ANOKA,Intercept]      0.007649523 0.09846847 -0.19513776 0.2000748
## r_county[BECKER,Intercept]     0.012713290 0.14122572 -0.28019722 0.2990963
## r_county[BELTRAMI,Intercept]   0.107635334 0.14028245 -0.14510062 0.4051824
## r_county[BENTON,Intercept]     0.008644129 0.14513270 -0.28962391 0.2950277
## r_county[BIGSTONE,Intercept]  -0.029953961 0.14749329 -0.33889872 0.2529142
## r_county[BLUEEARTH,Intercept]  0.122622606 0.12451762 -0.09940510 0.3855848
## r_county[BROWN,Intercept]      0.042164195 0.14433075 -0.23048316 0.3448501
## r_county[CARLTON,Intercept]   -0.068473684 0.13112441 -0.34203917 0.1766944
## r_county[CARVER,Intercept]    -0.004774980 0.13645426 -0.27257347 0.2674546

Now we can see that we are indeed getting a parameter for each county. Phew! We can also see what all the parameters are named. The names are long, but very clear.
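Those long, clear names make it easy to combine draws. For example, here is a sketch of the posterior for the intercept in one particular county (ANOKA), adding the population-level intercept and that county's adjustment:

b3 %>%
  posterior_samples() %>%
  mutate(anoka_intercept = b_Intercept + `r_county[ANOKA,Intercept]`) %>%
  gf_density(~ anoka_intercept)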

Comparing brm() to ulam()

The biggest advantage of ulam() (for learning) is that it is very clear and explicit. Every part of the model is right there for you to see – and you need to think about every part of the model. That’s why we’ve been using it.

But if you ever do more Bayesian modeling, you might prefer to use brm() because:

  • It is more widely used, better documented, more thoroughly tested, etc.

  • ulam() will get harder to use as models become more complex.

  • The user base is larger, so you are likely to find help more easily.

  • The syntax is similar to the syntax used by lm() and lmer() to fit non-Bayesian models, so you can fit the same sorts of models in Bayesian and frequentist ways without learning more syntax. (See the lmer() sketch just after the list of methods below.)

  • There is a broader suite of additional tools for brms.

    methods(class = "brmsfit")
    ##  [1] add_criterion           add_ic                  as.array               
    ##  [4] as.data.frame           as.matrix               as.mcmc                
    ##  [7] autocor                 bayes_factor            bayes_R2               
    ## [10] bridge_sampler          coef                    conditional_effects    
    ## [13] conditional_smooths     control_params          cv_varsel              
    ## [16] expose_functions        family                  fitted_draws           
    ## [19] fitted                  fixef                   formula                
    ## [22] get_refmodel            getCall                 hypothesis             
    ## [25] kfold                   launch_shinystan        log_lik                
    ## [28] log_posterior           logLik                  loo_compare            
    ## [31] loo_linpred             loo_model_weights       loo_moment_match       
    ## [34] loo_predict             loo_predictive_interval loo_R2                 
    ## [37] loo_subsample           loo                     LOO                    
    ## [40] marginal_effects        marginal_smooths        mcmc_plot              
    ## [43] model_weights           model.frame             neff_ratio             
    ## [46] ngrps                   nobs                    nsamples               
    ## [49] nuts_params             pairs                   parnames               
    ## [52] plot                    post_prob               posterior_average      
    ## [55] posterior_epred         posterior_interval      posterior_linpred      
    ## [58] posterior_predict       posterior_samples       posterior_smooths      
    ## [61] posterior_summary       posterior               pp_average             
    ## [64] pp_check                pp_mixture              predict                
    ## [67] predicted_draws         predictive_error        predictive_interval    
    ## [70] prepare_predictions     print                   prior_samples          
    ## [73] prior_summary           ranef                   reloo                  
    ## [76] residual_draws          residuals               rhat                   
    ## [79] stancode                standata                stanfit                
    ## [82] stanplot                summary                 tidy_draws             
    ## [85] update                  VarCorr                 varsel                 
    ## [88] vcov                    waic                    WAIC                   
    ## see '?methods' for accessing help and source code
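To illustrate the point above about shared syntax: the very same formula fits a frequentist mixed-effects model via lme4 (a sketch, assuming lme4 is installed; we won't examine the result here).

library(lme4)
lmer(log_radon ~ 1 + basement + log_uranium + (1 | county), data = Radon)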

Here are just a few examples of the things we can do with our brmsfit object.

waic(b3)
## Warning: 
## 7 (0.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.
## 
## Computed from 4000 by 919 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic  -1026.9 28.8
## p_waic        26.3  2.3
## waic        2053.9 57.5
## 
## 7 (0.8%) p_waic estimates greater than 0.4. We recommend trying loo instead.
loo(b3)
## 
## Computed from 4000 by 919 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo  -1027.2 28.8
## p_loo        26.5  2.4
## looic      2054.3 57.6
## ------
## Monte Carlo SE of elpd_loo is 0.1.
## 
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
loo_compare(b2, b3)
##    elpd_diff se_diff
## b3  0.0       0.0   
## b2 -4.8       3.6
b3 <- b3 %>% add_criterion("loo")
b2 <- b2 %>% add_criterion("loo")
loo_compare(b2, b3)
##    elpd_diff se_diff
## b3  0.0       0.0   
## b2 -4.8       3.6
tidy_draws(b3) %>%
  gf_density(~ b_basement)

mcmc_areas(b3, pars = c("b_basement", "b_log_uranium"))

predict(b3) %>% str()
##  num [1:919, 1:4] 0.359 0.991 0.976 0.995 0.905 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "Estimate" "Est.Error" "Q2.5" "Q97.5"
PRadon <- Radon %>% select(-log_radon) %>%
  group_by(county, log_uranium, basement) %>% distinct() 
PRadon %>% head(4)
## # A tibble: 4 x 3
## # Groups:   county, log_uranium, basement [4]
##   county log_uranium basement
##   <fct>        <dbl>    <dbl>
## 1 AITKIN      -0.689        0
## 2 AITKIN      -0.689        1
## 3 ANOKA       -0.847        1
## 4 ANOKA       -0.847        0
predict(b3, newdata = PRadon) %>%
  as.data.frame() %>%
  bind_cols(PRadon) %>%
  gf_pointrange(Estimate + Q2.5 + Q97.5 ~ log_uranium) %>%
  gf_facet_grid(basement ~ .)
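Note that predict() intervals include the observation-level noise (sigma). If we only want uncertainty about the mean response, posterior_epred() (one of the methods listed above) returns draws of mu instead; a quick sketch:

posterior_epred(b3, newdata = PRadon) %>% dim()   # posterior draws x rows of PRadon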

b3$model
## // generated with brms 2.14.4
## functions {
## }
## data {
##   int<lower=1> N;  // total number of observations
##   vector[N] Y;  // response variable
##   int<lower=1> K;  // number of population-level effects
##   matrix[N, K] X;  // population-level design matrix
##   // data for group-level effects of ID 1
##   int<lower=1> N_1;  // number of grouping levels
##   int<lower=1> M_1;  // number of coefficients per level
##   int<lower=1> J_1[N];  // grouping indicator per observation
##   // group-level predictor values
##   vector[N] Z_1_1;
##   int prior_only;  // should the likelihood be ignored?
## }
## transformed data {
##   int Kc = K - 1;
##   matrix[N, Kc] Xc;  // centered version of X without an intercept
##   vector[Kc] means_X;  // column means of X before centering
##   for (i in 2:K) {
##     means_X[i - 1] = mean(X[, i]);
##     Xc[, i - 1] = X[, i] - means_X[i - 1];
##   }
## }
## parameters {
##   vector[Kc] b;  // population-level effects
##   real Intercept;  // temporary intercept for centered predictors
##   real<lower=0> sigma;  // residual SD
##   vector<lower=0>[M_1] sd_1;  // group-level standard deviations
##   vector[N_1] z_1[M_1];  // standardized group-level effects
## }
## transformed parameters {
##   vector[N_1] r_1_1;  // actual group-level effects
##   r_1_1 = (sd_1[1] * (z_1[1]));
## }
## model {
##   // likelihood including all constants
##   if (!prior_only) {
##     // initialize linear predictor term
##     vector[N] mu = Intercept + rep_vector(0.0, N);
##     for (n in 1:N) {
##       // add more terms to the linear predictor
##       mu[n] += r_1_1[J_1[n]] * Z_1_1[n];
##     }
##     target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
##   }
##   // priors including all constants
##   target += normal_lpdf(b[1] | 0, 3);
##   target += normal_lpdf(b[2] | 0, 1);
##   target += student_t_lpdf(Intercept | 3, 1.3, 2.5);
##   target += student_t_lpdf(sigma | 3, 0, 2.5)
##     - 1 * student_t_lccdf(0 | 3, 0, 2.5);
##   target += student_t_lpdf(sd_1 | 3, 0, 2.5)
##     - 1 * student_t_lccdf(0 | 3, 0, 2.5);
##   target += std_normal_lpdf(z_1[1]);
## }
## generated quantities {
##   // actual population-level intercept
##   real b_Intercept = Intercept - dot_product(means_X, b);
## }
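The data block at the top of that Stan program gets filled in automatically. If you are curious how our data frame is mapped onto N, Y, X, J_1, and friends, standata() (also in the list of methods above) shows the actual values passed to Stan:

standata(b3) %>% str()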

Learning more about brms

If you are curious to learn more, check out the vignettes that come with the package – vignette("brms_overview") and vignette("brms_multilevel") are good places to start – and the rest of the brms documentation.

Even if you don’t use brms, you might find the model formula notation handy for thinking about the response and predictors in your model.

Logistic Regression with brms

We haven’t yet talked about how to use a different family (t, binomial, etc.). Perhaps we can take a look at logistic regression in brms another day.
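As a preview: mostly it is a matter of changing the family argument. A hypothetical sketch, where success, x, and SomeData are placeholders rather than variables from our radon data:

brm(success ~ x, family = bernoulli(link = "logit"), data = SomeData)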