21 Dichotomous Response

21.1 What Model?

Let’s suppose we want to predict a person’s gender from their height and weight. What sort of model should we use? Given a particular height and weight, a person might be male or female. So for a given height/weight combination, the distribution of gender is a Bernoulli random variable. Our goal is to convert the height/weight combination into the parameter \(\theta\) specifying what proportion of people with that height and weight combination are male (or female).

So our model looks something like

\[\begin{align*} Y &\sim {\sf Bern}(\theta) \\ \theta &= \mathrm{f}(\texttt{height}, \texttt{weight}) \end{align*}\]

But what function should we use for \(f\)?

21.1.1 A method with issues: Linear regression

One obvious thing to try is a linear function:

\[\begin{align*} Y &\sim {\sf Bern}(\theta) \\ \theta &= \beta_0 + \beta_{\texttt{height}} \texttt{height} + \beta_{\texttt{weight}} \texttt{weight} \end{align*}\]

An interaction term could easily be added as well. This model essentially converts the dichotomous response into a quantitative variable (0’s and 1’s) and uses the same sort of regression model we have used for metric variables, but with a different noise distribution.

One problem with this model is that our linear function might well return values outside the interval \([0, 1]\). This would cause two problems:

  • It is making a prediction we know is incorrect.
  • The Bernoulli random variable needs a value of \(\theta\) between 0 and 1, so our likelihood function would be broken.

This model is still sometimes used (but with normal noise rather than Bernoulli noise, to avoid breaking the likelihood function). But there are better approaches.
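
Here is a quick illustration with made-up data (a hypothetical mini-example, not the NHANES model): a straight-line fit to 0/1 responses happily predicts outside \([0, 1]\).

# hypothetical mini-example: a linear fit to a 0/1 response
x <- c(1, 2, 3, 4, 5)
y <- c(0, 0, 0, 1, 1)
fit <- lm(y ~ x)   # fitted line is -0.5 + 0.3 x
predict(fit, newdata = data.frame(x = c(0, 6)))  # gives -0.5 and 1.3, outside [0, 1]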

21.1.2 The usual approach: Logistic regression

We would like to use a linear function, but convert its range from \((-\infty, \infty)\) to \((0, 1)\). Alternatively, we can convert the range \((0,1)\) to \((-\infty, \infty)\) and parameterize the Bernoulli distribution differently.

The most common transformation is the log odds transformation: \[ \begin{array}{rcl} \mathrm{probability} & \theta & (0, 1) \\ \mathrm{odds} & \theta \mapsto \frac{\theta}{1- \theta} & (0, \infty) \\ \mathrm{log\ odds} & \theta \mapsto \log(\frac{\theta}{1- \theta}) & (-\infty, \infty) \\ \end{array} \]

Read backwards, this is the logistic transformation

\[\begin{array}{rcl} \mathrm{log\ odds} & x & (-\infty, \infty) \\ \mathrm{odds} & x \mapsto e^x & (0, \infty) \\ \mathrm{probability} & x \mapsto \frac{e^x}{1 + e^x} & (0, 1) \\ \end{array}\]

These functions are available via the mosaic package as logit() and ilogit():

logit
## function(x)
## {
##   log(x/(1 - x))
## }
## <bytecode: 0x7f94920e55f0>
## <environment: namespace:mosaicCore>
ilogit
## function (x)
## {
##   exp(x)/(1 + exp(x))
## }
## <bytecode: 0x7f9491bdd618>
## <environment: namespace:mosaicCore>
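
A quick numerical check that these really are inverses (the values in the comments are exact arithmetic):

logit(0.75)           # log(0.75 / 0.25) = log(3), about 1.0986
ilogit(logit(0.75))   # recovers 0.75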

The inverse logit function is also called the logistic function, and the logistic regression model is

\[\begin{align*} Y &\sim {\sf Bern}(\theta) \\ \theta &= \mathrm{logistic}(\beta_0 + \beta_{\texttt{height}} \texttt{height} + \beta_{\texttt{weight}} \texttt{weight}) \\ \mathrm{logit}(\theta) &= \beta_0 + \beta_{\texttt{height}} \texttt{height} + \beta_{\texttt{weight}} \texttt{weight} \end{align*}\]

The logit function is called the link function and the logistic function is the inverse link function.

21.1.3 Other approaches

We could do a similar thing with any pair of functions that convert back and forth between \((0,1)\) and \((-\infty, \infty)\). For any random variable, the cdf (cumulative distribution function) has domain \((-\infty, \infty)\) and range \((0,1)\), so

  • Any cdf/inverse cdf pair can be used in place of the logistic transformation.
    • Using pnorm() and qnorm() is called probit regression.
  • We can also work this backwards and create a random variable that has the logistic function as its cdf. This random variable is called (wait for it) the logistic random variable (see the note after the plot below).
gf_function(ilogit, xlim = c(-6, 6), color = ~"logistic") %>%
  gf_function(pnorm, color = ~"probit (standard)") %>%
  gf_function(pnorm, args = list(sd = 1.8), 
              color = ~"probit (mean = 0, sd = 1.8)") %>%
  gf_theme(legend.position = "top") %>%
  gf_labs(color = "")
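
As an aside, base R already knows the logistic random variable: plogis() and qlogis() are its cdf and inverse cdf, so they agree with ilogit() and logit().

plogis(2)      # same value as ilogit(2)
qlogis(0.75)   # same value as logit(0.75)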

21.1.4 Preparing the data

We will use a subset of the NHANES data for this example. This is a different data set from the example in the book (which uses synthetic data). Since the data there are in pounds and inches, we’ll convert the NHANES data to those units. As in other models, we need to convert our dichotomous variable into 0’s and 1’s. We certainly want to remove children from the model since height and weight patterns are different for children and adults. In fact, we’ll just select out the 22-year-olds. While we are at it, we’ll get rid of the variables we don’t need and remove any rows with missing values in those three variables.

library(NHANES)
library(brms)
nhanes <- 
  NHANES %>% 
  mutate(
    weight = Weight * 2.2,
    height = Height / 2.54,
    male = as.numeric(Gender == "male")
  ) %>%
  filter(Age == 22) %>%
  select(male, height, weight) %>% 
  filter(complete.cases(.))  # remove rows where any of the 3 variables is missing
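
A quick sanity check on the prepared data (the row count should match the number of observations that brm() reports below):

nrow(nhanes)       # should be 135, matching the model output below
head(nhanes, 3)    # male (0/1), height (in), weight (lb)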

21.2 Interpreting Logistic Regression Models

Before getting to our model with two predictors, let’s look at a model with only one predictor.

male_by_weight_brm <-
  brm(male ~ weight, family = bernoulli(), data = nhanes)
## Compiling the C++ model
## recompiling to avoid crashing R session
## Start sampling
male_by_weight_brm
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: male ~ weight 
##    Data: nhanes (Number of observations: 135) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept    -2.12      0.82    -3.78    -0.58       3545 1.00
## weight        0.01      0.00     0.00     0.02       3499 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).
mcmc_combo(as.mcmc.list(stanfit(male_by_weight_brm)))

plot_post(posterior(male_by_weight_brm)$b_weight)

## $posterior
##       ESS    mean  median    mode
## var1 3459 0.01261 0.01246 0.01218
## 
## $hdi
##   prob      lo      hi
## 1 0.95 0.00397 0.02292
p <- marginal_effects(male_by_weight_brm) %>% plot(plot = FALSE)
p$weight %>%
  gf_jitter(male ~ weight, data = nhanes, inherit = FALSE,
            width = 0, height = 0.03, alpha = 0.3)

Things we learn from this model:

  • There is clearly an upward trend – heavier people are more likely to be male than lighter people are. This is seen in the posterior distribution for \(\beta_{\mathrm{weight}}\) (the note after this list converts this coefficient to an odds ratio).

    hdi(posterior(male_by_weight_brm), pars = "b_weight")
    ##      par    lo     hi   mode prob
    ## b_weight 0.004 0.0229 0.0123 0.95
  • The rise is gentle, not steep. There is no weight at which our prediction abruptly switches, with gender much more likely to be male above that weight and female below it.

  • For a given value of the parameter vector \(\langle \beta_0, \beta_{\mathrm{weight}}\rangle\) we can compute the weight at which the model believes the gender split is 50/50. Probability is 0.5 when the odds are 1 and the log odds are 0. And \(\beta_0 + \beta_{\mathrm{weight}} \mathrm{weight} = 0\) when \(\mathrm{weight} = -\beta_0 / \beta_{\mathrm{weight}}\). A 95% HDI for this value is quite wide.

    Post <- 
      posterior(male_by_weight_brm) %>%
      mutate(fifty_fifty = - b_Intercept / b_weight)
    h <- hdi(Post, pars = "fifty_fifty"); h
    ##         par    lo    hi  mode prob
    ## fifty_fifty 126.6 204.3 172.6 0.95
    p$weight  %>% 
      gf_segment(0.5 + 0.5 ~ lo + hi, data = h, color = "red", inherit = FALSE)  
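
Since the linear predictor lives on the log-odds scale, a coefficient can also be read as an odds ratio: each additional pound multiplies the odds of being male by \(e^{\beta_{\mathrm{weight}}}\). Using the posterior mode from above:

exp(0.0123)   # each pound multiplies the odds of "male" by about 1.012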

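Now the model with both predictors. (The summary below was produced by a fit like the following; the call is reconstructed from the formula and family shown in that output.)

logistic_brm <-
  brm(male ~ height + weight, family = bernoulli(), data = nhanes)
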
logistic_brm
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: male ~ height + weight 
##    Data: nhanes (Number of observations: 135) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   -64.73     10.86   -87.60   -45.91       3159 1.00
## height        0.95      0.16     0.67     1.28       3223 1.00
## weight        0.01      0.01    -0.01     0.02       3696 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

21.3 Robust Logistic Regression

For models with a metric response, using t distributions instead of normal distributions made the models more robust (less influenced by unusually large or small values in the data).

Logistic regression can also be strongly influenced by unusual observations. In our example, if there is an unusually heavy woman or an unusually light man, the likelihood will be very small unless we flatten the logistic curve. But we cannot simply replace a normal distribution with a t distribution because our likelihood does not involve a normal distribution.

Kruschke recommends a method that fits a mixture of the usual logistic regression model and a model that is “just guessing” (50-50 chance of being male or female). The Stan development folks (and others) recommend a different approach: fit the model with an inverse t cdf. This is like a probit model, but with more flexibility. In particular, if the degrees of freedom parameter is small, the inverse t cdf approaches its asymptotes more slowly than the probit or logistic models do.
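
A quick sketch of that tail behavior. Here inv_robit is a plain R analogue (for illustration only, using base R’s pt()) of the Stan function defined below, scaled so the t distribution has unit variance:

# R analogue of the inverse robit: a scaled t cdf
inv_robit <- function(x, nu = 4) pt(x / sqrt((nu - 2) / nu), df = nu)
gf_function(ilogit, xlim = c(-6, 6), color = ~"logistic") %>%
  gf_function(inv_robit, color = ~"robit (nu = 4)") %>%
  gf_labs(color = "")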

Getting brm() to fit this model requires a few extra steps:

  • define the inverse t cdf function using stanvars
  • use an identity link (since we will be coding the link/inverse link manually).
  • turn off linear model evaluation of the main formula with nl = TRUE. (This can be used for other non-linear model situations as well.)
# define inverse robit function (Stan syntax)
stan_inv_robit <- "
  real inv_robit(real y, real nu) {
    // scale sqrt((nu - 2) / nu) gives the t distribution unit variance
    return student_t_cdf(y, nu, 0, sqrt((nu - 2) / nu));
  }
"
bform <- bf(
  male ~ inv_robit(theta, nu),
  # non-linear -- use algebra as is in *main* formula
  nl = TRUE,        
  # theta = theta_Intercept + theta_weight * weight + theta_height * height
  theta ~ weight + height,  
  nu ~ 1,            # nu = nu_Intercept (i.e., just one value of nu)
  family = bernoulli("identity")           # identity link function
)

bprior <- 
  prior(normal(0, 5), nlpar = "theta") +
  prior(gamma(2, 0.1), nlpar = "nu", lb = 2) 

robit_brm <- 
  brm(bform, prior = bprior, data = nhanes,
      stanvars = stanvar(scode = stan_inv_robit, block = "functions")
  )
## Compiling the C++ model
## Start sampling
## Warning: There were 3 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
prior_summary(robit_brm)
robit_brm
loo(robit_brm, reloo = TRUE)
loo(logistic_brm, reloo = TRUE)
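
To put the two fits side by side, the loo objects can be passed to loo_compare() (a sketch; output not shown):

# compare predictive performance of the robit and logistic models
loo_compare(
  loo(robit_brm, reloo = TRUE),
  loo(logistic_brm, reloo = TRUE)
)
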
new_data <- expand.grid(
  height = seq(60, 75, by = 3),
  weight = seq(125, 225, by = 25)
)
bind_cols(
  new_data,
  data.frame(predict(logistic_brm, newdata = new_data, transform = ilogit))
)
## height weight Estimate Est.Error   Q2.5  Q97.5
##     60    125   0.5006    0.0121 0.5000 0.5000
##     63    125   0.5051    0.0341 0.5000 0.5000
##     66    125   0.5631    0.1030 0.5000 0.7311
##     69    125   0.6947    0.0841 0.5000 0.7311
##     72    125   0.7275    0.0283 0.7311 0.7311
##     75    125   0.7307    0.0097 0.7311 0.7311
##     60    150   0.5006    0.0115 0.5000 0.5000
##     63    150   0.5069    0.0394 0.5000 0.7311
##     66    150   0.5741    0.1079 0.5000 0.7311
##     69    150   0.7023    0.0763 0.5000 0.7311
##     72    150   0.7284    0.0246 0.7311 0.7311
##     75    150   0.7308    0.0073 0.7311 0.7311
##     60    175   0.5003    0.0082 0.5000 0.5000
##     63    175   0.5098    0.0466 0.5000 0.7311
##     66    175   0.5835    0.1110 0.5000 0.7311
##     69    175   0.7069    0.0707 0.5000 0.7311
##     72    175   0.7289    0.0221 0.7311 0.7311
##     75    175   0.7309    0.0052 0.7311 0.7311
##     60    200   0.5007    0.0126 0.5000 0.5000
##     63    200   0.5113    0.0499 0.5000 0.7311
##     66    200   0.5964    0.1139 0.5000 0.7311
##     69    200   0.7119    0.0638 0.5000 0.7311
##     72    200   0.7296    0.0186 0.7311 0.7311
##     75    200   0.7309    0.0052 0.7311 0.7311
##     60    225   0.5014    0.0178 0.5000 0.5000
##     63    225   0.5136    0.0543 0.5000 0.7311
##     66    225   0.6095    0.1154 0.5000 0.7311
##     69    225   0.7133    0.0615 0.5000 0.7311
##     72    225   0.7297    0.0175 0.7311 0.7311
##     75    225   0.7310    0.0037 0.7311 0.7311
marginal_effects(
  logistic_brm, effects = "height",
  conditions = data.frame(weight = seq(125, 225, by = 25)))