21 Dichotomous Response

21.1 What Model?

Let’s suppose we want to predict a person’s gender from their height and weight. What sort of model should we use? Given a particular height and weight, a person might be male or female. So for a given height/weight combination, the distribution of gender is a Bernoulli random variable. Our goal is to convert the height/weight combination into the parameter \(\theta\) specifying what proportion of people with that height and weight combination are male (or female).

So our model looks something like

\[\begin{align*} Y &\sim {\sf Bern}(\theta) \\ \theta &= \mathrm{f}(\texttt{height}, \texttt{weight}) \end{align*}\]

But what function should we use for \(f\)?

21.1.1 A method with issues: Linear regression

One obvious thing to try is a linear function:

\[\begin{align*} Y &\sim {\sf Bern}(\theta) \\ \theta &= \beta_0 + \beta_{\texttt{height}} \texttt{height} + \beta_{\texttt{weight}} \texttt{weight} \end{align*}\]

An interaction term could easily be added as well. This model essentially converts the dichotomous response into a quantitative variable (0’s and 1’s) and uses the same sort of regression model we have used for metric variables, but with a different noise distribution.

One problem with this model is that our linear function might well return values outside the interval \([0, 1]\). This would cause two problems:

  • It is making a prediction we know is incorrect.
  • The Bernoulli random variable needs a value of \(\theta\) between 0 and 1, so our likelihood function would be broken.

This model is still sometimes used (but with normal noise rather than Bernoulli noise, to avoid breaking the likelihood function). But there are better approaches.
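
Here is a quick illustration with made-up data (a hypothetical mini-example, not the NHANES model): a straight-line fit to 0/1 responses happily predicts outside \([0, 1]\).

# hypothetical mini-example: a linear fit to a 0/1 response
x <- c(1, 2, 3, 4, 5)
y <- c(0, 0, 0, 1, 1)
fit <- lm(y ~ x)   # fitted line is -0.5 + 0.3 x
predict(fit, newdata = data.frame(x = c(0, 6)))  # gives -0.5 and 1.3, outside [0, 1]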

21.1.2 The usual approach: Logistic regression

We would like to use a linear function, but convert its range from \((-\infty, \infty)\) to \((0, 1)\). Alternatively, we can convert the range \((0,1)\) to \((-\infty, \infty)\) and parameterize the Bernoulli distribution differently.

The most common transformation is the log odds transformation: \[ \begin{array}{rcl} \mathrm{probability} & \theta & (0, 1) \\ \mathrm{odds} & \theta \mapsto \frac{\theta}{1- \theta} & (0, \infty) \\ \mathrm{log\ odds} & \theta \mapsto \log(\frac{\theta}{1- \theta}) & (-\infty, \infty) \\ \end{array} \]

Read backwards, this is the logistic transformation

\[\begin{array}{rcl} \mathrm{log\ odds} & x & (-\infty, \infty) \\ \mathrm{odds} & x \mapsto e^x & (0, \infty) \\ \mathrm{probability} & x \mapsto \frac{e^x}{1 + e^x} & (0, 1) \\ \end{array}\]

These functions are available via the mosaic package as logit() and ilogit():

logit
## function(x)
## {
##   log(x/(1 - x))
## }
## <bytecode: 0x7f94920e55f0>
## <environment: namespace:mosaicCore>
ilogit
## function (x)
## {
##   exp(x)/(1 + exp(x))
## }
## <bytecode: 0x7f9491bdd618>
## <environment: namespace:mosaicCore>
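
A quick numerical check that these really are inverses (the values in the comments are exact arithmetic):

logit(0.75)           # log(0.75 / 0.25) = log(3), about 1.0986
ilogit(logit(0.75))   # recovers 0.75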

The inverse logit function is also called the logistic function, and the logistic regression model is

\[\begin{align*} Y &\sim {\sf Bern}(\theta) \\ \theta &= \mathrm{logistic}(\beta_0 + \beta_{\texttt{height}} \texttt{height} + \beta_{\texttt{weight}} \texttt{weight}) \\ \mathrm{logit}(\theta) &= \beta_0 + \beta_{\texttt{height}} \texttt{height} + \beta_{\texttt{weight}} \texttt{weight} \end{align*}\]

The logit function is called the link function and the logistic function is the inverse link function.

21.1.3 Other approaches

We could do a similar thing with any pair of functions that convert back and forth between \((0,1)\) and \((-\infty, \infty)\). For any random variable, the cdf (cumulative distribution function) has domain \((-\infty, \infty)\) and range \((0,1)\), so

  • Any cdf/inverse cdf pair can be used in place of the logistic transformation.
    • Using pnorm() and qnorm() is called probit regression.
  • We can also work this backwards and create a random variable that has the logistic function as its cdf. This random variable is called (wait for it) the logistic random variable (see the note after the plot below).
gf_function(ilogit, xlim = c(-6, 6), color = ~"logistic") %>%
  gf_function(pnorm, color = ~"probit (standard)") %>%
  gf_function(pnorm, args = list(sd = 1.8), 
              color = ~"probit (mean = 0, sd = 1.8)") %>%
  gf_theme(legend.position = "top") %>%
  gf_labs(color = "")
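
As an aside, base R already knows the logistic random variable: plogis() and qlogis() are its cdf and inverse cdf, so they agree with ilogit() and logit().

plogis(2)      # same value as ilogit(2)
qlogis(0.75)   # same value as logit(0.75)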

21.1.4 Preparing the data

We will use a subset of the NHANES data for this example. This is a different data set from the example in the book (which uses synthetic data). Since the data there are in pounds and inches, we’ll convert the NHANES data to those units. As in other models, we need to convert our dichotomous variable into 0’s and 1’s. We certainly want to remove children from the model since height and weight patterns are different for children and adults. In fact, we’ll just select out the 22-year-olds. While we are at it, we’ll get rid of the variables we don’t need and remove any rows with missing values in those three variables.

library(NHANES)
library(brms)
nhanes <- 
  NHANES %>% 
  mutate(
    weight = Weight * 2.2,
    height = Height / 2.54,
    male = as.numeric(Gender == "male")
  ) %>%
  filter(Age == 22) %>%
  select(male, height, weight) %>% 
  filter(complete.cases(.))  # remove rows where any of the 3 variables is missing
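
A quick sanity check on the prepared data (the row count should match the number of observations that brm() reports below):

nrow(nhanes)       # should be 135, matching the model output below
head(nhanes, 3)    # male (0/1), height (in), weight (lb)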

21.2 Interpreting Logistic Regression Models

Before getting to our model with two predictors, let’s look at a model with only one predictor.

male_by_weight_brm <-
  brm(male ~ weight, family = bernoulli(), data = nhanes)
## Compiling the C++ model
## recompiling to avoid crashing R session
## Start sampling
male_by_weight_brm
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: male ~ weight 
##    Data: nhanes (Number of observations: 135) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept    -2.12      0.82    -3.78    -0.58       3545 1.00
## weight        0.01      0.00     0.00     0.02       3499 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).
mcmc_combo(as.mcmc.list(stanfit(male_by_weight_brm)))

plot_post(posterior(male_by_weight_brm)$b_weight)

## $posterior
##       ESS    mean  median    mode
## var1 3459 0.01261 0.01246 0.01218
## 
## $hdi
##   prob      lo      hi
## 1 0.95 0.00397 0.02292
p <- marginal_effects(male_by_weight_brm) %>% plot(plot = FALSE)
p$weight %>%
  gf_jitter(male ~ weight, data = nhanes, inherit = FALSE,
            width = 0, height = 0.03, alpha = 0.3)

Things we learn from this model:

  • There is clearly an upward trend – heavier people are more likely to be male than lighter people are. This is seen in the posterior distribution for \(\beta_{\mathrm{weight}}\) (the note after this list converts this coefficient to an odds ratio).

    hdi(posterior(male_by_weight_brm), pars = "b_weight")
    ##      par    lo     hi   mode prob
    ## b_weight 0.004 0.0229 0.0123 0.95
  • The rise is gentle, not steep. There is no weight at which our prediction abruptly switches, with gender much more likely to be male above that weight and female below it.

  • For a given value of the parameter vector \(\langle \beta_0, \beta_{\mathrm{weight}}\rangle\) we can compute the weight at which the model believes the gender split is 50/50. Probability is 0.5 when the odds are 1 and the log odds are 0. And \(\beta_0 + \beta_{\mathrm{weight}} \mathrm{weight} = 0\) when \(\mathrm{weight} = -\beta_0 / \beta_{\mathrm{weight}}\). A 95% HDI for this value is quite wide.

    Post <- 
      posterior(male_by_weight_brm) %>%
      mutate(fifty_fifty = - b_Intercept / b_weight)
    h <- hdi(Post, pars = "fifty_fifty"); h
    ##         par    lo    hi  mode prob
    ## fifty_fifty 126.6 204.3 172.6 0.95
    p$weight  %>% 
      gf_segment(0.5 + 0.5 ~ lo + hi, data = h, color = "red", inherit = FALSE)  
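
Since the linear predictor lives on the log-odds scale, a coefficient can also be read as an odds ratio: each additional pound multiplies the odds of being male by \(e^{\beta_{\mathrm{weight}}}\). Using the posterior mode from above:

exp(0.0123)   # each pound multiplies the odds of "male" by about 1.012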

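Now the model with both predictors. (The summary below was produced by a fit like the following; the call is reconstructed from the formula and family shown in that output.)

logistic_brm <-
  brm(male ~ height + weight, family = bernoulli(), data = nhanes)
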
logistic_brm
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: male ~ height + weight 
##    Data: nhanes (Number of observations: 135) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept   -64.73     10.86   -87.60   -45.91       3159 1.00
## height        0.95      0.16     0.67     1.28       3223 1.00
## weight        0.01      0.01    -0.01     0.02       3696 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

21.3 Robust Logistic Regression

For models with a metric response, using t distributions instead of normal distributions made the models more robust (less influenced by unusually large or small values in the data).

Logistic regression can also be strongly influenced by unusual observations. In our example, if there is an unusually heavy woman or an unusually light man, the likelihood will be very small unless we flatten the logistic curve. But we cannot simply replace a normal distribution with a t distribution because our likelihood does not involve a normal distribution.

Kruschke recommends a method that fits a mixture of the usual logistic regression model and a model that is “just guessing” (50-50 chance of being male or female). The Stan development folks (and others) recommend a different approach: fit the model with an inverse t cdf. This is like a probit model, but with more flexibility. In particular, if the degrees of freedom parameter is small, the inverse t cdf approaches its asymptotes more slowly than the probit or logistic models do.
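
A quick sketch of that tail behavior. Here inv_robit is a plain R analogue (for illustration only, using base R’s pt()) of the Stan function defined below, scaled so the t distribution has unit variance:

# R analogue of the inverse robit: a scaled t cdf
inv_robit <- function(x, nu = 4) pt(x / sqrt((nu - 2) / nu), df = nu)
gf_function(ilogit, xlim = c(-6, 6), color = ~"logistic") %>%
  gf_function(inv_robit, color = ~"robit (nu = 4)") %>%
  gf_labs(color = "")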

Getting brm() to fit this model requires a few extra steps:

  • define the inverse t cdf function using stanvars
  • use an identity link (since we will be coding the link/inverse link manually).
  • turn off linear model evaluation of the main formula with nl = TRUE. (This can be used for other non-linear model situations as well.)
# define inverse robit function (Stan syntax)
stan_inv_robit <- "
  real inv_robit(real y, real nu) {
    // scale sqrt((nu - 2) / nu) gives the t distribution unit variance
    return student_t_cdf(y, nu, 0, sqrt((nu - 2) / nu));
  }
"
bform <- bf(
  male ~ inv_robit(theta, nu),
  # non-linear -- use algebra as is in *main* formula
  nl = TRUE,        
  # theta = theta_Intercept + theta_weight * weight + theta_height * height
  theta ~ weight + height,  
  nu ~ 1,            # nu = nu_Intercept (i.e., just one value of nu)
  family = bernoulli("identity")           # identity link function
)

bprior <- 
  prior(normal(0, 5), nlpar = "theta") +
  prior(gamma(2, 0.1), nlpar = "nu", lb = 2) 

robit_brm <- 
  brm(bform, prior = bprior, data = nhanes,
      stanvars = stanvar(scode = stan_inv_robit, block = "functions")
  )
## Compiling the C++ model
## Start sampling
## Warning: There were 3 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
prior_summary(robit_brm)
robit_brm
loo(robit_brm, reloo = TRUE)
loo(logistic_brm, reloo = TRUE)
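
To put the two fits side by side, the loo objects can be passed to loo_compare() (a sketch; output not shown):

# compare predictive performance of the robit and logistic models
loo_compare(
  loo(robit_brm, reloo = TRUE),
  loo(logistic_brm, reloo = TRUE)
)
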
new_data <- expand.grid(
  height = seq(60, 75, by = 3),
  weight = seq(125, 225, by = 25)
)
bind_cols(
  new_data,
  data.frame(predict(logistic_brm, newdata = new_data, transform = ilogit))
)
## height weight Estimate Est.Error   Q2.5  Q97.5
##     60    125   0.5006    0.0121 0.5000 0.5000
##     63    125   0.5051    0.0341 0.5000 0.5000
##     66    125   0.5631    0.1030 0.5000 0.7311
##     69    125   0.6947    0.0841 0.5000 0.7311
##     72    125   0.7275    0.0283 0.7311 0.7311
##     75    125   0.7307    0.0097 0.7311 0.7311
##     60    150   0.5006    0.0115 0.5000 0.5000
##     63    150   0.5069    0.0394 0.5000 0.7311
##     66    150   0.5741    0.1079 0.5000 0.7311
##     69    150   0.7023    0.0763 0.5000 0.7311
##     72    150   0.7284    0.0246 0.7311 0.7311
##     75    150   0.7308    0.0073 0.7311 0.7311
##     60    175   0.5003    0.0082 0.5000 0.5000
##     63    175   0.5098    0.0466 0.5000 0.7311
##     66    175   0.5835    0.1110 0.5000 0.7311
##     69    175   0.7069    0.0707 0.5000 0.7311
##     72    175   0.7289    0.0221 0.7311 0.7311
##     75    175   0.7309    0.0052 0.7311 0.7311
##     60    200   0.5007    0.0126 0.5000 0.5000
##     63    200   0.5113    0.0499 0.5000 0.7311
##     66    200   0.5964    0.1139 0.5000 0.7311
##     69    200   0.7119    0.0638 0.5000 0.7311
##     72    200   0.7296    0.0186 0.7311 0.7311
##     75    200   0.7309    0.0052 0.7311 0.7311
##     60    225   0.5014    0.0178 0.5000 0.5000
##     63    225   0.5136    0.0543 0.5000 0.7311
##     66    225   0.6095    0.1154 0.5000 0.7311
##     69    225   0.7133    0.0615 0.5000 0.7311
##     72    225   0.7297    0.0175 0.7311 0.7311
##     75    225   0.7310    0.0037 0.7311 0.7311
marginal_effects(
  logistic_brm, effects = "height",
  conditions = data.frame(weight = seq(125, 225, by = 25)))