Categorical Predictors (with more than two levels)
Here is another multiple regression example, illustrating one more thing multiple regression can do for us.
We are collecting examples of the kinds of things that can happen in multiple regression models so that we are better prepared to build and interpret them.
Comparative primate milk composition data, from Table 2 of Hinde and Milligan (2011), Evolutionary Anthropology 20:9-23.
library(rethinking)
data(milk)
glimpse(milk)
## Rows: 29
## Columns: 8
## $ clade <fct> Strepsirrhine, Strepsirrhine, Strepsirrhine, Strepsirrh…
## $ species <fct> Eulemur fulvus, E macaco, E mongoz, E rubriventer, Lemu…
## $ kcal.per.g <dbl> 0.49, 0.51, 0.46, 0.48, 0.60, 0.47, 0.56, 0.89, 0.91, 0…
## $ perc.fat <dbl> 16.60, 19.27, 14.11, 14.91, 27.28, 21.22, 29.66, 53.41,…
## $ perc.protein <dbl> 15.42, 16.91, 16.85, 13.18, 19.50, 23.58, 23.46, 15.80,…
## $ perc.lactose <dbl> 67.98, 63.82, 69.04, 71.91, 53.22, 55.20, 46.88, 30.79,…
## $ mass <dbl> 1.95, 2.09, 2.51, 1.62, 2.19, 5.25, 5.37, 2.51, 0.71, 0…
## $ neocortex.perc <dbl> 55.16, NA, NA, NA, NA, 64.54, 64.54, 67.64, NA, 68.85, …
clade
: Broad taxonomic group
species
: Species name
kcal.per.g
: Kilocalories per gram of milk
perc.fat
: Percent fat
perc.protein
: Percent protein
perc.lactose
: Percent lactose
mass
: Body mass of mother, in kilograms
neocortex.perc
: Percent of brain mass that is neocortex
There are two standard ways to put a categorical predictor into a regression model: indicator variables and index variables.
Let \(x_i\) be one of \(1, 2, 3, \dots, k\), indicating which of \(k\) categories observation \(i\) belongs to, and consider a model with
\[ \mu_i = \mbox{other stuff} + \beta_1 x_i \] This is almost never the model we want. Do you see why? (The arbitrary numeric labels force the category means to be ordered and evenly spaced: categories 1, 2, and 3 would get means \(\mbox{other stuff} + \beta_1\), \(\mbox{other stuff} + 2\beta_1\), and \(\mbox{other stuff} + 3\beta_1\).) Here are two better options.
1. Use \(x_i\) as an index variable:
\[ \mu_i = \mbox{other stuff} + \beta_{x_i} \] where “other stuff” might be an intercept term or some parts of the model that depend on other predictors.
So if the first few rows of our data look like
tibble(x = c(1, 2, 1, 3)) %>% pander()
| x |
|---|
| 1 |
| 2 |
| 1 |
| 3 |
Then we have
\[\begin{align*} \mu_1 &= \mbox{other stuff} + \beta_1 \\ \mu_2 &= \mbox{other stuff} + \beta_2 \\ \mu_3 &= \mbox{other stuff} + \beta_1 \\ \mu_4 &= \mbox{other stuff} + \beta_3 \\ \end{align*}\]
2. Use \(k\) or \(k-1\) indicator variables:
Here’s how it looks with \(k\) indicator variables
\[
\mu_i = \mbox{other stuff}
+ \beta_1 [\![ x_i = 1 ]\!]
+ \beta_2 [\![ x_i = 2 ]\!]
+ \cdots
+ \beta_{k-1} [\![ x_i = k - 1 ]\!]
+ \beta_{k} [\![ x_i = k ]\!]
\] Again, if the first few values of x are 1, 2, 1, 3, then this looks like
\[\begin{align*} \mu_1 &= \mbox{other stuff} + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 \\ \mu_2 &= \mbox{other stuff} + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \cdot 0 \\ \mu_3 &= \mbox{other stuff} + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 \\ \mu_4 &= \mbox{other stuff} + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 1 \\ \end{align*}\]
But this is essentially the same as before, just clunkier to work with:
\[\begin{align*} \mu_1 &= \mbox{other stuff} + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 = \mbox{other stuff} + \beta_1 \\ \mu_2 &= \mbox{other stuff} + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \cdot 0 = \mbox{other stuff} + \beta_2 \\ \mu_3 &= \mbox{other stuff} + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 = \mbox{other stuff} + \beta_1 \\ \mu_4 &= \mbox{other stuff} + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 1 = \mbox{other stuff} + \beta_3 \\ \end{align*}\]
If we drop one of the indicator variables, then “other stuff” captures the mean for one of the groups and the \(\beta\)’s give offsets from that baseline.
Option 2 (with \(k-1\) indicators when there is an intercept in the model) is what most non-Bayesian statistical packages do, including lm() in R. The first option will usually be more convenient for us.
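As a quick illustration of option 2, here is the \(k-1\) indicator coding that R builds behind the scenes; a minimal sketch using base R's model.matrix() and lm() (the first factor level, Ape, becomes the baseline):
# R's default treatment coding: an intercept plus k - 1 indicator columns;
# the first clade level (Ape) is absorbed into the intercept.
head(model.matrix(~ clade, data = milk))

# lm() uses this coding, so each clade coefficient is an offset from the Ape mean.
coef(lm(kcal.per.g ~ clade, data = milk))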
gf_jitter(kcal.per.g ~ clade, data = milk, height = 0, width = 0.1)
Milk <-
milk %>%
mutate(
K = standardize(kcal.per.g),
clade_id = as.numeric(clade) # this works because clade is a factor
)
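Since as.numeric() numbers the clades in the order of the factor levels, we can check which clade gets which clade_id:
levels(Milk$clade) # clade_id 1, 2, 3, 4 follow this order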
Milk %>% select(species, matches("clade")) %>% sample_n(5)
## species clade clade_id
## 1 Miopithecus talpoin Old World Monkey 3
## 2 G gorilla beringei Ape 1
## 3 Pongo pygmaeus Ape 1
## 4 P troglodytes Ape 1
## 5 Cebuella pygmaea New World Monkey 2
m5.9 <- quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a[clade_id],
a[clade_id] ~ dnorm(0, 0.5),
sigma ~ dexp(1)
),
data = Milk
)
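Translating the quap() code into a mathematical model description:
\[\begin{align*} K_i &\sim \mbox{Norm}(\mu_i, \sigma) \\ \mu_i &= a_{x_i} \\ a_j &\sim \mbox{Norm}(0, 0.5) \\ \sigma &\sim \mbox{Exp}(1) \\ \end{align*}\]
where \(x_i\) is the clade index (clade_id) for row \(i\) and there is one \(a_j\) for each of the four clades.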
precis(m5.9, depth = 2)
## mean sd 5.5% 94.5%
## a[1] -0.4842974 0.21764249 -0.83213212 -0.1364626
## a[2] 0.3662576 0.21705886 0.01935567 0.7131596
## a[3] 0.6752168 0.25753427 0.26362735 1.0868063
## a[4] -0.5858850 0.27450663 -1.02459957 -0.1471703
## sigma 0.7196455 0.09653355 0.56536622 0.8739247
plot(precis(m5.9, depth = 2))
Using an index variable gives the posterior samples a nested (“hierarchical”) structure: the samples for a form a matrix with one column per clade.
Post5.9 <-
extract.samples(m5.9)
str(Post5.9)
## List of 2
## $ sigma: num [1:10000] 0.513 0.742 0.755 0.595 0.66 ...
## $ a : num [1:10000, 1:4] -0.557 -0.285 -0.209 -0.224 -0.358 ...
## - attr(*, "source")= chr "quap posterior: 10000 samples from m5.9"
Fortunately, it is easy to convert this into a data frame if we prefer that structure (for plotting, for example).
str(Post5.9 %>% data.frame())
## 'data.frame': 10000 obs. of 5 variables:
## $ sigma: num 0.513 0.742 0.755 0.595 0.66 ...
## $ a.1 : num -0.557 -0.285 -0.209 -0.224 -0.358 ...
## $ a.2 : num 0.0974 0.5181 0.4566 0.246 0.0369 ...
## $ a.3 : num 0.689 0.381 0.648 0.813 0.59 ...
## $ a.4 : num -0.8358 -0.7027 -0.0855 -0.4524 -0.4745 ...
precis(Post5.9, depth = 2)
## mean sd 5.5% 94.5% histogram
## sigma 0.7204276 0.09769931 0.56374050 0.8752755 ▁▁▂▇▇▃▁▁▁
## a[1] -0.4822450 0.21704692 -0.83152325 -0.1381971 ▁▁▁▅▇▅▂▁▁▁
## a[2] 0.3657909 0.21532564 0.02373309 0.7107818 ▁▁▁▃▇▇▂▁▁
## a[3] 0.6760220 0.25542134 0.26656987 1.0802727 ▁▁▁▂▅▇▅▂▁▁▁
## a[4] -0.5861043 0.27328384 -1.02943353 -0.1466456 ▁▁▂▃▇▇▃▂▁▁
precis(Post5.9 %>% data.frame())
## mean sd 5.5% 94.5% histogram
## sigma 0.7204276 0.09769931 0.56374050 0.8752755 ▁▁▂▇▇▃▁▁▁
## a.1 -0.4822450 0.21704692 -0.83152325 -0.1381971 ▁▁▁▅▇▅▂▁▁▁
## a.2 0.3657909 0.21532564 0.02373309 0.7107818 ▁▁▁▃▇▇▂▁▁
## a.3 0.6760220 0.25542134 0.26656987 1.0802727 ▁▁▁▂▅▇▅▂▁▁▁
## a.4 -0.5861043 0.27328384 -1.02943353 -0.1466456 ▁▁▂▃▇▇▃▂▁▁
GGally::ggpairs(Post5.9 %>% data.frame())
How can we use this model to compare two of the groups?
Post5.9 %>% data.frame() %>%
mutate(d12 = a.2 - a.1) %>%
gf_dens(~ d12)
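We can also summarize this contrast numerically; here is a minimal sketch computing the posterior mean of the difference and the posterior probability that it is positive:
Post5.9 %>% data.frame() %>%
  mutate(d12 = a.2 - a.1) %>%
  summarise(
    mean_d12 = mean(d12),        # posterior mean of the contrast
    prob_d12_pos = mean(d12 > 0) # posterior probability that clade 2 > clade 1
  )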
The same data also give us our next example, which uses two quantitative predictors: neocortex percent and (log) body mass. First we standardize the variables involved.
Milk <-
Milk %>%
mutate(
K = standardize(kcal.per.g), # we already did this, so this is just a reminder
N = rethinking::standardize(neocortex.perc),
M = standardize(log(mass))
)
# This fails.
m5.5_draft <-
quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a + b_N * N,
a ~ dnorm(0, 1),
b_N ~ dnorm(0, 1),
sigma ~ dexp(1)
),
data = Milk
)
## Error in quap(alist(K ~ dnorm(mu, sigma), mu <- a + b_N * N, a ~ dnorm(0, : initial value in 'vmmin' is not finite
## The start values for the parameters were invalid. This could be caused by missing values (NA) in the data or by start values outside the parameter constraints. If there are no NA values in the data, try using explicit start values.
# Here's why the previous chunk failed -- missing data!
df_stats( ~ neocortex.perc + N, data = Milk)
## response min Q1 median Q3 max
## 1 neocortex.perc 55.160000 64.5400000 68.8500000 71.2600000 76.300000
## 2 N -2.080196 -0.5086413 0.2134697 0.6172487 1.461666
## mean sd n missing
## 1 6.757588e+01 5.968612 17 12
## 2 -4.124815e-16 1.000000 17 12
# Let's grab just the rows that have no missing data.
MilkCC <- Milk %>% drop_na(K, N, M)
df_stats( ~ neocortex.perc + N, data = MilkCC)
## response min Q1 median Q3 max
## 1 neocortex.perc 55.160000 64.5400000 68.8500000 71.2600000 76.300000
## 2 N -2.080196 -0.5086413 0.2134697 0.6172487 1.461666
## mean sd n missing
## 1 6.757588e+01 5.968612 17 0
## 2 -4.124815e-16 1.000000 17 0
m5.5_draft <-
quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a + b_N * N,
a ~ dnorm(0, 1),
b_N ~ dnorm(0, 1),
sigma ~ dexp(1)
),
data = MilkCC
)
extract.prior(m5.5_draft, n = 100) %>%
glimpse()
## List of 3
## $ a : num [1:100(1d)] 0.525 -0.312 -1.093 0.757 0.287 ...
## $ b_N : num [1:100(1d)] -0.203 -0.502 1.123 0.415 0.517 ...
## $ sigma: num [1:100(1d)] 1.539 1.623 1.412 0.78 0.426 ...
## - attr(*, "source")= chr "quap prior: 100 samples from m5.5_draft"
Prior5.5_draft <-
extract.prior(m5.5_draft, n = 100) %>%
as.data.frame()
glimpse(Prior5.5_draft)
## Rows: 100
## Columns: 3
## $ a <dbl> 1.16838381, -1.73163977, 0.11542475, 0.71543959, -0.42762136, -2…
## $ b_N <dbl> -0.01452631, 1.38186109, 1.42463243, -0.48670868, 1.92322155, 0.…
## $ sigma <dbl> 1.13696236, 1.64561095, 0.80512597, 1.39450005, 1.01539775, 0.06…
gf_abline(slope = ~ b_N, intercept = ~ a, data = Prior5.5_draft, alpha = 0.6) %>%
gf_lims(x = c(-5, 5), y = c(-5, 5))
These priors are too vague: many of the prior regression lines are implausibly steep for standardized variables. Tightening the priors gives more reasonable prior regression lines.
m5.5 <-
quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a + bN * N,
a ~ dnorm(0, 0.2),
bN ~ dnorm(0, 0.5),
sigma ~ dexp(1)
),
data = MilkCC
)
Prior5.5 <-
extract.prior(m5.5, n = 100) %>%
as.data.frame()
glimpse(Prior5.5)
## Rows: 100
## Columns: 3
## $ a <dbl> -0.22454960, 0.07675617, 0.22894470, 0.02268500, 0.13165475, -0.…
## $ bN <dbl> 0.27573545, 0.22638737, -0.04033851, -0.35714949, -0.68082106, 0…
## $ sigma <dbl> 0.61456371, 1.75763217, 0.79299614, 0.20033932, 0.16160024, 0.41…
gf_abline(slope = ~ bN, intercept = ~ a, data = Prior5.5, alpha = 0.6) %>%
gf_lims(x = c(-3, 3), y = c(-3, 3))
precis(m5.5, digits = 3)
## mean sd 5.5% 94.5%
## a 0.03993977 0.1544866 -0.2069597 0.2868393
## bN 0.13323038 0.2237351 -0.2243415 0.4908022
## sigma 0.99975421 0.1646813 0.7365617 1.2629467
The posterior for bN is weakly positive, and its 89% interval comfortably includes zero:
extract.samples(m5.5) %>%
gf_dens(~ bN)
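To attach a number to this plot, we can compute the proportion of posterior samples in which the slope is positive (Post5.5 is a name introduced here for illustration):
Post5.5 <- extract.samples(m5.5) %>% data.frame()
mean(Post5.5$bN > 0) # posterior probability that bN is positive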
m5.6 <-
quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a + bM * M,
a ~ dnorm(0, 0.2),
bM ~ dnorm(0, 0.5),
sigma ~ dexp(1)
),
data = MilkCC
)
m5.7 <-
quap(
alist(
K ~ dnorm(mu, sigma),
mu <- a + bM * M + bN * N,
a ~ dnorm(0, 0.2),
bM ~ dnorm(0, 0.5),
bN ~ dnorm(0, 0.5),
sigma ~ dexp(1)
),
data = MilkCC
)
plot(coeftab(m5.5, m5.6, m5.7))
Note the masking pattern in the coefficient plot: bN and bM each sit near zero in the single-predictor models (m5.5 and m5.6), but both move away from zero, in opposite directions, when they appear together in m5.7.
So far we have seen two examples (spurious association and masking) where adding variables to the model helps us understand what is going on. Unfortunately, adding variables can also make things worse in some situations. Stay tuned for more examples.