Goals for today

  1. Categorical Predictors (with more than two levels)

    • Bigger picture: Always be sure you understand what the parameters in your model are telling you.
  2. Another multiple regression example (illustrating another thing multiple regression can do for us)

    • The example from last time illustrated a spurious association.
    • Today’s example will illustrate masking.

    We are collecting examples of the kinds of things that can happen in multiple regression models so that we are better prepared to build and interpret them.


Data set of the day: The milk data set

Source

Comparative primate milk composition data, from Table 2 of Hinde and Milligan. 2011. Evolutionary Anthropology 20:9-23.

Taking a look

library(rethinking)
data(milk)
glimpse(milk)
## Rows: 29
## Columns: 8
## $ clade          <fct> Strepsirrhine, Strepsirrhine, Strepsirrhine, Strepsirrh…
## $ species        <fct> Eulemur fulvus, E macaco, E mongoz, E rubriventer, Lemu…
## $ kcal.per.g     <dbl> 0.49, 0.51, 0.46, 0.48, 0.60, 0.47, 0.56, 0.89, 0.91, 0…
## $ perc.fat       <dbl> 16.60, 19.27, 14.11, 14.91, 27.28, 21.22, 29.66, 53.41,…
## $ perc.protein   <dbl> 15.42, 16.91, 16.85, 13.18, 19.50, 23.58, 23.46, 15.80,…
## $ perc.lactose   <dbl> 67.98, 63.82, 69.04, 71.91, 53.22, 55.20, 46.88, 30.79,…
## $ mass           <dbl> 1.95, 2.09, 2.51, 1.62, 2.19, 5.25, 5.37, 2.51, 0.71, 0…
## $ neocortex.perc <dbl> 55.16, NA, NA, NA, NA, 64.54, 64.54, 67.64, NA, 68.85, …

Variables

  • clade: Broad taxonomic group

  • species: Species name

  • kcal.per.g: Kilocalories per gram of milk

  • perc.fat: Percent fat

  • perc.protein: Percent protein

  • perc.lactose: Percent lactose

  • mass: Body mass of mother, in kilograms

  • neocortex.perc: Percent of brain mass that is neocortex

Categorical Predictors – 2 approaches

  1. indicator variables

  2. index variables
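As a quick preview of the difference, here is how each encoding might be built in R for the clade variable (a minimal sketch; the models below use the index approach):

# Index coding: one integer (1, 2, 3, 4) per category;
# this works because clade is a factor
head(as.numeric(milk$clade))

# Indicator (dummy) coding: one 0/1 column per category;
# model.matrix() is base R's tool for building these
head(model.matrix(~ clade - 1, data = milk))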

Categorical Predictors – More than 2 Levels

A bad idea

Let \(x_i\) be one of \(1, 2, 3, \dots, k\), indicating which of \(k\) categories observation \(i\) belongs to, and consider a model with

\[ \mu_i = \mbox{other stuff} + \beta_1 x_i \] This is almost never the model we want. Do you see why?
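Here is a hint. With \(k = 3\), this model says

\[\begin{align*} x_i = 1 &\implies \mu_i = \mbox{other stuff} + \beta_1 \\ x_i = 2 &\implies \mu_i = \mbox{other stuff} + 2 \beta_1 \\ x_i = 3 &\implies \mu_i = \mbox{other stuff} + 3 \beta_1 \\ \end{align*}\]

so the categories are forced to be ordered and evenly spaced: moving from one category to the next always changes the mean by exactly \(\beta_1\). That makes no sense for unordered categories like clade.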

Fixing the bad idea: two ways

1. That variable \(x_i\) can be used as an index variable:

\[ \mu_i = \mbox{other stuff} + \beta_{x_i} \] where “other stuff” might be an intercept term or some parts of the model that depend on other predictors.

So if the first few rows of our data look like

tibble(x = c(1, 2, 1, 3)) %>% pander()
 x 
---
 1 
 2 
 1 
 3 

Then we have

\[\begin{align*} \mu_1 &= \mbox{other stuff} + \beta_1 \\ \mu_2 &= \mbox{other stuff} + \beta_2 \\ \mu_3 &= \mbox{other stuff} + \beta_1 \\ \mu_4 &= \mbox{other stuff} + \beta_3 \\ \end{align*}\]

2. Use \(k\) or \(k-1\) indicator variables:

Here’s how it looks with \(k\) indicator variables

\[ \mu_i = \mbox{other stuff} + \beta_1 [\![ x_i = 1 ]\!] + \beta_2 [\![ x_i = 2 ]\!] + \cdots + \beta_{k-1} [\![ x_i = k - 1 ]\!] + \beta_{k} [\![ x_i = k ]\!] \] Again, if the first few values of x are 1, 2, 1, 3, then this looks like

\[\begin{align*} \mu_1 &= \mbox{other stuff} + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 \\ \mu_2 &= \mbox{other stuff} + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \cdot 0 \\ \mu_3 &= \mbox{other stuff} + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 \\ \mu_4 &= \mbox{other stuff} + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 1 \\ \end{align*}\]

But this is essentially the same as before – it’s just clunkier to work with.

\[\begin{align*} \mu_1 &= \mbox{other stuff} + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 = \mbox{other stuff} + \beta_1 \\ \mu_2 &= \mbox{other stuff} + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \cdot 0 = \mbox{other stuff} + \beta_2 \\ \mu_3 &= \mbox{other stuff} + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 = \mbox{other stuff} + \beta_1 \\ \mu_4 &= \mbox{other stuff} + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 1 = \mbox{other stuff} + \beta_3 \\ \end{align*}\]

If we drop one of the indicator variables, then “other stuff” captures the mean for one of the groups and the \(\beta\)’s give offsets from that baseline.

Option 2 (with \(k-1\) indicators when there is an intercept in the model) is what most non-Bayesian statistical packages do, including lm() in R. The first option (index variables) will usually be more convenient for us.
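For comparison, here is what lm() does with a factor predictor: the intercept is the mean for the baseline level (Ape, the first level of the clade factor), and the remaining coefficients are offsets from that baseline. A minimal sketch:

# lm() uses k - 1 indicators: the intercept is the baseline group's mean,
# and each remaining coefficient is an offset from that baseline
coef(lm(kcal.per.g ~ clade, data = milk))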

Milk: Energy by clade

gf_jitter(kcal.per.g ~ clade, data = milk, height = 0, width = 0.1)

Creating an index variable (and standardizing)

Milk <-
  milk %>% 
  mutate(
    K = standardize(kcal.per.g),
    clade_id = as.numeric(clade),  # this works because clade is a factor
  )
Milk %>% select(species, matches("clade")) %>% sample_n(5)
##               species            clade clade_id
## 1 Miopithecus talpoin Old World Monkey        3
## 2  G gorilla beringei              Ape        1
## 3      Pongo pygmaeus              Ape        1
## 4       P troglodytes              Ape        1
## 5    Cebuella pygmaea New World Monkey        2

Using an index variable

m5.9 <- quap(
  alist(
    K           ~ dnorm(mu, sigma),
    mu         <- a[clade_id],
    a[clade_id] ~ dnorm(0, 0.5),
    sigma       ~ dexp(1)
  ),
  data = Milk
)
precis(m5.9, depth = 2)
##             mean         sd        5.5%      94.5%
## a[1]  -0.4842974 0.21764249 -0.83213212 -0.1364626
## a[2]   0.3662576 0.21705886  0.01935567  0.7131596
## a[3]   0.6752168 0.25753427  0.26362735  1.0868063
## a[4]  -0.5858850 0.27450663 -1.02459957 -0.1471703
## sigma  0.7196455 0.09653355  0.56536622  0.8739247
plot(precis(m5.9, depth = 2))

Hierarchical Posterior

Using an index variable results in a hierarchical posterior.

Post5.9 <- 
  extract.samples(m5.9) 
str(Post5.9)
## List of 2
##  $ sigma: num [1:10000] 0.513 0.742 0.755 0.595 0.66 ...
##  $ a    : num [1:10000, 1:4] -0.557 -0.285 -0.209 -0.224 -0.358 ...
##  - attr(*, "source")= chr "quap posterior: 10000 samples from m5.9"

Fortunately, it is easy to convert this into a data frame if we prefer that structure (for plotting, for example).

str(Post5.9 %>% data.frame())
## 'data.frame':    10000 obs. of  5 variables:
##  $ sigma: num  0.513 0.742 0.755 0.595 0.66 ...
##  $ a.1  : num  -0.557 -0.285 -0.209 -0.224 -0.358 ...
##  $ a.2  : num  0.0974 0.5181 0.4566 0.246 0.0369 ...
##  $ a.3  : num  0.689 0.381 0.648 0.813 0.59 ...
##  $ a.4  : num  -0.8358 -0.7027 -0.0855 -0.4524 -0.4745 ...
precis(Post5.9, depth = 2)
##             mean         sd        5.5%      94.5%   histogram
## sigma  0.7204276 0.09769931  0.56374050  0.8752755   ▁▁▂▇▇▃▁▁▁
## a[1]  -0.4822450 0.21704692 -0.83152325 -0.1381971  ▁▁▁▅▇▅▂▁▁▁
## a[2]   0.3657909 0.21532564  0.02373309  0.7107818   ▁▁▁▃▇▇▂▁▁
## a[3]   0.6760220 0.25542134  0.26656987  1.0802727 ▁▁▁▂▅▇▅▂▁▁▁
## a[4]  -0.5861043 0.27328384 -1.02943353 -0.1466456  ▁▁▂▃▇▇▃▂▁▁
precis(Post5.9 %>% data.frame())
##             mean         sd        5.5%      94.5%   histogram
## sigma  0.7204276 0.09769931  0.56374050  0.8752755   ▁▁▂▇▇▃▁▁▁
## a.1   -0.4822450 0.21704692 -0.83152325 -0.1381971  ▁▁▁▅▇▅▂▁▁▁
## a.2    0.3657909 0.21532564  0.02373309  0.7107818   ▁▁▁▃▇▇▂▁▁
## a.3    0.6760220 0.25542134  0.26656987  1.0802727 ▁▁▁▂▅▇▅▂▁▁▁
## a.4   -0.5861043 0.27328384 -1.02943353 -0.1466456  ▁▁▂▃▇▇▃▂▁▁
GGally::ggpairs(Post5.9 %>% data.frame())
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Comparing 2 groups

How can we use this model to compare two of the groups?

Post5.9 %>% data.frame() %>%
  mutate(d12 = a.2 - a.1) %>%
  gf_dens(~ d12)
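If we want a numerical summary of the contrast in addition to the density plot, something like this should work (a sketch):

Post5.9 %>% data.frame() %>%
  mutate(d12 = a.2 - a.1) %>%   # clade 2 minus clade 1
  select(d12) %>%
  precis()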

Milk: Energy by brain size and body mass

Milk <- 
  Milk %>% 
  mutate(
    K = standardize(kcal.per.g),    # we already did this, so this is just a reminder
    N = rethinking::standardize(neocortex.perc),
    M = standardize(log(mass))
  )

A failed attempt

# This fails.
m5.5_draft <- 
  quap(
    alist(
      K ~ dnorm(mu, sigma),
      mu <- a + b_N * N,
      a ~ dnorm(0, 1),
      b_N ~ dnorm(0, 1),
      sigma ~ dexp(1)
    ),
    data = Milk
  )
## Error in quap(alist(K ~ dnorm(mu, sigma), mu <- a + b_N * N, a ~ dnorm(0, : initial value in 'vmmin' is not finite
## The start values for the parameters were invalid. This could be caused by missing values (NA) in the data or by start values outside the parameter constraints. If there are no NA values in the data, try using explicit start values.

Fixing the problem

# Here's why the previous chunk failed -- missing data!
df_stats( ~ neocortex.perc + N, data = Milk)
##         response       min         Q1     median         Q3       max
## 1 neocortex.perc 55.160000 64.5400000 68.8500000 71.2600000 76.300000
## 2              N -2.080196 -0.5086413  0.2134697  0.6172487  1.461666
##            mean       sd  n missing
## 1  6.757588e+01 5.968612 17      12
## 2 -4.124815e-16 1.000000 17      12
# Let's grab just the rows that have no missing data.
MilkCC <- Milk %>% drop_na(K, N, M)
df_stats( ~ neocortex.perc + N, data = MilkCC)
##         response       min         Q1     median         Q3       max
## 1 neocortex.perc 55.160000 64.5400000 68.8500000 71.2600000 76.300000
## 2              N -2.080196 -0.5086413  0.2134697  0.6172487  1.461666
##            mean       sd  n missing
## 1  6.757588e+01 5.968612 17       0
## 2 -4.124815e-16 1.000000 17       0
m5.5_draft <- 
  quap(
    alist(
      K ~ dnorm(mu, sigma),
      mu <- a + b_N * N,
      a ~ dnorm(0, 1),
      b_N ~ dnorm(0, 1),
      sigma ~ dexp(1)
    ),
    data = MilkCC
  )

Checking the prior

extract.prior(m5.5_draft, n = 100) %>%
  glimpse()
## List of 3
##  $ a    : num [1:100(1d)] 0.525 -0.312 -1.093 0.757 0.287 ...
##  $ b_N  : num [1:100(1d)] -0.203 -0.502 1.123 0.415 0.517 ...
##  $ sigma: num [1:100(1d)] 1.539 1.623 1.412 0.78 0.426 ...
##  - attr(*, "source")= chr "quap prior: 100 samples from m5.5_draft"
Prior5.5_draft <-
  extract.prior(m5.5_draft, n = 100) %>%
  as.data.frame()

glimpse(Prior5.5_draft)
## Rows: 100
## Columns: 3
## $ a     <dbl> 1.16838381, -1.73163977, 0.11542475, 0.71543959, -0.42762136, -2…
## $ b_N   <dbl> -0.01452631, 1.38186109, 1.42463243, -0.48670868, 1.92322155, 0.…
## $ sigma <dbl> 1.13696236, 1.64561095, 0.80512597, 1.39450005, 1.01539775, 0.06…
gf_abline(slope = ~ b_N, intercept = ~ a, data = Prior5.5_draft, alpha = 0.6) %>%
  gf_lims(x = c(-5, 5), y = c(-5, 5))

A better prior

m5.5 <- 
  quap(
    alist(
      K ~ dnorm(mu, sigma),
      mu <- a + bN * N,
      a  ~ dnorm(0, 0.2),
      bN ~ dnorm(0, 0.5),
      sigma ~ dexp(1)
    ),
    data = MilkCC
  )
Prior5.5 <-
  extract.prior(m5.5, n = 100) %>%
  as.data.frame()

glimpse(Prior5.5)
## Rows: 100
## Columns: 3
## $ a     <dbl> -0.22454960, 0.07675617, 0.22894470, 0.02268500, 0.13165475, -0.…
## $ bN    <dbl> 0.27573545, 0.22638737, -0.04033851, -0.35714949, -0.68082106, 0…
## $ sigma <dbl> 0.61456371, 1.75763217, 0.79299614, 0.20033932, 0.16160024, 0.41…
gf_abline(slope = ~ bN, intercept = ~ a, data = Prior5.5, alpha = 0.6) %>%
  gf_lims(x = c(-3, 3), y = c(-3, 3))

precis(m5.5, digits = 3)
##             mean        sd       5.5%     94.5%
## a     0.03993977 0.1544866 -0.2069597 0.2868393
## bN    0.13323038 0.2237351 -0.2243415 0.4908022
## sigma 0.99975421 0.1646813  0.7365617 1.2629467
extract.samples(m5.5) %>% 
  gf_dens(~ bN)
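To see what this slope (and its considerable uncertainty) implies on the outcome scale, we can plot the posterior mean line with an interval. A sketch using rethinking’s link():

# posterior for mu over a grid of (standardized) neocortex values
N_seq <- seq(from = -2.5, to = 2, length.out = 30)
mu <- link(m5.5, data = data.frame(N = N_seq))
mu_mean <- apply(mu, 2, mean)
mu_PI <- apply(mu, 2, PI)

plot(K ~ N, data = MilkCC)
lines(N_seq, mu_mean)
shade(mu_PI, N_seq)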

Energy and (log) body mass

m5.6 <- 
  quap(
    alist(
      K ~ dnorm(mu, sigma),
      mu <- a + bM * M,
      a ~ dnorm(0, 0.2),
      bM ~ dnorm(0, 0.5),
      sigma ~ dexp(1)
    ),
    data = MilkCC
  )
m5.7 <- 
  quap(
    alist(
      K ~ dnorm(mu, sigma),
      mu <- a + bM * M + bN * N,
      a ~ dnorm(0, 0.2),
      bM ~ dnorm(0, 0.5),
      bN ~ dnorm(0, 0.5),
      sigma ~ dexp(1)
    ),
    data = MilkCC
  )

Comparing the 3 models

plot(coeftab(m5.5, m5.6, m5.7))
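The pattern here is masking: N and M are positively correlated with each other but are associated with K in opposite directions, so each single-predictor model partly hides the other’s association. A quick check of the correlations (a sketch):

# N and M are strongly correlated with each other, which is what
# lets them mask one another in the single-predictor models
MilkCC %>% select(K, N, M) %>% cor() %>% round(2)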

More isn’t always merrier

So far we have seen two examples (spurious association and masking) where adding additional variables to the model helps us understand what is going on. Unfortunately, adding variables can also make things worse in some situations. Stay tuned for more examples.