Multiple Regression

Continuous Interaction: Tulip Time

The previous example was particularly simple because there were only two possible values for one of the our predictors. We would like to generalize the concept of interaction to include

more than two possible values for a categorical predictor
numerical predictors

Look at the tulips data, which were adapted from @grafenModernStatisticsLife2002.

Additive models

Prep the data

data(tulips, package = "rethinking")
Tulips <-
  tulips %>% 
  mutate(
    blooms_std = blooms / max(blooms),
    water_cent = water - mean(water),
    shade_cent = shade - mean(shade))

With the variables in hand, the basic model is

\[ B = f(W, S) \;, \] where \(B\) = blooms, \(W\) = water, \(S\) = shade, and \(f(\cdot)\) indicates a function. We can also express this as \(W \rightarrow B \leftarrow S\). Neither expression clarifies whether the effects of \(W\) and \(S\) are additive or conditional on each other in some way. We might express an unconditional (additive) model as

\[\begin{align*} \text{blooms_std}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha + \beta_1 \text{water_cent}_i + \beta_2 \text{shade_cent}_i \\ \alpha & \sim \operatorname{Normal}(0.5, 1) \\ \beta_1 & \sim \operatorname{Normal}(0, 1) \\ \beta_2 & \sim \operatorname{Normal}(0, 1) \\ \sigma & \sim \operatorname{Exponential}(1), \end{align*}\]

where \(\text{water_cent}_i = \left (\text{water}_i - \overline{\text{water}} \right )\) and \(\text{shade_cent}_i = \left (\text{shade}_i - \overline{\text{shade}} \right )\).

Improving the prior

Even though “the intercept \(\alpha\) must be greater than zero and less than one, … this prior assigns most of the probability outside that range” (p. 254).

If we want 95% of a normal distribution to live on the interval \([0,1]\), then we could choose a mean of 0.5 and a standard deviation of 0.25.

pnorm(1, 0.5, 1) - pnorm(0, 0.5, 1)

## [1] 0.3829249

pnorm(1, 0.5, 0.25) - pnorm(0, 0.5, 0.25)

## [1] 0.9544997

Our two predictors take on values 1, 2, or 3 with equal probability. These are rescaled to take values -1, 0, or 1.

gf_histogram( ~ water, data = Tulips)

gf_histogram( ~ shade, data = Tulips)

range(Tulips$water_cent)

## [1] -1  1

range(Tulips$shade_cent)

## [1] -1  1

Although the experiment only used 3 values for each of these, it is possible to supply intermediate amounts of water or shade, and our model will be able to make predictions for those if we treat the predictors as numerical variables.

For priors on the slopes, we can think about a very strong effect of one of these preditors. A slope of 0.5 would move our predicted mean from the smallest observed blooms_std to the largest. Slopes larger than that would be quite extreme. So again, a standard deviation of 0.25 seems reasonable.

\[\begin{align*} \text{blooms_std}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha + \beta_1 \text{water_cent}_i + \beta_2 \text{shade_cent}_i \\ \alpha & \sim \operatorname{Normal}(0.5, 0.25) \\ \beta_1 & \sim \operatorname{Normal}(0, 0.25) \\ \beta_2 & \sim \operatorname{Normal}(0, 0.25) \\ \sigma & \sim \operatorname{Exponential}(1). \end{align*}\]

u8.4 <-
  ulam(
    data = Tulips,
    alist(
      blooms_std ~ dnorm(mu, sigma),
      mu <- a + bw * water_cent + bs * shade_cent,
      a ~ dnorm(0, 0.25),
      c(bw, bs) ~ dnorm(0, 0.25),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 8,
    file = "fits/u8.4")

Check the model summary.

precis(u8.4)

##             mean         sd       5.5%       94.5%    n_eff     Rhat4
## a      0.3503720 0.03502050  0.2933151  0.40635720 4694.069 0.9998423
## bs    -0.1119084 0.04181169 -0.1783555 -0.04543695 4475.947 1.0000046
## bw     0.2036052 0.04093695  0.1370650  0.26923754 4425.063 0.9994021
## sigma  0.1766912 0.02733281  0.1392921  0.22430748 3558.581 0.9996895

stanfit(u8.4) %>% mcmc_areas(pars = vars(-lp__))

A model with interaction

So what does interaction mean here? We want to modify

\[ \mu_i = \beta_0 + \beta_W W_i + \beta_S S_i \\ \] to express that the effect of water on blossoms (\(\beta_W\)) should depend (linearly for now) on the amount of shade (\(S\)). But we know how to express something that depends (linearly) on the amount of shade: \(\beta_0 + \beta_S S\). So let’s do a little renaming and insert that into our model:

\[\begin{align*} \mu_i &= \beta_0 + (\beta_W + \beta_{SW} S_i) W_i + \beta_S S_i \\ &= \beta_0 + \beta_W W_i + \beta_{SW} S_i W_i + \beta_S S_i \\ &= \beta_0 + \beta_W W_i + (\beta_S + \beta_{SW} W_i) S_i \end{align*}\]

The algebraic rearrangement (a) explains our choice of subscript notation,
(b) shows the form in which this sort of interaction model is often predicted, (c) makes it clear that this is symmetric in \(S\) and \(W\), and (d) helps us understand what the parameters represent and how we might choose priors.

Note: McElreath also introduces another notation:

\(\gamma_{W,i} = \beta_S + \beta_{SW} S_i\)
\(\gamma_{S,i} = \beta_W + \beta_{SW} W_i\)

These are the parenthesized expressions in the equations above.

Using the \(\gamma\) notation, we can express an interaction between water_cent and shade_cent by

\[\begin{align*} \mu_i & = \alpha + \color{#009E73}{\gamma_{1, i}} \text{water_cent}_i + \beta_2 \text{shade_cent}_i \\ \color{#009E73}{\gamma_{1, i}} & \color{#009E73}= \color{#009E73}{\beta_1 + \beta_3 \text{shade_cent}_i}, \end{align*}\]

where both \(\mu_i\) and \(\gamma_{1, i}\) get a linear model. We could do the converse by switching the positions of water_cent and shade_cent. If we substitute the equation for \(\gamma_{1, i}\) into the equation for \(\mu_i\), we get

\[\begin{align*} \mu_i & = \alpha + \color{#009E73}{\underbrace{(\beta_1 + \beta_3 \text{shade_cent}_i)}_{\gamma_{1, i}}} \text{water_cent}_i + \beta_2 \text{shade_cent}_i \\ & = \alpha + \color{#009E73}{\beta_1} \text{water_cent}_i + (\color{#009E73}{\beta_3 \text{shade_cent}_i} \cdot \text{water_cent}_i) + \beta_2 \text{shade_cent}_i \\ & = \alpha + \color{#009E73}{\beta_1} \text{water_cent}_i + \beta_2 \text{shade_cent}_i + \color{#009E73}{\beta_3} (\color{#009E73}{\text{shade_cent}_i} \cdot \text{water_cent}_i), \end{align*}\]

where \(\beta_3\) is the interaction term which makes water_cent and shade_cent conditional on each other. If we use the same priors as before, we might write the full equation for our interaction model as

\[\begin{align*} \text{blooms_std}_i & \sim \operatorname{Normal}(\mu_i, \sigma) \\ \mu_i & = \alpha + \color{#009E73}{\beta_1} \text{water_cent}_i + \beta_2 \text{shade_cent}_i + \color{#009E73}{\beta_3 \text{shade_cent}_i} \cdot \text{water_cent}_i\\ \beta_0 & \sim \operatorname{Normal}(0.5, 0.25) \\ \beta_S & \sim \operatorname{Normal}(0, 0.25) \\ \beta_W & \sim \operatorname{Normal}(0, 0.25) \\ \beta_{SW} & \sim \operatorname{Normal}(0, 0.25) \\ \sigma & \sim \operatorname{Exponential}(1). \end{align*}\]

Fit the model.

u8.5 <-
  ulam(
    data = Tulips,
    alist(
      blooms_std ~ dnorm(mu, sigma),
      mu <- b0 + bw * water_cent + bs * shade_cent + bws * water_cent * shade_cent,
      b0 ~ dnorm(0, 0.25),
      c(bw, bs, bws) ~ dnorm(0, 0.25),
      sigma ~ dexp(1)
    ),
    iter = 2000, warmup = 1000, chains = 4, cores = 4,
    seed = 8,
    file = "fits/u8.5")

precis(u8.5)

##             mean         sd       5.5%       94.5%    n_eff     Rhat4
## b0     0.3522605 0.02780488  0.3073529  0.39693740 4444.975 1.0006375
## bws   -0.1423041 0.04244937 -0.2094041 -0.07486695 4279.999 1.0003200
## bs    -0.1122193 0.03474999 -0.1671495 -0.05603002 4351.081 0.9993499
## bw     0.2060801 0.03371968  0.1520878  0.25853715 4675.903 1.0014986
## sigma  0.1433543 0.02316433  0.1117857  0.18362052 3683.029 1.0003516

stanfit(u8.5) %>% mcmc_areas(pars = vars(-lp__))

Plotting posterior predictions

Making a plot like figure 8.7 takes a little thinking to get it right. Let’s start by taking a look at the plot, and then figure out how to make it.

# this function converts a matrix to a 3-column data frame
#   think: row, column, value

mat2df <- function(m, rname = "row", cname = "col", vname = "value") {
  
  if (is.null(colnames(m))) {
    colnames(m) <- 1:ncol(m)
  }
  
  if (is.null(rownames(m))) {
    rownames(m) <- 1:nrow(m)
  }
  
  m %>% 
    as_tibble() %>%
    tibble::rownames_to_column(var = rname) %>%
    pivot_longer(! any_of(rname), names_to = cname, values_to = vname)
}

# example use
M <- matrix(1:12, nrow = 3)
M

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

M %>% mat2df(rname = "R", cname = "C", vname = "number")

## # A tibble: 12 x 3
##    R     C     number
##    <chr> <chr>  <int>
##  1 1     1          1
##  2 1     2          4
##  3 1     3          7
##  4 1     4         10
##  5 2     1          2
##  6 2     2          5
##  7 2     3          8
##  8 2     4         11
##  9 3     1          3
## 10 3     2          6
## 11 3     3          9
## 12 3     4         12

CFD <-
  expand_grid(
    shade_cent = -1:1,
    water_cent = -1:1
  )  %>%
  mutate(scenario = as.character(1:n()))
  
L8.5 <-
  link(u8.5, data = CFD, n = 20)

# a bug? -- we are getting 4000 rows but we asked for 20
str(L8.5)

##  num [1:4000, 1:9] 0.142 0.158 0.1 0.193 0.1 ...

# but it is easy enough to grab 20 randomly
p8.5 <-
  L8.5[sample(1:4000, 20), ] %>%
  mat2df(cname = "scenario", vname = "predicted_blooms_std") %>%
  left_join(CFD) %>%
  gf_line(predicted_blooms_std ~ shade_cent, group = ~row,
          color = "steelblue", alpha = 0.7) %>%
  gf_facet_grid( ~ water_cent, labeller = label_both) %>%
  gf_labs(
    title = "Model 8.5",
    subtitle = "posterior"
  )

## Joining, by = "scenario"

L8.4 <-
  link(u8.4, data = CFD, n = 20)

p8.4 <- 
  L8.4[sample(1:4000, 20), ] %>%
  mat2df(cname = "scenario", vname = "predicted_blooms_std") %>%
  left_join(CFD) %>%
  gf_line(predicted_blooms_std ~ shade_cent, group = ~row,
          color = "steelblue", alpha = 0.7) %>%
  gf_facet_grid( ~ water_cent, labeller = label_both) %>%
  gf_labs(
    title = "Model 8.4",
    subtitle = "posterior"
  )

## Joining, by = "scenario"

p8.4 / p8.5

Plotting prior predictions

We can make the same sort of plots from the priors as well – and maybe should have done so earlier. We just provide link() with a matrix of prior samples.

Unfortunately, I haven’t yet figured out a more efficient way to generate prior samples from an ulam object. This way requires that the Stan model be recompiled, which is unfortunate. I’m guessing this is because ulam() hard codes in the number of rows in the data set (see Stan code below), so we can’t just send in empty data and resample. [You can do this in native Stan, but they you need to pass in the sample size as well as the data so that sample size can be used to declare the relevant variables.]

L8.5prior <-
  link(u8.5, data = CFD, post = extract.prior(u8.5))

## Trying to compile a simple C file

## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
## 
## SAMPLING FOR MODEL 'f3f053baef8ef4204d9c3c804be8bc80' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 2.6e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.26 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.042763 seconds (Warm-up)
## Chain 1:                0.033694 seconds (Sampling)
## Chain 1:                0.076457 seconds (Total)
## Chain 1:

L8.4prior <-
  link(u8.4, data = CFD, post = extract.prior(u8.4))

## Trying to compile a simple C file

## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/"  -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DBOOST_NO_AUTO_PTR  -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
## 
## SAMPLING FOR MODEL 'eb86daf23abc19774302bed503fea070' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 2e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.032866 seconds (Warm-up)
## Chain 1:                0.024415 seconds (Sampling)
## Chain 1:                0.057281 seconds (Total)
## Chain 1:

p8.5prior <-
  L8.5prior[sample(1:1000, 20), ] %>%
  mat2df(cname = "scenario", vname = "predicted_blooms_std") %>%
  left_join(CFD) %>%
  gf_line(predicted_blooms_std ~ shade_cent,  group = ~row,
          color = "steelblue", alpha = 0.7) %>%
  gf_facet_grid( ~ water_cent, labeller = label_both) %>%
  gf_labs(
    title = "Model 8.5",
    subtitle = "prior"
  )

## Joining, by = "scenario"

p8.4prior <- 
  L8.4prior[ sample(1:1000, 20), ] %>% 
  mat2df(cname = "scenario", vname = "predicted_blooms_std") %>%
  left_join(CFD) %>%
  gf_line(predicted_blooms_std ~ shade_cent, group = ~row,
          color = "steelblue", alpha = 0.7) %>%
  gf_facet_grid( ~ water_cent, labeller = label_both) %>%
  gf_labs(
    title = "Model 8.4",
    subtitle = "prior"
  )

## Joining, by = "scenario"

p8.4prior / p8.5prior

What can we say about these priors, overall? They are harmless, but only weakly realistic. Most of the lines stay within the valid outcome space. But silly trends are not rare. We could do better. We could also do a lot worse, such as flat priors which would consider plausible that even a tiny increase in shade would kill all the tulips. If you displayed these priors to your colleagues, a reasonable summary might be, “These priors contain no bias towards positive or negative effects, and at the same time they very weakly bound the effects to realistic ranges.” (p. 260)

Using quap() to get ulam() priors?

I need to do some mroe testing to make sure this does what I think it does, but it seems like this should work:

P <- quap(data = Tulips, u8.5@formula) %>% extract.prior()
L8.5prior2 <-
  link(u8.5, post = P, data = CFD)

p8.4prior / 
L8.5prior2[sample(1:1000, 20), ] %>%
  mat2df(cname = "scenario", vname = "predicted_blooms_std") %>%
  left_join(CFD) %>%
  gf_line(predicted_blooms_std ~ shade_cent,  group = ~row,
          color = "steelblue", alpha = 0.7) %>%
  gf_facet_grid( ~ water_cent, labeller = label_both) %>%
  gf_labs(
    title = "Model 8.5",
    subtitle = "prior via quap()"
  )

## Joining, by = "scenario"

Multiple Regression

Continuous Interaction

Stat 341

Goals for today