Goals for today

One of the big advantages of the Bayesian framework is its flexibility. Last time we saw how to fit models of the form

\[\begin{align*} y &\sim \sf{Norm}(\mu, \sigma) \\ \mu &= f(x; \beta) \\ \beta_i &\sim ?? \\ \sigma &\sim ?? \end{align*}\] where \(f(x; \mathbf{\beta}) = \beta_0 + \beta_1 x\).
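As a reminder, here is a minimal quap() sketch of that linear model, fit to the adult subset of the Howell data as in the previous notes (the object names are made up for this sketch; the priors are the same ones used for the models below):

# a minimal sketch of last time's linear model (adults only)
library(rethinking)
data(Howell1)
HowellAdults <- subset(Howell1, age >= 18)
# center weight so that beta_0 is the average height at the average weight
HowellAdults$weight_c <- HowellAdults$weight - mean(HowellAdults$weight)

m_linear <-
  quap(
    data = HowellAdults,
    alist(
      height ~ dnorm(mu, sigma),
      mu    <- beta_0 + beta_1 * weight_c,
      beta_0 ~ dnorm(178, 20),
      beta_1 ~ dlnorm(0, 1),
      sigma  ~ dunif(0, 50)
    )
  )
precis(m_linear)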

We can generalize this in several ways:

  1. The regression function \(f(x; \theta)\) doesn’t need to be a linear function.1

    \(f\) will still typically depend on a number of parameters (denoted collectively as \(\beta\) or \(\theta\), subscripted to get individual parameters) and our Bayesian engine will help us determine which values of these parameters are more or less plausible.

  2. We don’t need to use a normal distribution to describe the individual level variability.

    This is just another way of saying that we can use a different likelihood function.

  3. We can choose different priors for any of the parameters involved.

    We already know techniques for doing this, but we have more to learn about how to choose the priors and what difference those choices make.

  4. We can add additional predictors, including predictors that are not quantitative.

    So far, we have seen one simple example of this – a quantitative predictor (weight) plus a binary predictor coded as 0 or 1 (sex).

Today’s focus is on item 1 above – working with other functions. In particular, we will be looking at polynomial regression.

Polynomials

Let’s add the kids back into our Howell data.

library(rethinking)
library(ggformula)   # provides gf_point(), gf_ribbon(), etc.
library(dplyr)       # provides %>% and mutate()
data(Howell1)
gf_point(height ~ weight, data = Howell1, alpha = 0.7, shape = 21)

A line is not going to be a very good choice now since there is clearly a bend in the trend.

A constant standard deviation might also not be a good choice – there seems to be more variation for larger weights than for smaller weights.2 But for now we are focusing on the shape.

Notes

  1. Polynomials, especially high-degree polynomials, are usually not the best choice for the model function when the relationship between \(x\) and \(y\) is more complex than a simple line. But a low-degree polynomial (e.g., a quadratic) can sometimes work well.

  2. If there is some underlying mechanistic model explaining the phenomenon, it is better to reverse engineer that to create the model function than to simply grab a function because it is easy and happens to “fit well”.

  3. If we just need a function that is “wigglier”, there are often better options than polynomials; we’ll see one of these next time.

  4. Another way to improve a fit is to include additional predictors or to apply transformations to the variables. We will learn how to do that soon too.

We will look at polynomial regression first because it is relatively easy and does have some important applications, but it is not the only solution and often not the best one.

The Quadratic Model

The natural way to express a quadratic model would be

\[ f(x; \beta) = \beta_0 + \beta_1 x + \beta_2 x^2 \]

But we will almost always want to standardize our predictor variable when fitting a model like this. One reason is that when \(x\) is large, \(x^2\), \(x^3\), etc. will be enormous. Many numerical procedures work best when all the quantities involved are on roughly the same, moderate scale. The most common form of standardization is the z-score:

\[ \mbox{z-score} = \frac{\mbox{value} - \mbox{mean value}}{\mbox{standard deviation of values}} \]
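As a quick numerical illustration with made-up round numbers (these are not the actual summary statistics for these data):

# z-score of a hypothetical 51 kg person if the mean weight were 36 kg and the sd 15 kg
(51 - 36) / 15   # = 1, i.e., one standard deviation above the mean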

It is simplest to do this as a precalculation in the data – especially if we need to use the value in more than one place. We can make things a bit more efficient if we precompute all the squares of these standardized weights as well (else they will get recalculated for each iteration of the numerical algorithm used to estimate the posterior).

Howell <- 
  Howell1 %>%
  mutate(
    weight_s = (weight - mean(weight)) / sd(weight),
    weight_s2 = weight_s^2
  )
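A quick sanity check (not part of the original code) confirms that the standardized weights behave as advertised:

# standardized weights should have mean 0 and standard deviation 1
mean(Howell$weight_s)   # essentially 0 (up to floating point error)
sd(Howell$weight_s)     # exactly 1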

m4.5 <-
  quap(
    data = Howell,
    alist(
      height ~ dnorm(mu, sigma),                                  # likelihood
      mu    <- beta_0 + beta_1 * weight_s + beta_2 * weight_s2,   # quadratic mean function
      beta_0 ~ dnorm(178, 20),                                    # prior: mean height at average weight
      beta_1 ~ dlnorm(0, 1),                                      # log-normal keeps the linear coefficient positive
      beta_2 ~ dnorm(0, 1),
      sigma  ~ dunif(0, 50)
    )
  )
# a grid of standardized weights at which to compute model predictions
Weights <-
  tibble(
    weight_s = seq(-3, 3, by = 0.1),
    weight_s2 = weight_s^2
  )

Link4.5 <- 
  link(m4.5, data = Weights)   # posterior samples of the average height (mu) at each weight
Sim4.5 <- 
  sim(m4.5, data = Weights)    # posterior predictive samples of individual heights

# mean_hdi() is from the ggdist / tidybayes packages
Avg4.5 <- 
  apply(Link4.5, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)

Ind4.5 <- 
  apply(Sim4.5, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)

p4.5s <-
  gf_ribbon(ymin + ymax ~ weight_s, data = Ind4.5, fill = "blue")  %>%   # individual heights (posterior predictive)
  gf_ribbon(ymin + ymax ~ weight_s, data = Avg4.5, fill = "red") %>%     # average height (posterior for mu)
  gf_point(
    title = "degree: 2",
    height ~ weight_s, data = Howell, inherit = FALSE,
    shape = 21, alpha = 0.4)
p4.5s

Plotting on the natural scale

Weights <-
  Weights %>% 
  mutate(
    # undo the standardization: weight = mean weight + z-score * sd of weight
    weight = mean(Howell$weight) + weight_s * sd(Howell$weight)
  )

Avg4.5 <- 
  apply(Link4.5, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)

Ind4.5 <- 
  apply(Sim4.5, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)

p4.5 <-
  gf_ribbon(ymin + ymax ~ weight, data = Ind4.5, fill = "blue")  %>%
  gf_ribbon(ymin + ymax ~ weight, data = Avg4.5, fill = "red") %>%
  gf_point(
    title = "degree: 2",
    height ~ weight, data = Howell, inherit = FALSE,
    shape = 21, alpha = 0.4)
p4.5

This is clearly not a perfect model, but it is an improvement over a line.

More polynomials

Howell <- 
  Howell %>%
  mutate(
    weight_s3 = weight_s^3,
    weight_s4 = weight_s^4
  )
Weights <-
  Weights %>% 
  mutate(
    weight_s3 = weight_s^3,
    weight_s4 = weight_s^4,
    weight = mean(Howell$weight) + weight_s * sd(Howell$weight)
  )
m4.6 <-
  quap(
    data = Howell,
    alist(
      height ~ dnorm(mu, sigma),
      mu    <- beta_0 + beta_1 * weight_s + 
                        beta_2 * weight_s2 + 
                        beta_3 * weight_s3,
      beta_0 ~ dnorm(178, 20),
      beta_1 ~ dlnorm(0, 1),
      beta_2 ~ dnorm(0, 1),
      beta_3 ~ dnorm(0, 1),
      sigma  ~ dunif(0, 50)
    )
  )


Link4.6 <- link(m4.6, data = Weights)
Sim4.6 <- sim(m4.6, data = Weights)
Avg4.6 <- 
  apply(Link4.6, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)
Ind4.6 <- 
  apply(Sim4.6, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)
p4.6 <-
  gf_ribbon(ymin + ymax ~ weight, data = Ind4.6, fill = "blue")  %>%
  gf_ribbon(ymin + ymax ~ weight, data = Avg4.6, fill = "red") %>%
  gf_point(
    title = "degree: 3",
    height ~ weight, data = Howell, inherit = FALSE,
    shape = 21, alpha = 0.4)   
m4.6a <-
  quap(
    data = Howell,
    alist(
      height ~ dnorm(mu, sigma),
      mu    <- beta_0 + beta_1 * weight_s + 
                        beta_2 * weight_s2 + 
                        beta_3 * weight_s3 + 
                        beta_4 * weight_s4,
      beta_0 ~ dnorm(178, 20),
      beta_1 ~ dlnorm(0, 1),
      beta_2 ~ dnorm(0, 1),
      beta_3 ~ dnorm(0, 1),
      beta_4 ~ dnorm(0, 1),
      sigma  ~ dunif(0, 50)
    )
  )
Link4.6a <- link(m4.6a, data = Weights)
Sim4.6a <- sim(m4.6a, data = Weights)
Avg4.6a <- 
  apply(Link4.6a, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)
Ind4.6a <- 
  apply(Sim4.6a, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)
p4.6a <- 
  gf_ribbon(ymin + ymax ~ weight, data = Ind4.6a, fill = "blue")  %>%
  gf_ribbon(ymin + ymax ~ weight, data = Avg4.6a, fill = "red") %>%
  gf_point(
    title = "degree: 4",
    height ~ weight, data = Howell, inherit = FALSE,
    shape = 21, alpha = 0.4) 
m4.6l <-
  quap(
    data = Howell,
    alist(
      height ~ dnorm(mu, sigma),
      mu    <- beta_0 + beta_1 * weight_s,  
      beta_0 ~ dnorm(178, 20),
      beta_1 ~ dlnorm(0, 1),
      sigma  ~ dunif(0, 50)
    )
  )
Link4.6l <- link(m4.6l, data = Weights)
Sim4.6l <- sim(m4.6l, data = Weights)
Avg4.6l <- 
  apply(Link4.6l, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)
Ind4.6l <- 
  apply(Sim4.6l, 2, mean_hdi) %>% 
  bind_rows() %>%
  bind_cols(Weights)
p4.6l <-
  gf_ribbon(ymin + ymax ~ weight, data = Ind4.6l, fill = "blue")  %>%
  gf_ribbon(ymin + ymax ~ weight, data = Avg4.6l, fill = "red") %>%
  gf_point(
    title = "degree: 1",
    height ~ weight, data = Howell, inherit = FALSE,
    shape = 21, alpha = 0.4) 
# the | and / operators for arranging plots come from the patchwork package
( p4.6l | p4.5 ) / (p4.6 | p4.6a)

Inspecting posterior distributions of parameters

precis(m4.6l) %>% pander()

|        |  mean |     sd |  5.5% | 94.5% |
|:-------|------:|-------:|------:|------:|
| beta_0 | 138.3 | 0.4006 | 137.6 | 138.9 |
| beta_1 | 25.94 | 0.4012 |  25.3 | 26.58 |
| sigma  | 9.346 | 0.2833 | 8.893 | 9.799 |

precis(m4.5) %>% pander()

|        |   mean |     sd |   5.5% |  94.5% |
|:-------|-------:|-------:|-------:|-------:|
| beta_0 |  146.1 |  0.369 |  145.5 |  146.6 |
| beta_1 |  21.73 | 0.2889 |  21.27 |  22.19 |
| beta_2 | -7.803 | 0.2742 | -8.242 | -7.365 |
| sigma  |  5.775 | 0.1765 |  5.493 |  6.057 |

precis(m4.6) %>% pander()

|        |   mean |     sd |   5.5% |  94.5% |
|:-------|-------:|-------:|-------:|-------:|
| beta_0 |  146.4 |   0.31 |  145.9 |  146.9 |
| beta_1 |  15.22 | 0.4763 |  14.46 |  15.98 |
| beta_2 | -6.205 | 0.2572 | -6.616 | -5.794 |
| beta_3 |  3.584 | 0.2288 |  3.218 |  3.949 |
| sigma  |   4.83 |  0.147 |  4.595 |  5.065 |

precis(m4.6a) %>% pander()

|        |    mean |     sd |   5.5% |   94.5% |
|:-------|--------:|-------:|-------:|--------:|
| beta_0 |   145.5 | 0.3541 |    145 |   146.1 |
| beta_1 |   16.18 | 0.5143 |  15.36 |      17 |
| beta_2 |  -3.758 |  0.551 | -4.639 |  -2.878 |
| beta_3 |   2.776 |   0.28 |  2.329 |   3.224 |
| beta_4 | -0.9843 | 0.1958 | -1.297 | -0.6713 |
| sigma  |    4.84 |  0.148 |  4.603 |   5.077 |
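For a side-by-side comparison of the estimates across all four models, rethinking's coeftab() is convenient (a small addition to the code above):

# tabulate the posterior means of the parameters across the four models
coeftab(m4.6l, m4.5, m4.6, m4.6a)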
Post4.6l <- m4.6l %>% extract.samples(1e4)
Post4.5 <- m4.5 %>% extract.samples(1e4)
Post4.6 <- m4.6 %>% extract.samples(1e4)
Post4.6a <- m4.6a %>% extract.samples(1e4)
library(GGally)
ggpairs(Post4.6l, lower = list(continuous = "density"))

ggpairs(Post4.5, lower = list(continuous = "density"))

ggpairs(Post4.6, lower = list(continuous = "density"))

ggpairs(Post4.6a, lower = list(continuous = "density"))
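If you prefer a numeric summary of the pairwise relationships in these pairs plots, you can compute correlations directly from the posterior samples (a small addition; the as.data.frame() call is just a precaution in case the samples are not already stored as a data frame):

# posterior correlations among the parameters of the quadratic model
round(cor(as.data.frame(Post4.5)), 2)

Strong correlations among polynomial coefficients are common; centering and standardizing the predictor, as we did above, helps keep them in check.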

Notes


  1. Note: the word “linear” gets used in more than one way in regression. \(f(x) = \beta_0 + \beta_1 x\) is linear in \(x\), and that’s probably what you think of when you think of linear. But \(g(x) = \beta_0 + \beta_1 x + \beta_2 \log(x)\) is linear in the \(\beta\)’s, and that is usually more important. So a model with regression function \(g\) is also considered a linear model. This can be seen another way using linear algebra: \(g(X) = X \beta\) where \(X\) is a 3-column matrix with a column of 1’s, a column of values of \(x\), and a column of values of \(\log(x)\), and \(\beta\) is a 1-column matrix containing \(\beta_0\), \(\beta_1\), and \(\beta_2\).↩︎

  2. Be careful here. Two things can deceive your eye. (1) Where there is more data, there are more chances to get values farther from average. This is not additional variability. (2) Variability always looks more pronounced where the trend is flatter and less pronounced where the trend is steeper. In this case, we have more data where the trend is flatter, so both factors are going to make the differences in variability look larger than they are.↩︎