One of the big advantages of the Bayesian framework is its flexibility. Last time we saw how to fit models of the form
\[\begin{align*} y &\sim \sf{Norm}(\mu, \sigma) \\ \mu &= f(x; \beta) \\ \beta_i &\sim ?? \\ \sigma &\sim ?? \end{align*}\] where \(f(x; \mathbf{\beta}) = \beta_0 + \beta_1 x\).
We can generalize this in several ways:
The regression function \(f(x; \theta)\) doesn’t need to be a linear function.1
\(f\) will still typically depend on a number of parameters (denoted collectively as \(\beta\) or \(\theta\), subscripted to get individual parameters) and our Bayesian engine will help us determine which values of these parameters are more or less plausible.
We don’t need to use a normal distribution to describe the individual level variability.
This is just another way of saying that we can use a different likelihood function.
We can choose different priors for any of the parameters involved.
We already know techniques for doing this, but we have more to learn about how to choose the priors and what difference those choices make.
We can add additional predictors, including predictors that are not quantitative.
So far, we have seen one simple example of this – a simple quantitative predictor (weight) plus a binary predictor coded as 0 or 1 (sex).
Today’s focus is on item 1 above – working with other functions. In particular, we will be looking at polynomial regression.
Let’s add the kids back into our Howell data.
library(rethinking)
library(ggformula)   # gf_point(), gf_ribbon()
library(dplyr)       # mutate(), bind_rows(), bind_cols(), %>%
library(tidybayes)   # mean_hdi()
data(Howell1)
gf_point(height ~ weight, data = Howell1, alpha = 0.7, shape = 21)
A line is not going to be a very good choice now since there is clearly a bend in the trend.
A constant standard deviation might also not be a good choice – there seems to be more variation for larger weights than for smaller weights.2 But for now we are focusing on the shape.
Polynomials, especially high-degree polynomials, are usually not the best choice for the model function when the relationship between \(x\) and \(y\) is more complex than a simple line. But a low-degree polynomial (a quadratic, say) can sometimes work well.
If there is some underlying mechanistic model explaining the phenomenon, it is better to reverse engineer that to create the model function than to simply grab a function because it is easy and happens to “fit well”.
If we just need a function that is “wigglier”, there are often better options than polynomials; we’ll see one of these next time.
Another way to improve a fit is to include additional predictors or to apply transformations to the variables. We will learn how to do that soon too.
We will look at polynomial regression first because it is relatively easy and does have some important applications, but it is not the only solution and often not the best one.
The natural way to express a quadratic model would be
\[ f(x; \beta) = \beta_0 + \beta_1 x + \beta_2 x^2 \]
But we will almost always want to standardize our predictor when fitting a model like this. One reason is that when \(x\) is large, \(x^2\), \(x^3\), etc. will be enormous. Many numerical procedures work best when all the quantities involved are on roughly the same, moderate scale. The most common form of standardization is the z-score:
\[ \mbox{z-score} = \frac{\mbox{value} - \mbox{mean value}}{\mbox{standard deviation of values}} \]
It is simplest to do this as a precalculation in the data – especially if we need to use the value in more than one place. We can make things a bit more efficient if we precompute all the squares of these standardized weights as well (otherwise they would get recalculated for each iteration of the numerical algorithm used to estimate the posterior).
Howell <-
Howell1 %>%
mutate(
weight_s = (weight - mean(weight)) / sd(weight),
weight_s2 = weight_s^2
)
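A quick sanity check: after standardizing, weight_s should have mean (essentially) 0 and standard deviation 1.
# standardized weights: mean should be ~0 (up to rounding), sd exactly 1
mean(Howell$weight_s)
sd(Howell$weight_s)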
m4.5 <-
quap(
data = Howell,
alist(
height ~ dnorm(mu, sigma),
mu <- beta_0 + beta_1 * weight_s + beta_2 * weight_s2,
beta_0 ~ dnorm(178, 20),
beta_1 ~ dlnorm(0, 1),
beta_2 ~ dnorm(0, 1),
sigma ~ dunif(0, 50)
)
)
Weights <-
tibble(
weight_s = seq(-3, 3, by = 0.1),
weight_s2 = weight_s^2)
Link4.5 <-
link(m4.5, data = Weights)
Sim4.5 <-
sim(m4.5, data = Weights)
Avg4.5 <-
apply(Link4.5, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
Ind4.5 <-
apply(Sim4.5, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
p4.5s <-
gf_ribbon(ymin + ymax ~ weight_s, data = Ind4.5, fill = "blue") %>%
gf_ribbon(ymin + ymax ~ weight_s, data = Avg4.5, fill = "red") %>%
gf_point(
title = "degree: 2",
height ~ weight_s, data = Howell, inherit = FALSE,
shape = 21, alpha = 0.4)
p4.5s
link() and sim() produce matrices with a row for each posterior sample and a column for each set of explanatory variables used. So when we run
Link4.5 <-
link(m4.5, data = Weights)
the resulting matrix has a column for each row of Weights.
dim(Link4.5)
## [1] 1000 61
dim(Weights)
## [1] 61 2
Typically we want to run some computation along each column of the matrix to obtain the posterior mean, HDI, mode, etc. After getting this into a 61-row data frame, we can add on Weights so we can identify which row is which.
Avg4.5 <-
apply(Link4.5, 2, mean_hdi) %>% # a list of 61 1-row data frames
bind_rows() %>% # a single data frame with 61 rows
bind_cols(Weights) # Weights also has 61 rows, so we can add it on the right
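As an aside, if all we want is the posterior mean of \(\mu\) at each weight (without an HDI), a plain column mean suffices; here is a quick sketch:
# posterior mean of mu at each weight: one value per column of Link4.5,
# i.e., one per row of Weights
mu_mean <- apply(Link4.5, 2, mean)
length(mu_mean)   # 61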
The plot above is on the standardized weight scale. It will be easier to read on the natural scale, so let’s undo the standardization (original value = mean + z-score * sd) and then recompute our summaries.
Weights <-
Weights %>%
mutate(
# convert standardized weight back to the original weight scale for plotting
weight = mean(Howell$weight) + weight_s * sd(Howell$weight)
)
Avg4.5 <-
apply(Link4.5, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
Ind4.5 <-
apply(Sim4.5, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
p4.5 <-
gf_ribbon(ymin + ymax ~ weight, data = Ind4.5, fill = "blue") %>%
gf_ribbon(ymin + ymax ~ weight, data = Avg4.5, fill = "red") %>%
gf_point(
title = "degree: 2",
height ~ weight, data = Howell, inherit = FALSE,
shape = 21, alpha = 0.4)
p4.5
This is clearly not a perfect model, but it is an improvement over a line. How much does the degree of the polynomial matter? Below we fit a cubic (degree 3) and a quartic (degree 4) model, and refit the linear (degree 1) model for comparison.
Howell <-
Howell %>%
mutate(
weight_s3 = weight_s^3,
weight_s4 = weight_s^4
)
Weights <-
Weights %>%
mutate(
weight_s3 = weight_s^3,
weight_s4 = weight_s^4,
weight = mean(Howell$weight) + weight_s * sd(Howell$weight)
)
m4.6 <-
quap(
data = Howell,
alist(
height ~ dnorm(mu, sigma),
mu <- beta_0 + beta_1 * weight_s +
beta_2 * weight_s2 +
beta_3 * weight_s3,
beta_0 ~ dnorm(178, 20),
beta_1 ~ dlnorm(0, 1),
beta_2 ~ dnorm(0, 1),
beta_3 ~ dnorm(0, 1),
sigma ~ dunif(0, 50)
)
)
Link4.6 <- link(m4.6, data = Weights)
Sim4.6 <- sim(m4.6, data = Weights)
Avg4.6 <-
apply(Link4.6, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
Ind4.6 <-
apply(Sim4.6, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
p4.6 <-
gf_ribbon(ymin + ymax ~ weight, data = Ind4.6, fill = "blue") %>%
gf_ribbon(ymin + ymax ~ weight, data = Avg4.6, fill = "red") %>%
gf_point(
title = "degree: 3",
height ~ weight, data = Howell, inherit = FALSE,
shape = 21, alpha = 0.4)
m4.6a <-
quap(
data = Howell,
alist(
height ~ dnorm(mu, sigma),
mu <- beta_0 + beta_1 * weight_s +
beta_2 * weight_s2 +
beta_3 * weight_s3 +
beta_4 * weight_s4,
beta_0 ~ dnorm(178, 20),
beta_1 ~ dlnorm(0, 1),
beta_2 ~ dnorm(0, 1),
beta_3 ~ dnorm(0, 1),
beta_4 ~ dnorm(0, 1),
sigma ~ dunif(0, 50)
)
)
Link4.6a <- link(m4.6a, data = Weights)
Sim4.6a <- sim(m4.6a, data = Weights)
Avg4.6a <-
apply(Link4.6a, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
Ind4.6a <-
apply(Sim4.6a, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
p4.6a <-
gf_ribbon(ymin + ymax ~ weight, data = Ind4.6a, fill = "blue") %>%
gf_ribbon(ymin + ymax ~ weight, data = Avg4.6a, fill = "red") %>%
gf_point(
title = "degree: 4",
height ~ weight, data = Howell, inherit = FALSE,
shape = 21, alpha = 0.4)
m4.6l <-
quap(
data = Howell,
alist(
height ~ dnorm(mu, sigma),
mu <- beta_0 + beta_1 * weight_s,
beta_0 ~ dnorm(178, 20),
beta_1 ~ dlnorm(0, 1),
sigma ~ dunif(0, 50)
)
)
Link4.6l <- link(m4.6l, data = Weights)
Sim4.6l <- sim(m4.6l, data = Weights)
Avg4.6l <-
apply(Link4.6l, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
Ind4.6l <-
apply(Sim4.6l, 2, mean_hdi) %>%
bind_rows() %>%
bind_cols(Weights)
p4.6l <-
gf_ribbon(ymin + ymax ~ weight, data = Ind4.6l, fill = "blue") %>%
gf_ribbon(ymin + ymax ~ weight, data = Avg4.6l, fill = "red") %>%
gf_point(
title = "degree: 1",
height ~ weight, data = Howell, inherit = FALSE,
shape = 21, alpha = 0.4)
library(patchwork)   # for arranging the four plots in a 2 x 2 grid
( p4.6l | p4.5 ) / (p4.6 | p4.6a)
precis(m4.6l) %>% pander()
|        |  mean  |   sd   |  5.5%  | 94.5% |
|--------|--------|--------|--------|-------|
| beta_0 | 138.3  | 0.4006 | 137.6  | 138.9 |
| beta_1 | 25.94  | 0.4012 | 25.3   | 26.58 |
| sigma  | 9.346  | 0.2833 | 8.893  | 9.799 |
precis(m4.5) %>% pander()
|        |  mean  |   sd   |  5.5%  | 94.5%  |
|--------|--------|--------|--------|--------|
| beta_0 | 146.1  | 0.369  | 145.5  | 146.6  |
| beta_1 | 21.73  | 0.2889 | 21.27  | 22.19  |
| beta_2 | -7.803 | 0.2742 | -8.242 | -7.365 |
| sigma  | 5.775  | 0.1765 | 5.493  | 6.057  |
precis(m4.6) %>% pander()
|        |  mean  |   sd   |  5.5%  | 94.5%  |
|--------|--------|--------|--------|--------|
| beta_0 | 146.4  | 0.31   | 145.9  | 146.9  |
| beta_1 | 15.22  | 0.4763 | 14.46  | 15.98  |
| beta_2 | -6.205 | 0.2572 | -6.616 | -5.794 |
| beta_3 | 3.584  | 0.2288 | 3.218  | 3.949  |
| sigma  | 4.83   | 0.147  | 4.595  | 5.065  |
precis(m4.6a) %>% pander()
|        |  mean   |   sd   |  5.5%  | 94.5%   |
|--------|---------|--------|--------|---------|
| beta_0 | 145.5   | 0.3541 | 145    | 146.1   |
| beta_1 | 16.18   | 0.5143 | 15.36  | 17      |
| beta_2 | -3.758  | 0.551  | -4.639 | -2.878  |
| beta_3 | 2.776   | 0.28   | 2.329  | 3.224   |
| beta_4 | -0.9843 | 0.1958 | -1.297 | -0.6713 |
| sigma  | 4.84    | 0.148  | 4.603  | 5.077   |
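Since \(\sigma\) describes the individual-level variability left over after the mean function does its work, one quick way to compare the fits numerically is to pull out the estimate of \(\sigma\) from each model. Here is a minimal sketch (assuming, as in the rethinking package, that coef() returns the posterior means of a quap fit):
# posterior mean of sigma for each model; smaller sigma means the mean
# function is accounting for more of the variation in height
sapply(
  list(linear = m4.6l, quadratic = m4.5, cubic = m4.6, quartic = m4.6a),
  function(m) coef(m)["sigma"]
)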
Finally, let’s look at the posterior samples themselves to see how the parameters within each model are related to one another.
Post4.6l <- m4.6l %>% extract.samples(1e4)
Post4.5 <- m4.5 %>% extract.samples(1e4)
Post4.6 <- m4.6 %>% extract.samples(1e4)
Post4.6a <- m4.6a %>% extract.samples(1e4)
library(GGally)
ggpairs(Post4.6l, lower = list(continuous = "density"))
ggpairs(Post4.5, lower = list(continuous = "density"))
ggpairs(Post4.6, lower = list(continuous = "density"))
ggpairs(Post4.6a, lower = list(continuous = "density"))
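If you prefer numbers to pictures, the same information can be summarized with a correlation matrix of the posterior samples. A small sketch for the quadratic model (the other models work the same way):
# pairwise correlations among posterior samples for the quadratic model --
# the numeric counterpart of the ggpairs() plot above
Post4.5 %>% as.data.frame() %>% cor() %>% round(2)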
Note: the word “linear” gets used in more than one way in regression. \(f(x) = \beta_0 + \beta_1 x\) is linear in \(x\), and that’s probably what you think of when you think of linear. But \(g(x) = \beta_0 + \beta_1 x + \beta_2 \log(x)\) is linear in the \(\beta\)’s, and that is usually more important. So a model with regression function \(g\) is also considered a linear model. This can be seen another way using linear algebra: \(g(X) = X \beta\), where \(X\) is a 3-column matrix with a column of 1’s, a column of values of \(x\), and a column of values of \(\log(x)\), and \(\beta\) is a 1-column matrix containing \(\beta_0\), \(\beta_1\), and \(\beta_2\).↩︎
Be careful here. Two things can deceive your eye. (1) Where there is more data, there are more chances to get values farther from average. This is not additional variability. (2) Variability always looks more pronounced where the trend is flatter and less pronounced where the trend is steeper. In this case, we have more data where the trend is flatter, so both factors are going to make the differences in variability look larger than they are.↩︎
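To make footnote 1 concrete: in R, the quadratic mean function can be written as a single matrix product, which is what “linear in the \(\beta\)’s” means in practice. A small sketch, reusing the Howell data and the (rounded) posterior means from precis(m4.5) above:
# design matrix with columns 1, weight_s, weight_s2
X <- model.matrix(~ weight_s + weight_s2, data = Howell)
# rounded posterior means of beta_0, beta_1, beta_2 from precis(m4.5)
beta <- c(146.1, 21.73, -7.80)
# mu = X %*% beta: one fitted mean per row of Howell -- linear in beta
mu <- X %*% beta
head(mu)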