This is still a work in progress. Send feedback to rpruim@calvin.edu.
Some things to think about as you create a (Bayesian) model.[1]
Note: This list presumes that you already understand your data – what the variables are, what types they have, the distribution and scale of the variables, etc. If you are designing a study yourself, a lot of effort goes into figuring out what sort of data to get and how to get it. If you are using existing data, you still need to figure out what you have, and whether it is suited to your particular task.
What is the purpose of the model?
What is the response variable and primary predictor(s)?
Are there covariates?
covariate = additional variable that is not of primary interest, but might need to be included to give us a clearer picture of what’s going on.
DAGs can be useful here, especially if we are looking for causal relationships.
Should we transform any of the variables?
Centering and/or rescaling
(Nonlinear) transformations can be used to make “curvy” relationships “straight(er)”.
Transformations can change the family of distributions that are reasonable.
Common transformations include log and square root – especially log.
log transformations are useful when things are “proportional” as opposed to “additive”.
For predictors: Would you expect the same effect each time you add 1 to the predictor or each time you double the predictor?
For response: would you expect a fixed amount of change in a predictor to produce an absolute (additive) amount of change in the response or a proportional (percentage) change?
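As an illustration of these two ideas (the data here are made up, not from the text): centering/rescaling a predictor, and using a log transform to turn a "proportional" relationship into an additive one.

```python
import numpy as np

# Hypothetical predictor and response (invented for illustration).
x = np.arange(1, 7, dtype=float)
y = 10 * 1.5 ** x            # each +1 in x multiplies y by 1.5 ("proportional")

# Centering / rescaling: after centering, the intercept describes the
# response at the mean value of x; after standardizing, slopes are
# "per standard deviation of x".
x_centered = x - x.mean()
x_standardized = x_centered / x.std()

# A log transform turns the multiplicative relationship into an additive one:
# log(y) = log(10) + x * log(1.5), which is linear in x.
slopes = np.diff(np.log(y))
print(np.allclose(slopes, np.log(1.5)))   # constant slope on the log scale
```

The constant first differences on the log scale are exactly the "straightening" effect described above.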
What family is the conditional distribution of the response, given values of all the predictors?
Common choices for continuous response: normal, Student t.
Common choices for count data: binomial (with logit link), Poisson (with log link).
(Many) other choices are possible here.
Note: This is not the same thing as the distribution of the response variable in the data.
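A link function is what connects an unconstrained linear predictor to a parameter with a restricted range. A minimal sketch (the values of the linear predictor are made up):

```python
import numpy as np

# Hypothetical linear predictor values; these can be any real number.
eta = np.array([-2.0, 0.0, 2.0])

# Poisson with log link: the rate lambda must be positive,
# so the inverse link is exp().
lam = np.exp(eta)

# Binomial with logit link: the probability p must lie in (0, 1),
# so the inverse link is the inverse logit.
p = 1 / (1 + np.exp(-eta))

print(lam)   # all positive
print(p)     # all strictly between 0 and 1
```

Whatever the linear predictor does, the inverse link guarantees the distribution's parameter stays in its legal range.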
“Regression” equation: How might the parameters of the conditional response distribution be related to the predictors?
Multilevel model?
McElreath’s view: Multilevel models should be the default approach.
It isn’t really any harder to fit multilevel models once you understand what they are.
Should I reparameterize my model?
Most common reasons: (a) to help Stan – or some other algorithm – work better, or (b) to make the model easier to interpret.
Can use “generated quantities” (gq>) to compute functions of parameters and record them in the posterior. (Note: You may need to tell Stan about the types.)
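The idea behind generated quantities can be sketched in Python: compute a function of the parameters for every posterior draw, so the derived quantity gets its own posterior distribution. (The "draws" below are simulated stand-ins, not output from a real fit.)

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "posterior draws" for two slope parameters (simulated here;
# in practice these would come from Stan's sampler).
beta1 = rng.normal(0.5, 0.1, size=4000)
beta2 = rng.normal(0.3, 0.1, size=4000)

# A "generated quantity": evaluate a function of the parameters draw by draw.
diff = beta1 - beta2

# Derived posterior summaries come for free, e.g. P(beta1 > beta2).
prob = (diff > 0).mean()
print(prob)
```

Doing this inside the model (rather than after the fact) just means Stan records the derived quantity alongside the parameters.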
What priors should we use for the parameters and hyper-parameters?
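One common way to evaluate candidate priors is a prior predictive simulation: draw parameters from the priors and look at the data (or means) they imply. A minimal sketch with a made-up model and priors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model: y ~ Normal(a + b * x, sigma), with x standardized.
a = rng.normal(0, 10, size=1000)    # candidate prior for the intercept
b = rng.normal(0, 10, size=1000)    # candidate prior for the slope
x = 2.0                             # a plausible standardized x value

mu = a + b * x                      # implied mean response at x = 2

# If most implied means fall far outside the plausible range of the
# response, the priors are too vague and should be tightened.
print(np.percentile(mu, [5, 95]))
```

The same simulation, repeated with narrower priors, shows directly how much the prior predictive range shrinks.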
How well is the model working?
Keep in mind that regularization intentionally causes models to underfit the training data in hopes of making better predictions.
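One standard check of how well a model is working is a posterior predictive check: simulate replicated data sets from posterior draws and compare a test statistic to its observed value. A sketch with invented numbers standing in for real data and a real fit:

```python
import numpy as np

rng = np.random.default_rng(7)

# Pretend observed data and stand-in posterior draws (all simulated here).
y_obs = rng.normal(2.0, 1.0, size=50)
mu_draws = rng.normal(2.0, 0.15, size=1000)
sigma_draws = np.abs(rng.normal(1.0, 0.1, size=1000))

# One replicated data set per posterior draw.
y_rep = rng.normal(mu_draws[:, None], sigma_draws[:, None], size=(1000, 50))

# Compare the observed maximum to the distribution of replicated maxima.
ppc_p = (y_rep.max(axis=1) >= y_obs.max()).mean()
print(ppc_p)   # values near 0 or 1 signal misfit for this statistic
```

Other statistics (minimum, sd, proportion of zeros, etc.) probe different kinds of misfit; a regularized model may look slightly "underfit" on the training data by design.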
Can I explain my model (to someone else)?
[1] Many of the items on this checklist are valid for other types of models as well.