I saw lots of good things on this test, and it also revealed some things that some of us don’t fully understand yet. Some of those things are to be expected early in the course – others are things we really want to get in place ASAP. Be sure to look over the comments below and the ones on your test.
This problem gave more people trouble than it should have. A few general comments:
Be path-specific in your comments. Everything is about paths. A node is a pipe, a collider, or a fork only relative to some particular path.
To show that a combination is not good, all you need to exhibit is one path that is open but should be closed, or one that is closed but should be open. Conversely, to show that a combination is good, you need to argue that every path is open or closed as required.
To be open, a path must be open each step of the way. If you close it anywhere along the way, the path is closed. In particular, you can condition on colliders in backdoor paths as long as those paths are closed by some other action.
Including a variable in the model is the same as conditioning on it.
Some of you referred to “removing” things. It isn’t clear to me what you mean by that.
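If you want to check this sort of path reasoning mechanically, the dagitty R package can list the paths in a DAG and the adjustment sets that close all backdoor paths. A minimal sketch, using a made-up DAG rather than the one from the test:

```r
library(dagitty)

# A made-up example DAG: Z is a fork creating the backdoor path X <- Z -> Y.
g <- dagitty("dag{ X -> Y ; X <- Z -> Y }")

paths(g, from = "X", to = "Y")                    # every path, with open/closed status
adjustmentSets(g, exposure = "X", outcome = "Y")  # conditioning sets that close all backdoors
```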
Interaction leaves open the possibility that the slopes are different for the high school education group and the non-high school education group. Without interaction, the slopes will be the same (but the intercepts could be different). I was primarily looking to see that you made this distinction and justified your choice based on that, but without additional information, allowing for different slopes seems like the better way to go.
There are several options here depending on whether you used interaction or not and whether you used index variables (as I intended) or indicator variables (which have some downsides, particularly when it comes to specifying priors).
How you choose priors depends on the way you set up your model and what sorts of centering or standardization you used. If you center the mom’s IQ (`mom_iqc`) and set up your model with index variables and two intercepts (hs and non-hs) and two slopes (reflecting an interaction), then your model would include `mu <- a[hs_idx] + b[hs_idx] * mom_iqc`, and you could reason about priors like this:
The intercepts should be near 100, so `dnorm(100, w)` makes sense for some small `w`, smaller than 15 certainly, since we are more sure about the average being near 100 than we are about an individual IQ being near 100.
Slopes somewhere between 0 and 1 seem most reasonable. A slope of 1 would mean that for each 1-point increase in the mother’s IQ you see a 1-point increase in the child’s IQ; 0 would mean the mother’s IQ is not associated with the child’s IQ. Your prior should at least get this order of magnitude correct.
\(\sigma\) should be positive (because it is a standard deviation) and less than 15 (on the natural scale) if our model has any predictive value. Our prior should reflect that. Many of you used `dexp(1)`, as the author often has. That prior allows for some pretty large standard deviations – probably unreasonably large – but since the author has been doing this and since we have talked the least about how to choose good priors for \(\sigma\), I was pretty generous about what sorts of priors were used here, as long as they forced \(\sigma\) to be positive and were not orders of magnitude off.
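To make that concrete, here is a minimal sketch of one such model using `quap()` from the rethinking package. The data frame `d` and its column names are assumptions (yours may differ), and the specific prior widths are just one defensible choice consistent with the reasoning above:

```r
library(rethinking)

# Assumed data frame d with columns: kid_iq (child's IQ), mom_iqc (mom's IQ,
# centered), and hs_idx (index variable: 1 = non-hs, 2 = hs).
u_int <- quap(
  alist(
    kid_iq ~ dnorm(mu, sigma),
    mu <- a[hs_idx] + b[hs_idx] * mom_iqc,  # two intercepts, two slopes (interaction)
    a[hs_idx] ~ dnorm(100, 10),             # intercepts near 100; w = 10 < 15
    b[hs_idx] ~ dnorm(0.5, 0.25),           # slopes mostly between 0 and 1
    sigma ~ dexp(1)                         # positive; allows (perhaps too) large values
  ),
  data = d
)
precis(u_int, depth = 2)  # inspect both intercepts and both slopes
```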
The scoring scheme here was roughly:
4 pts for demonstrating that you understood what interaction means and had a good reason for choosing whether or not to include interaction in your model
4 pts for setting up the non-prior parts of the model, making sure that your inclusion of interaction matched what you said you would do
6 pts for your choice of priors.
Since `u2` has all the flexibility of `u1`, plus some additional flexibility, we know it will fit the data at least as well as `u1` does. Here are some ways we can see how much better:
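One quick check, sketched below under the assumption that `u1` and `u2` are fitted rethinking models, is an information-criterion comparison:

```r
compare(u1, u2)               # WAIC comparison: lower WAIC = better expected predictive fit
compare(u1, u2, func = PSIS)  # the same comparison using PSIS-LOO instead of WAIC
```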
The important thing here is to look at the posterior distribution for the difference, not to look at the overlap of the posterior distributions for the predicted performance of each employee.
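A sketch of what that looks like in code, assuming a fitted model `u2` with one intercept per employee stored as `a[emp_idx]` (the parameter name is an assumption):

```r
post <- extract.samples(u2)            # posterior samples from the fitted model
diff_xy <- post$a[, 1] - post$a[, 2]   # posterior distribution of the employee 1 vs 2 difference
mean(diff_xy)                          # posterior mean of the difference
PI(diff_xy)                            # 89% percentile interval: does it exclude 0?
dens(diff_xy)                          # plot the full posterior of the difference
```

Eyeballing the overlap of the two marginal intervals ignores the posterior correlation between `a[1]` and `a[2]`; the contrast above accounts for it.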
The amount of variability from day to day is pretty similar to the amount of variability from employee to employee. Day seems to matter a bit more than employee. You could quantify this by looking at things like the posterior distributions of \(\sigma\) or \(R^2\).
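For instance, assuming a model with separate standard deviations for day-to-day and employee-to-employee variation (the model name `u7` and the parameter names `sigma_day` and `sigma_emp` are assumptions), you could compare their posteriors directly:

```r
post <- extract.samples(u7)  # u7 and the parameter names below are assumptions
precis(data.frame(day = post$sigma_day, employee = post$sigma_emp))
mean(post$sigma_day > post$sigma_emp)  # posterior probability that day matters more
```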
This went pretty well for most of you.
In addition to the notes in Gradescope, some general comments:
The first two models make essentially identical predictions, so the choice of prior is just a matter of how the parameterization works.
Don’t focus exclusively on the means of distributions; always consider the uncertainty/variability as well.
In 7, there is enough wiggle room in the prior that we would likely get posterior distributions centered at values somewhere around 60 for both day and employee. The overall model would likely make similar predictions to those made by `u6` and `u7`. But in 8, we would need to get 5 or 6 standard deviations away from the center of the prior for each of our parameters, and it would take a LOT of data to convince our model to do that. Very likely one of two things will happen: (1) the model will perform very badly (poor `n_eff`, bad mixing, etc.) because it is struggling to find a region where both the prior and the likelihood are relatively large; or (2) the prior will act as a strong regularizer and end up giving estimates for the number of pallets repaired that are much lower than the data suggest (because we have told the model to be very skeptical of those larger values). The posterior values for \(\sigma\) would likely be larger as well (to make response values farther from the prediction less unusual). Note: you can figure this out just by thinking things through, but there was nothing stopping you from fitting the model to see for yourself. (Give it a try if you want to see how it actually works out.)
Some of you were distracted by words like “impact” and “interaction,” which are not really relevant here.
`u6` and `u7` are better than `u4` and `u5`, both based on how well they fit and based on likely unreasonable aspects of models `u4` and `u5` (a linear or quadratic relationship between number of pallets repaired and day).
If you use plots to assess the models, the plots should probably include both model predictions and data, so you can see how they compare.
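A minimal sketch of such a plot with rethinking, assuming a fitted model `u6` and a data frame `d` with columns `day` and `pallets` (the names are assumptions):

```r
mu <- link(u6)                           # posterior draws of the mean for each observation
mu_mean <- apply(mu, 2, mean)
pallets_sim <- sim(u6)                   # posterior predictive draws of the outcome
pallets_PI <- apply(pallets_sim, 2, PI)  # 89% prediction interval per observation

# Plot the data, then overlay predicted means and prediction intervals.
plot(d$day, d$pallets, pch = 16, col = rangi2,
     xlab = "day", ylab = "pallets repaired")
points(d$day, mu_mean)
for (i in seq_len(nrow(d))) lines(rep(d$day[i], 2), pallets_PI[, i])
```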
Again, make comparisons by looking at posterior distributions of differences.
Some of you didn’t explain why the new model is better than the old model. The main reason is that including day in the model allows us to compare employees conditional on day. Essentially this reduces the impact of day-to-day variation and makes it easier to see the employee-to-employee differences.
It may well be that, if we take a random day for employee X and a random day for employee Y, either one might fix more pallets; but if we pick the same day for each employee, one of them nearly always does better than the other, predictably.
Note: there are multiple possible explanations for the day-to-day variability. Some of you posited particular explanations (weekend effects, etc.). It could also be that the pallets were in worse condition on some days than on others, slowing everyone down because there was more work to be done on each pallet (on average).