class: center, middle, inverse, title-slide # The Whickham Study ## Smoking and Life Expectency ### Stat 145 --- --- # The Whickham survey The Whickham survey was conducted in 1972--74 and a follow-up of the same people occurred 20 years later. One of the things recorded was whether or not someone was a smoker at baseline (ie, when the study started in 1972--74) and whether they were still alive 20 years later. You can read more about the study using `?Whickham`. --- # Hypotheses Let's see whether this study provides evidence that non-smokers are more likely to survive than non-smokers. **Q**: What are the null and alternative hypotheses for this situation? -- * `\(H_0\)`: `\(p_s = p_n\)` (The proportion of smokers who are still alive is equal to the proportion of non-smokers who are still alive.) * `\(H_a\)`: `\(p_s < p_n\)` (The proportion of smokers who are still alive is less than the proportion of non-smokers who are still alive.) -- One-sided test makes sense here * We know which side to choose without referencing the data * We might want to confirm our suspicion that smoking is bad for your health -- Two-side would also be possible --- # Looking at the data Looking at the data, we immediately notice something unexpected. ```r df_stats( outcome ~ smoker, data = Whickham, props) ``` ``` ## response smoker prop_Alive prop_Dead ## 1 outcome No 0.6857923 0.3142077 ## 2 outcome Yes 0.7611684 0.2388316 ``` ```r gf_props( ~ outcome | smoker, data = Whickham) ``` ![](Whickham-Smokers_files/figure-html/Whickham-props-1.png)<!-- --> -- **Smokers in this study are more likely to be alive after 20 years than non-smokers!** --- # P-value note **Q**: Do we have any evidence that non-smokers live longer than smokers (from this study)? -- **A**: No, the smokers in the study were more likely to be alive. -- So our original 1-sided test should have a large p-value (greater than 0.5) --- # Null Distrbution Let's create a null distibution in our usual way. ```r set.seed(2468) test_stat <- diffprop( outcome ~ smoker, data = Whickham); test_stat ``` ``` ## diffprop ## 0.07537604 ``` ```r Smoking_null <- do(1000) * diffprop( outcome ~ shuffle(smoker), data = Whickham) ``` --- # Looking at the Null Distribution ![](Whickham-Smokers_files/figure-html/Whickham-dotplot-1.png)<!-- --> **Q**: What is the p-value here? --- # Computing the p-value ------------------------------------ compare(diffprop, test_stat, n verbose = TRUE) ------------------------------ ----- diffprop < test_stat 998 diffprop = test_stat 1 diffprop > test_stat 1 ------------------------------------ -- Notice that `diffprop()` is computing `\(p_s - p_n\)` and our p-value is 999/1000 = 0.999. That's a very large p-value, which makes sense since the data do not provide any evidence to support the hypothesis that non-smokers live longer than smokers. --- # 3 p-values Depending our our alternative hypothesis, there are 3 possible p-values. ### `\(H_a: p_s - p_n \neq 0\)` If our alternative is 2-sided, we need to double the tail probability to get our p-value. ```r gf_histogram( ~ diffprop, data = Smoking_null) ``` ![](Whickham-Smokers_files/figure-html/Whickham-prop-hist-1.png)<!-- --> ```r 2 * prop( ~(diffprop >= test_stat), data = Smoking_null) ``` ``` ## prop_TRUE ## 0.004 ``` --- # 3 p-values ### `\(H_a: p_s - p_n < 0\)`, This is a natural 1-sided alternative (that smokers are less likely to be alive at the follow-up). The p-value in this case is very large since we don't have any evidence that non-smokers are *more* likely to survive. ```r prop( ~(diffprop <= test_stat), data = Smoking_null) ``` ``` ## prop_TRUE ## 0.999 ``` --- # 3 p-values ### `\(H_a: p_s - p_n > 0\)` Now the 1-tailed p-value is ```r prop( ~(diffprop >= test_stat), data = Smoking_null) ``` ``` ## prop_TRUE ## 0.002 ``` -- But this probably isn't the alternative that any of us had in mind *before looking at the data*, and we are not allowed to select the alternative based on the data. --- # Why the small 2-sided p-value? Suppose we go back to the 2-sided p-value. Perhaps it is surprising that our p-value is so small, indicating that * the proportion of smokers who survive for 20 years is different from the proportion for non-smokers and that * the difference can't readily be attributed to chance alone. It seems like something is going on here. But the direction of the effect is backwards from our expectations: The smokers in this study were *more* likely to survive. -- The small p-value doesn't mean that smoking *causes* people to live longer. (Remember, this is an observation study, not an experiment -- the researchers did not determine who would smoke and who would not.) There may be other explanations. --- # Comparing Ages ```r df_stats(age ~ smoker, data = Whickham, mean) ``` ``` ## response smoker mean ## 1 age No 48.69945 ## 2 age Yes 44.68213 ``` ```r gf_sina(age ~ smoker, data = Whickham) ``` ![](Whickham-Smokers_files/figure-html/Whickham-age-look-1.png)<!-- --> --- # Hypothesis test for difference in means ```r set.seed(2468) Age_null <- do(1000) * diffmean(age ~ shuffle(smoker), data = Whickham) ``` ```r gf_histogram( ~diffmean, data = Age_null) ``` ![](Whickham-Smokers_files/figure-html/Whickham-age-hist-1.png)<!-- --> As our histogram indicates, a difference of approximately 4 years in the average age of the two groups seems very unusual. None of our 1000 randomization samples had such a large difference. --- # So what can we conclude? We don't know (based on the p-value alone) that this is the cause of the different survival rates either, but it seems likely that it has an impact and helps us understand what might be going on. -- More involved methods (that we don't know yet) could be used to compare the survival rates of smokers and non-smokers after "adjusting for differences in age". -- But here is a little something we can do. Let's look only at people who were in their 50s at baseline. --- # People in their 50s ```r Whickham50s <- Whickham %>% filter(50 <= age & age < 60) df_stats(outcome ~ smoker, data = Whickham50s, props, counts) ``` ``` ## response smoker prop_Alive prop_Dead n_Alive n_Dead ## 1 outcome No 0.7524752 0.2475248 76 25 ## 2 outcome Yes 0.6964286 0.3035714 78 34 ``` ```r gf_violin(age ~ smoker, data = Whickham50s) %>% gf_sina() ``` ![](Whickham-Smokers_files/figure-html/Whickham50s-age-1.png)<!-- --> --- # People in their 50s -- Hypothesis test ```r test_stat50s <- diffprop(outcome ~ smoker, data = Whickham50s); test_stat50s ``` ``` ## diffprop ## -0.05604668 ``` ```r Smoke50s_null <- do(1000) * diffprop(outcome ~ shuffle(smoker), data = Whickham50s) tally(~compare(test_stat50s, diffprop), data = Smoke50s_null) ``` ``` ## compare(test_stat50s, diffprop) ## < = > ## 769 82 149 ``` ```r prop( ~ (diffprop <= test_stat50s), data = Smoke50s_null) ``` ``` ## prop_TRUE ## 0.231 ```