The Whickham Study

class: center, middle, inverse, title-slide

# The Whickham Study
## Smoking and Life Expectency
### Stat 145

---

# The Whickham survey

The Whickham survey was conducted in 1972--74 and a follow-up of the same
people occurred 20 years later. One of the things recorded was whether or not someone
was a smoker at baseline (ie, when the study started in 1972--74) and whether they were 
still alive 20 years later.

You can read more about the study using `?Whickham`.

---

# Hypotheses

Let's see whether this study provides evidence that non-smokers are more likely to survive 
than non-smokers.

**Q**: What are the null and alternative hypotheses for this situation?

* `\(H_0\)`: `\(p_s = p_n\)` (The proportion of smokers who are still alive is equal to
the proportion of non-smokers who are still alive.)

* `\(H_a\)`: `\(p_s < p_n\)` (The proportion of smokers who are still alive is less than
the proportion of non-smokers who are still alive.)

One-sided test makes sense here

* We know which side to choose without referencing the data
* We might want to confirm our suspicion that smoking is bad for your health

Two-side would also be possible

---

# Looking at the data

Looking at the data, we immediately notice something unexpected.

```r
df_stats( outcome ~ smoker, data = Whickham, props)
```

```
##   response smoker prop_Alive prop_Dead
## 1  outcome     No  0.6857923 0.3142077
## 2  outcome    Yes  0.7611684 0.2388316
```

```r
gf_props( ~ outcome | smoker, data = Whickham)
```

![](Whickham-Smokers_files/figure-html/Whickham-props-1.png)

**Smokers in this study are more likely to be alive after 20 years than non-smokers!**

---

# P-value note

**Q**: Do we have any evidence that non-smokers live longer than smokers (from this study)?

**A**: No, the smokers in the study were more likely to be alive.

So our original 1-sided test should have a large p-value (greater than 0.5)

---

# Null Distrbution

Let's create a null distibution in our usual way.

```r
set.seed(2468)
test_stat <- diffprop( outcome ~ smoker, data = Whickham); test_stat
```

```
##   diffprop 
## 0.07537604
```

```r
Smoking_null <- 
  do(1000) * diffprop( outcome ~ shuffle(smoker), data = Whickham)
```

---

# Looking at the Null Distribution

![](Whickham-Smokers_files/figure-html/Whickham-dotplot-1.png)

**Q**: What is the p-value here?

---

# Computing the p-value

------------------------------------
 compare(diffprop, test_stat,    n  
       verbose = TRUE)              
------------------------------ -----
     diffprop < test_stat       998

diffprop = test_stat        1

diffprop > test_stat        1  
------------------------------------

Notice that `diffprop()` is computing `\(p_s - p_n\)` and our p-value is 999/1000 = 0.999.
That's a very large p-value, which makes sense since the data do not provide any evidence
to support the hypothesis that non-smokers live longer than smokers.

---

# 3 p-values

Depending our our alternative hypothesis, there are 3 possible p-values.

### `\(H_a: p_s - p_n \neq 0\)`

If our alternative is 2-sided, we need to double the tail probability to get our
p-value.

```r
gf_histogram( ~ diffprop, data = Smoking_null)
```

![](Whickham-Smokers_files/figure-html/Whickham-prop-hist-1.png)

```r
2 * prop( ~(diffprop >= test_stat), data = Smoking_null)
```

```
## prop_TRUE 
##     0.004
```

---

# 3 p-values

### `\(H_a: p_s - p_n < 0\)`,

This is a natural 1-sided alternative (that smokers are less likely
to be alive at the follow-up).  The p-value in this case is very large 
since we don't have any evidence that non-smokers are *more* likely 
to survive.

```r
prop( ~(diffprop <= test_stat), data = Smoking_null)
```

```
## prop_TRUE 
##     0.999
```

---

# 3 p-values

### `\(H_a: p_s - p_n > 0\)`

Now the 1-tailed p-value is

```r
prop( ~(diffprop >= test_stat), data = Smoking_null)
```

```
## prop_TRUE 
##     0.002
```

But this probably isn't the alternative that any of us had in mind 
*before looking at the data*, and we are not allowed to select the 
alternative based on the data.

---

# Why the small 2-sided p-value?

Suppose we go back to the 2-sided p-value.
Perhaps it is surprising that our p-value is so small, 
indicating that

* the proportion of smokers who survive for 20 years is
different from the proportion for non-smokers and that
* the difference can't readily be attributed to chance alone.

It seems like something is going on here. But the direction of the effect
is backwards from our expectations: The smokers in this study were
*more* likely to survive.

The small p-value doesn't mean that smoking *causes* people to live longer. 
(Remember, this is an observation study, not an experiment -- the researchers
did not determine who would smoke and who would not.)
There may be other explanations.

---

# Comparing Ages

```r
df_stats(age ~ smoker, data = Whickham, mean)
```

```
##   response smoker     mean
## 1      age     No 48.69945
## 2      age    Yes 44.68213
```

```r
gf_sina(age ~ smoker, data = Whickham)
```

![](Whickham-Smokers_files/figure-html/Whickham-age-look-1.png)

---

# Hypothesis test for difference in means

```r
set.seed(2468)
Age_null <- 
  do(1000) * diffmean(age ~ shuffle(smoker), data = Whickham)
```

```r
gf_histogram( ~diffmean, data = Age_null)
```

![](Whickham-Smokers_files/figure-html/Whickham-age-hist-1.png)

As our histogram indicates, a difference of approximately 4 years in the 
average age of the two groups seems very unusual.  None of our 1000 randomization
samples had such a large difference.

---

# So what can we conclude?

We don't know (based on the p-value alone)
that this is the cause of the different survival rates either, 
but it seems likely that it has an impact and helps us understand what might be 
going on.

More involved methods (that we don't know yet) could be used to compare
the survival rates of smokers and non-smokers after 
"adjusting for differences in age".

But here is a little something  we can do. 
Let's look only at people who were in their 50s at baseline.

---

# People in their 50s

```r
Whickham50s <-
  Whickham %>% 
  filter(50 <= age & age < 60)

df_stats(outcome ~ smoker, data = Whickham50s, props, counts)
```

```
##   response smoker prop_Alive prop_Dead n_Alive n_Dead
## 1  outcome     No  0.7524752 0.2475248      76     25
## 2  outcome    Yes  0.6964286 0.3035714      78     34
```

```r
gf_violin(age ~ smoker, data = Whickham50s) %>% gf_sina()
```

![](Whickham-Smokers_files/figure-html/Whickham50s-age-1.png)

---

# People in their 50s -- Hypothesis test

```r
test_stat50s <- diffprop(outcome ~ smoker, data = Whickham50s); test_stat50s
```

```
##    diffprop 
## -0.05604668
```

```r
Smoke50s_null <- do(1000) * diffprop(outcome ~ shuffle(smoker), data = Whickham50s)
tally(~compare(test_stat50s, diffprop), data = Smoke50s_null)
```

```
## compare(test_stat50s, diffprop)
##   <   =   > 
## 769  82 149
```

```r
prop( ~ (diffprop <= test_stat50s), data = Smoke50s_null)
```

```
## prop_TRUE 
##     0.231
```