Here are four examples we have seen.
set.seed(123)
# Lady Tasting Tea
Tea_null <- do(2000) * rflip(20)
# Malaria vaccine study
Malaria_null <- do(2000) * diffprop(malaria ~ shuffle(group), data = Malaria)
# Gender discrimination in promotion decisions?
Promotion_null <- do(2000) * diffprop(decision ~ shuffle(gender), data = Promotion)
# Does caffeine increase finger tapping rate?
Taps_null <- do(2000) * diffmean(Taps ~ shuffle(Group), data = CaffeineTaps)
Let’s look at the randomization distributions for each of these examples.
Where are they centered?
How would you describe their shape?
Null distributions will always be centered at the value the test statistic would have if the null hypothesis were true.
Many (but not all) null distributions have a symmetric, bell-shaped distribution.
If we knew this shape in advance, we could get our p-values directly from the distribution without doing the randomization.
This bell-shaped approximation is better for larger sample sizes than for smaller ones.
Many (again, not all) data distributions also have approximately this same shape. Since it shows up so often, we want to learn a bit about these bell-shaped distributions.
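The claims about center and shape can be checked with a quick simulation. The sketch below is a base-R stand-in for do(2000) * rflip(20): rbinom() plays the role of rflip(), and the name heads_null is only illustrative. Under a null hypothesis of fair coin flips, the number of heads in 20 flips should center near 10.

```r
set.seed(123)
# 2000 simulated null results: number of heads in 20 fair coin flips
heads_null <- rbinom(2000, size = 20, prob = 0.5)
# Center of the null distribution; should be close to 20 * 0.5 = 10
mean(heads_null)
# Spread of the null distribution; should be close to sqrt(20 * 0.5 * 0.5)
sd(heads_null)
# Shape: roughly symmetric and bell-shaped
hist(heads_null, breaks = seq(-0.5, 20.5, by = 1),
     main = "Null distribution", xlab = "heads in 20 flips")
```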
The bell-shaped curve that many distributions (approximately) have is called a normal distribution.
Actually, there are many normal distributions. They all have the same basic shape, but they can have different means (centers) and standard deviations (spreads).
We will denote a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) as \({\sf Norm}(\mu, \sigma)\).
For example, here is the Norm(100, 20) distribution.
gf_dist("norm", mean = 100, sd = 20, title = "Norm(100, 20)")
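To see how the mean and standard deviation change the curve, it can help to overlay a few normal distributions. This is a minimal base-R sketch; the particular means and standard deviations (other than 100 and 20) are chosen only for illustration.

```r
x <- seq(20, 180, by = 0.5)
# Three normal density curves evaluated on the same grid
dens_a <- dnorm(x, mean = 100, sd = 20)   # Norm(100, 20)
dens_b <- dnorm(x, mean = 100, sd = 10)   # same center, less spread
dens_c <- dnorm(x, mean = 120, sd = 20)   # same spread, shifted center
plot(x, dens_b, type = "l", lty = 2, xlab = "value", ylab = "density")
lines(x, dens_a, lty = 1)
lines(x, dens_c, lty = 3)
legend("topleft", lty = c(1, 2, 3),
       legend = c("Norm(100, 20)", "Norm(100, 10)", "Norm(120, 20)"))
```

Changing the mean slides the curve left or right; changing the standard deviation makes it narrower and taller or wider and flatter.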
These distributions have a number of special mathematical properties. Here are four important ones for us.
Since they are symmetric, the mean, median, and mode are all the same. Half of the distribution is below the mean and half is above.
~68% of the distribution is between 1 standard deviation below the mean and 1 standard deviation above the mean.
~95% of the distribution is between 2 standard deviations below the mean and 2 standard deviations above the mean.
~99.7% of the distribution is between 3 standard deviations below the mean and 3 standard deviations above the mean.
The same sort of thing is true for any number of standard deviations up and down from the mean, but we won’t memorize any more of these. Instead, we will let R compute the values for us. (Stay tuned for how this is done.)
A certain IQ test has scores that are approximately normally distributed with a mean of 100 and a standard deviation of 15.
In a normal distribution with mean 20 and standard deviation 2,
The most important thing to know about a value in a normal distribution is how many standard deviations above or below the mean it is. This number is called its z-score or standardized score.
\[z = \frac{\mathrm{value} - \mathrm{mean}}{\mathrm{standard\ deviation}}\]
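The formula translates directly into R. In this sketch, z_score() is a hypothetical helper (not part of any package), and the scores 130 and 85 are made-up values on the IQ scale with mean 100 and standard deviation 15.

```r
# Hypothetical helper: how many standard deviations is value from the mean?
z_score <- function(value, mean, sd) {
  (value - mean) / sd
}
z_score(130, mean = 100, sd = 15)  # 2: two standard deviations above the mean
z_score(85, mean = 100, sd = 15)   # -1: one standard deviation below the mean
```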
The heights of men and women in the US ages 18–24 are approximately normal. For men, the mean is 70 inches with a standard deviation of 2.8 inches. For women, the mean is 64.3 inches with a standard deviation of 2.6 inches.
Use the information below to compute the z-score for each test statistic.
diffprop(malaria ~ group, data = Malaria)
## diffprop
## 0.6428571
df_stats(~diffprop, data = Malaria_null)
## response min Q1 median Q3 max mean
## 1 diffprop -0.7857143 -0.07142857 -0.07142857 0.1666667 0.6428571 0.001190476
## sd n missing
## 1 0.2464614 2000 0
diffprop(decision ~ gender, data = Promotion)
## diffprop
## -0.2916667
df_stats(~diffprop, data = Promotion_null)
## response min Q1 median Q3 max mean sd
## 1 diffprop -0.4583333 -0.125 -0.04166667 0.04166667 0.375 -0.009458333 0.131084
## n missing
## 1 2000 0
diffmean(Taps ~ Group, data = CaffeineTaps)
## diffmean
## -3.5
df_stats(~diffmean, data = Taps_null)
## response min Q1 median Q3 max mean sd n missing
## 1 diffmean -4.3 -0.9 -0.1 0.9 3.5 -0.0412 1.318792 2000 0
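As a worked start, here is one way to get the z-score for the malaria test statistic, copying the numbers by hand from the output above. The variable names are only illustrative; the other two examples follow the same pattern.

```r
# Values copied from the Malaria output above
observed  <- 0.6428571     # diffprop in the actual data
null_mean <- 0.001190476   # mean of the randomization distribution
null_sd   <- 0.2464614     # sd of the randomization distribution
z_malaria <- (observed - null_mean) / null_sd
z_malaria                  # about 2.6
```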
What does the z-score of a test statistic tell you about the p-value (assuming the null distribution is approximately normal)?
Of course, not every value is 0, 1, 2, or 3 standard deviations away from the mean. We could memorize even more values, but that would quickly get tedious. Instead we will let R compute these values for us as we need them.
The one slightly tricky thing is that R works with “below” rather than “between”. (For most uses this is actually easier.) The two functions we need are pnorm() and qnorm().
pnorm(x, mean, sd) computes the proportion of the normal distribution with mean mean and standard deviation sd that is below x.
qnorm(p, mean, sd) computes the value in the normal distribution with mean mean and standard deviation sd that has proportion p below it.
If you want to get fancy, you can use xpnorm() and xqnorm(). These functions will additionally draw a picture of the normal distribution for you.
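A short sketch of both functions, using only base R and the IQ distribution Norm(100, 15). The score 115 is an arbitrary illustration. The last part recovers the 68%, 95%, and 99.7% figures by subtracting the proportion below the lower value from the proportion below the upper value.

```r
# Proportion below a value, using the IQ distribution Norm(100, 15)
p_below <- pnorm(115, mean = 100, sd = 15)
p_below
# qnorm() undoes pnorm(): this gives back 115 (up to rounding)
qnorm(p_below, mean = 100, sd = 15)
# Proportion within k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
within_k <- pnorm(100 + k * 15, mean = 100, sd = 15) -
  pnorm(100 - k * 15, mean = 100, sd = 15)
round(within_k, 4)   # 0.6827 0.9545 0.9973
```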
Here are some examples. The first one is worked out for you.
Going back to the IQ test with mean 100 and standard deviation 15,
What proportion of IQ tests are below 110?
xpnorm(110, mean = 100, sd = 15)
##
## If X ~ N(100, 15), then
## P(X <= 110) = P(Z <= 0.6667) = 0.7475
## P(X > 110) = P(Z > 0.6667) = 0.2525
##
## [1] 0.7475075
What proportion of IQ tests are between 90 and 110?
# We can get this by subtracting the part that is below 90
pnorm(110, mean = 100, sd = 15) - pnorm(90, mean = 100, sd = 15)
## [1] 0.4950149
Don scores at the 95th percentile. What is his IQ score?
xqnorm(0.95, mean = 100, sd = 15)
##
## If X ~ N(100, 15), then
## P(X <= 124.6728) = 0.95
## P(X > 124.6728) = 0.05
##
## [1] 124.6728
More IQ
https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp lists the mean and standard deviation for SAT scores in each state. For Michigan, the mean is 1005 and the standard deviation is 195. The distribution of scores is approximately normal.
We can ask questions about z-scores by using the normal distribution with mean 0 and standard deviation 1. (These are the default values, so you can actually omit them altogether if you like.)
This gives us a second way to work with normal distributions: Translate the question to one about z-scores, and work with a Norm(0, 1) distribution. This important example of a normal distribution is called the standard normal distribution.
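The two approaches give the same answers. The sketch below revisits two of the IQ questions (mean 100, standard deviation 15), first on the original scale and then by way of z-scores; the variable names are only illustrative.

```r
# Proportion of IQ scores below 110, computed two ways
direct <- pnorm(110, mean = 100, sd = 15)
z      <- (110 - 100) / 15
via_z  <- pnorm(z)          # mean = 0 and sd = 1 are the defaults
all.equal(direct, via_z)    # TRUE
# 95th percentile: find the z-score first, then convert back to the IQ scale
z_95 <- qnorm(0.95)
100 + z_95 * 15             # same as qnorm(0.95, mean = 100, sd = 15)
```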
Z-score review. Fill in the blank.