There are many situations where we might be interested in the difference between
When our response is numerical, we have two possible ways to compute differences:
In a “two-sample” design, we compare one whole group to the other group.
Two variables for each case
Example: diet (A or B) and amount of weight lost (in pounds) recorded for each person.
In this situation we will take an average for each group and compare the the two averages.
In a paired design, we compare case by case.
Two variables for each case
Example: a pre-test score and post-test score for each person.
In this situation we take a difference (or ratio) for each subject, and the compute the average difference (or ratio).
It is usually easy to tell which situation you are in if you keep track of the variables:
If you are wondering what to do if we have 3 or more groups to compare, that’s an excellent question, but you’ll have to wait a little while for an answer.
It is also possible to have a paired design with a categorical response, but the analysis methods are basically the same because either way we are dealing with two categorical variables.
For each situation below, determine whether you would use a paired design or a 2-sample design and what your variables would be. Also say whether it would be possible to do it the other way and what you would have to change to do it the other way.
You want to know how much higher students’ pulse rates are while taking a test vs during a normal class day.
You want to compare BMI (body mass index) for smokers and non-smokers.
You want to know who gains more weight in the first five years of marriage, men or women?
You want to know whether students perform better on tests if they are printed on blue paper vs white paper.
You want to know whether left-handed people are taller on average than right-handed people.
What is the potential advantage of a paired design?
Analyzing a paired design involves two steps:
mutate()
For the first step, we will learn a new R command for adding a variable to a data set: mutate()
. As an example, let’s look at textbook prices to see if it is cheaper to buy books on Amazon than in the UCLA campus bookstore. This is a paired design because we want to compare prices for the same book to each other.
Let’s begin by taking a look at our data
library(openintro)
gf_point(bookstore_new ~ amazon_new, data = ucla_textbooks_f18) %>%
gf_abline(slope = ~1, intercept = ~0, color = "blue", alpha = 0.5)
## Warning: Removed 133 rows containing missing values (geom_point).
It looks like the bookstore prices are sometimes better, but not as often or by as much as the Amazon prices are better. It looks like the price swings are larger for more expensive books. (Note also the 133 books are missing at least one price. We will remove those below.)
Books <-
ucla_textbooks_f18 %>%
# create the new variable with mutate()
mutate(price_diff = bookstore_new - amazon_new) %>%
# remove rows with missing price data
filter(!is.na(price_diff))
Create a plot showing the price differences in the sample.
On average, how much more expensive are books in the bookstore?
Create a bootstrap distribution for the price difference and use it to create a 95% confidence interval for the mean price difference.
Now let’s try using price ratio instead of price difference.
price_ratio
that contains the price ratio rather than the price difference.If you were doing this study, which would you use, price difference or price ratio? Why?
A group of students were given one of two hypothetical scenarios. In the version 1, they were planning to go to a play, but when they arrived at the theater, they realized they had lost $20. In the version 2, they realized that had lost the the ticket (which costs $20 to replace). In both cases, the students were asked if they would buy a ticket for the play.
Let’s use the data in LittleSurvey
to create a confidence interval for the difference in the proportion of people who would buy a ticket if they had lost cash and the proportion of people who would buy a ticket if they had lost a ticket (\(p_{cash} - p_{ticket}\)).
library(fastR2) # LittleSurvey lives in this package
library(pander)
df_stats(play ~ playver, data = LittleSurvey, counts) %>% pander()
response | playver | n_no | n_yes |
---|---|---|---|
play | v1 | 61 | 103 |
play | v2 | 69 | 44 |
df_stats(play ~ playver, data = LittleSurvey, props) %>% pander()
response | playver | prop_no | prop_yes |
---|---|---|---|
play | v1 | 0.372 | 0.628 |
play | v2 | 0.6106 | 0.3894 |
diffprop(play ~ playver, data = LittleSurvey)
## diffprop
## 0.239
Create a bootstrap distribution for the difference in proportions and and use it to create a 95% confidence interval for the difference in proportions.
[Hint: What three changes do you need to make to the previous line of R code?
Imagine the following hypothetical study.
In an experiment to see how much an SAT Prep course helps, 2000 students are randomly assigned to two groups. One group receives the prep course, the other does not. Each group takes the SAT twice. The prep course group has their prep course between the two SAT tests. In each group, some students do better and some do worse, but in both groups most do better. The average improvement for each group is displayed in the table below
Prep Course | No Course |
---|---|
42.7 | 38.5 |
Answer the following questions about this study
Use the BodyTemp50
data set in Lock5withR
to answer the following questions. This data set was discussed in
Professor Shoemaker taught at Calvin, so I supspect that the subjects of this study are Calvin students (from the 1990s).