For each of the following questions, (i) say what type of plot you would make to investigate the question, (ii) identify what variables you would need to make your plot and whether they are categorical or numerical, (iii) write down the R command to create that plot (you can make up reasonable names for the variables and data sets), and (iv) sketch what the plot might look like if the answer is yes.
Using the penguins data set in the palmerpenguins package, answer the following questions.
df_stats(), perhaps with some minor adjustments – it depends on just which plot you made.)
Using the penguins data set in the palmerpenguins package, answer the following questions. For each you will need to make a plot.
Smoking. This problem uses the NutritionStudy dataset from the Lock5withR package.
Read the documentation to find out what population these subjects were sampled from. List some ways that the subjects in this sample probably differ from the US population as a whole. Such differences can make it difficult to know to whom the study generalizes.
In the sample, who were more likely to be smokers (Smoke), men or women?
Use these data to test whether we have evidence that the proportion of men who smoke is different from the proportion of women who smoke, or whether the observed difference could be attributed to chance.
There is another variable called EverSmoke in this data set. Why can’t we use this same procedure to see if that variable differs by gender? (We will eventually learn how to do this.)
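One way to carry out a randomization test for a difference in two proportions is to shuffle the group labels many times and see how often the shuffled difference is at least as large as the observed one. The sketch below uses only base R and made-up counts (the vectors `sex` and `smoke` are hypothetical, not values from NutritionStudy); the tools used in class may differ.

```r
# Hypothetical data: these counts are made up, NOT taken from NutritionStudy.
set.seed(145)
sex   <- rep(c("Female", "Male"), times = c(60, 40))
smoke <- c(rep(c("Yes", "No"), times = c(9, 51)),    # 9 of 60 women smoke
           rep(c("Yes", "No"), times = c(10, 30)))   # 10 of 40 men smoke

# Test statistic: difference in sample proportions (Male - Female)
diff_prop <- function(smoke, sex) {
  mean(smoke[sex == "Male"] == "Yes") - mean(smoke[sex == "Female"] == "Yes")
}
observed <- diff_prop(smoke, sex)

# Shuffling the sex labels simulates a world with no association
null_stats <- replicate(5000, diff_prop(smoke, sample(sex)))

# Two-sided p-value: proportion of shuffles at least as extreme as observed
p_value <- mean(abs(null_stats) >= abs(observed))
observed
p_value
```

Shuffling works here because the response has exactly two levels, so a single proportion per group summarizes it.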
Exploring Linear Regression. Use this app and take a screenshot
showing a scatter plot with one outlier that changes the slope of the regression line from positive (without the outlier) to negative (with the outlier).
showing a scatter plot with one outlier that doesn’t change the regression line at all.
showing a scatter plot with a correlation coefficient of (approximately) 0 but a strong association between the x and y variables.
showing a scatter plot that has two clusters each of which would have positive correlation on their own, but together have a negative correlation.
Explain why it is always important to look at a plot when doing regression.
COVID-19. It is difficult to get good sensitivity and specificity numbers for COVID-19 tests. There seems to be a great deal of variation in the estimates coming from various studies and there are also a number of different tests and different labs performing the tests under different conditions. An article at https://www.bmj.com/content/369/bmj.m1808 talks about this and suggests that good figures to use for illustrative purposes are a specificity of 95% and a sensitivity of 70%. 20% of people with Terry’s symptoms have COVID-19, so her doctor orders a test. Show your work as you answer the following questions.
When you have done that you can check your work at https://www.bmj.com/content/369/bmj.m1808/infographic.
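A convenient way to organize this kind of calculation is to imagine a fixed number of people with Terry's symptoms and count how many land in each cell of the test-result table. The sketch below does that in R with the illustrative figures from the problem (sensitivity 70%, specificity 95%, prevalence 20%). The group size of 1000 is an arbitrary choice, and which of the resulting quantities a given question needs is for you to decide.

```r
sensitivity <- 0.70   # P(test positive | has COVID-19)
specificity <- 0.95   # P(test negative | does not have COVID-19)
prevalence  <- 0.20   # P(has COVID-19) among people with these symptoms

n <- 1000                                   # arbitrary imagined group size
with_covid    <- n * prevalence
without_covid <- n - with_covid

true_pos  <- with_covid * sensitivity       # have COVID-19, test positive
false_neg <- with_covid - true_pos          # have COVID-19, test negative
true_neg  <- without_covid * specificity    # no COVID-19, test negative
false_pos <- without_covid - true_neg       # no COVID-19, test positive

# Probability of COVID-19 given each possible test result
p_covid_given_pos <- true_pos / (true_pos + false_pos)
p_covid_given_neg <- false_neg / (false_neg + true_neg)
c(positive = p_covid_given_pos, negative = p_covid_given_neg)
```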
Body Temperature. Using the BodyTemp50 data in the Lock5withR package,
Pulse. Repeat problem 8 for pulse instead of body temperature.
Height and BMI (men). The following code will select just the men in their 20s from the NHANES data set.
library(NHANES)
library(dplyr)   # provides %>% and filter()
Men20s <- NHANES %>% filter(Age >= 20, Age <= 29, Gender == "male")
Height and BMI (women). Repeat the previous problem for women instead of men.
Smoking. The following code will create data sets with just the adult men and adult women from NHANES.
library(NHANES)
library(dplyr)   # provides %>% and filter()
Men <- NHANES %>% filter(Age >= 21, Gender == "male")
Women <- NHANES %>% filter(Age >= 21, Gender == "female")
Smoke100.)
Smoke100.)
Marijuana. The General Social Survey asked 1,578 US residents: “Do you think the use of marijuana should be made legal, or not?” 963 (61%) of the respondents said it should be made legal.
Marijuana again. The General Social Survey asked 1,578 US residents: “Do you think the use of marijuana should be made legal, or not?” 963 (61%) of the respondents said it should be made legal.
rflip(), but rflip() gets used differently. What is the difference? Why?
Cuckoos. Cuckoos lay their eggs in the nests of other birds. Is the size of cuckoo eggs different in different host species' nests? You can read a data set that has the sizes of cuckoo eggs laid in robin nests and in wren nests using
Cuckoo <- read.csv('https://rpruim.github.io/s145/data/cuckoo2.csv')
head(Cuckoo, 3)
##   length species
## 1  19.85    wren
## 2  20.05    wren
## 3  20.25    wren
These data were analyzed in 1902! The length of the eggs is measured in mm.
Death penalty. This problem uses the DeathPenalty data set in the fastR2 package. This data set was investigated in
Each row of the data set represents a trial for murder.
filter) that contains only cases where the victim was white and repeat parts a – c for this subset.
filter) that contains only cases where the victim was black and repeat parts a – c for this subset.
Reaction time. In a biology lab, students were timed as they reacted to a stimulus either while using a cell phone or not (baseline). You can load the data with
React <- read.csv('https://rpruim.github.io/s145/data/WilstermanReactionTime.csv')
head(React, 3)
##   trial observer baseline cellphone
## 1     1        1    0.236     0.371
## 2     2        1    0.302     0.294
## 3     3        1    0.382     0.330
Body temperature and pulse. Use the BodyTemp50 data set in Lock5withR to answer the following questions. This data set was discussed in
Professor Shoemaker taught at Calvin, so I suspect that the subjects of this study are Calvin students (from the 1990s).
In each part below a null hypothesis and a sample size are given. Use them to compute the expected cell counts.
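Expected cell counts come from multiplying the sample size by the probability the null hypothesis assigns to each cell. Here is a small sketch using a made-up null hypothesis; the categories A, B, and C, their probabilities, and the sample size are all hypothetical and are not one of the parts of this problem.

```r
# Hypothetical null hypothesis: three categories with these probabilities
null_probs <- c(A = 0.5, B = 0.3, C = 0.2)
n <- 200                        # hypothetical sample size

# Expected count in each cell = sample size * null probability
expected <- n * null_probs
expected
```

The expected counts always add up to the sample size, which is a quick way to check your arithmetic.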
Fair die? Alice is interested to know if a die is fair, so she rolls the die 100 times and records each result. Here they are:
1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|
11 | 15 | 24 | 12 | 21 | 19 |
She notices that there were not very many 1’s or 4’s and lots of 3’s. Should she be concerned that the die is unfair? Conduct the appropriate randomization test and give Alice some advice about her die.
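A randomization test for Alice's question can be sketched in base R by simulating many sets of rolls of a fair die and comparing a chi-squared statistic from each to the one computed from her data. Note that the counts in the table add up to 102, so the sketch uses `sum(observed)` as the number of rolls. The helper `chisq_stat()` is defined here for illustration and is not a built-in function.

```r
observed <- c(11, 15, 24, 12, 21, 19)   # counts for faces 1 through 6
n <- sum(observed)                      # number of rolls in the table (102)
expected <- rep(n / 6, 6)               # a fair die: equal expected counts

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chisq_stat <- function(obs, exp) sum((obs - exp)^2 / exp)
obs_stat <- chisq_stat(observed, expected)

# Simulate rolling a fair die n times, 5000 times over
set.seed(145)
null_stats <- replicate(5000, {
  rolls  <- sample(1:6, size = n, replace = TRUE)
  counts <- tabulate(rolls, nbins = 6)
  chisq_stat(counts, expected)
})

# p-value: how often a fair die gives a statistic at least this large
p_value <- mean(null_stats >= obs_stat)
obs_stat
p_value
```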
Frizzled Feathers. The Frizzle fowl is a variety of chicken with curled feathers. In a 1930 experiment, Launder and Dunn crossed Frizzle fowls with the Leghorn variety which has straight feathers. The first generation (F1) produced all slightly frizzled chicks. This made the researchers suspect a co-dominant genetic model. To test this, they interbred F1 to get F2 chicks and recorded the feather type for each chick.
frizzled | slightly frizzled | straight |
---|---|---|
23 | 50 | 20 |
The codominant model predicts a 1:2:1 ratio. Why?
Use these data to assess the genetic model.
Show all the arithmetic needed to calculate the Chi-squared test statistic “by hand”. (You can use R or a calculator to do the arithmetic.)
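The "by hand" arithmetic can be laid out step by step in R, so that each piece (expected counts, per-cell contributions, their sum) is visible. This is one possible layout, not the only one; the final call to `chisq.test()` is there only as a cross-check of the hand calculation.

```r
observed   <- c(frizzled = 23, slightly_frizzled = 50, straight = 20)
null_probs <- c(1, 2, 1) / 4                # the 1:2:1 ratio as probabilities

expected <- sum(observed) * null_probs      # 23.25, 46.50, 23.25

# Each cell contributes (observed - expected)^2 / expected
contributions <- (observed - expected)^2 / expected
chi_sq <- sum(contributions)

# Degrees of freedom = number of cells - 1
df <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

contributions
chi_sq
p_value

# Cross-check against R's built-in goodness-of-fit test
chisq.test(observed, p = null_probs)
```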
Two genetics models. Suppose a biologist has two potential models to explain the phenotype behavior of a plant. One predicts a 9:3:3:1 ratio of the phenotypes (let’s call them A, B, C, and D), and the other predicts a 69:21:6:4 ratio. The data from a plant breeding experiment yield the following counts:
A | B | C | D |
---|---|---|---|
67 | 18 | 10 | 5 |
Conduct hypothesis tests for both models and interpret the results. What should the biologist conclude?
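Both tests can be run with the same data and different null probabilities. The sketch below uses base R's `chisq.test()` with simulated p-values, since under the second model one expected count (4) falls below the usual threshold of 5 for the chi-squared approximation. The seed and the number of simulations are arbitrary choices.

```r
observed <- c(A = 67, B = 18, C = 10, D = 5)

model1 <- c(9, 3, 3, 1) / 16       # 9:3:3:1 ratio as probabilities
model2 <- c(69, 21, 6, 4) / 100    # 69:21:6:4 ratio as probabilities

# Expected counts under each model
sum(observed) * model1
sum(observed) * model2

# Goodness-of-fit tests with p-values computed by simulation
set.seed(145)
test1 <- chisq.test(observed, p = model1, simulate.p.value = TRUE, B = 5000)
test2 <- chisq.test(observed, p = model2, simulate.p.value = TRUE, B = 5000)

test1
test2
```

A large p-value means the data are compatible with a model. It does not show that the model is correct, and it is possible for neither test to reject its model.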
Lung Cancer. National Cancer Institute statistics indicate that the 5-year survival probability for stage 1 lung cancer patients nationally is 0.60. Suppose that out of a cohort of 120 patients with stage 1 lung cancer at the Dana-Farber Cancer Institute (DFCI) treated with a new surgical approach, 80 of the patients survive at least 5 years. Do the data collected from 120 patients support the claim that the DFCI population treated with this new form of surgery has a different 5-year survival probability than the national population? Answer this question three ways.
How do the results compare? (Use 5000 simulations for your randomizations to reduce the amount of randomization variability in your results.)
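The three methods this problem asks for are not listed here, so the sketch below is an assumption about what they might be: a randomization test, an exact binomial test, and a normal-approximation test, all in base R. The methods intended in class (and the functions used for them) may differ.

```r
n  <- 120    # patients in the cohort
x  <- 80     # patients surviving at least 5 years
p0 <- 0.60   # national 5-year survival probability (null value)

# (1) Randomization: simulate 5000 cohorts in which the null is true
set.seed(145)
sim_counts <- rbinom(5000, size = n, prob = p0)
expected   <- n * p0
# Two-sided: at least as far from the expected count as the observed count
# (the small tolerance guards against floating-point rounding)
p_sim <- mean(abs(sim_counts - expected) >= abs(x - expected) - 1e-9)

# (2) Exact binomial test
p_binom <- binom.test(x, n, p = p0)$p.value

# (3) Normal approximation (no continuity correction)
p_norm <- prop.test(x, n, p = p0, correct = FALSE)$p.value

c(randomization = p_sim, binomial = p_binom, normal = p_norm)
```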
Work hours and education. Exercise 5.44 of ISLBS describes two variables from the 2010 General Social Survey (GSS). The data are in the gss2010 data set in the openintro package.
Recreate the table displayed in this problem using df_stats(). (Note the format won’t be exactly the same, but the contents should be.)
Recreate the side-by-side boxplots that appear there.
Are there any reasons to be concerned about using the theoretical ANOVA method with this data set? Be sure to say how you are making your decision.
Create the ANOVA table. (You can compare it to what appears in .)
What is the p-value? What is the conclusion of the test?
Redo this as a randomization test and comment on how the results compare.
Prison isolation. Exercises 5.35 and 5.47 of ISLBS use data from the prison data set in the openintro package. (See Exercise 5.35 for a description of the study and the variables involved.)
But if you look at the data, it isn’t arranged the way we need it to be for ANOVA. The code below will rearrange the data set. The first part creates new variables measuring the change from pre-treatment to post-treatment. The second part rearranges things so that each row represents a case.
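library(dplyr)   # mutate(), select(), %>%
library(tidyr)   # pivot_longer()
Prison2 <-
  prison %>%
  # change from pre-treatment to post-treatment for each treatment
  mutate(
    treatment1 = post_trt1 - pre_trt1,
    treatment2 = post_trt2 - pre_trt2,
    treatment3 = post_trt3 - pre_trt3
  ) %>%
  # keep only the new change variables
  select(matches('treat')) %>%
  # rearrange so that each row represents one case
  pivot_longer(matches('treat'), values_to = "change", names_to = "treatment")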
What must each row of our data set represent when we use lm() and anova()?
What kinds of variables must we have in order to use ANOVA?
Create an appropriate plot of the raw data.
Check to see if the conditions for ANOVA are met well enough that we can feel comfortable using this method.
Answer the questions from ISLBS 5.47 using our modified data set.
And for good measure, compute a p-value again using randomization instead of the mathematical model.
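A randomization version of ANOVA keeps the F statistic but replaces the theoretical F distribution with one built by shuffling the group labels. The sketch below shows the idea in base R on simulated data: the data frame `toy` and its columns `group` and `change` are hypothetical stand-ins, not the prison data, and `f_stat()` is a helper defined here for illustration.

```r
set.seed(145)
# Hypothetical data: three groups of 8 simulated responses
toy <- data.frame(
  group  = rep(c("g1", "g2", "g3"), each = 8),
  change = c(rnorm(8, mean = 0), rnorm(8, mean = 1), rnorm(8, mean = 2))
)

# F statistic taken from the usual ANOVA table
f_stat <- function(data) {
  anova(lm(change ~ group, data = data))[["F value"]][1]
}
obs_F <- f_stat(toy)

# Shuffle the group labels to simulate "no difference among groups"
null_F <- replicate(2000, {
  shuffled <- toy
  shuffled$group <- sample(shuffled$group)
  f_stat(shuffled)
})

# p-value: proportion of shuffles with F at least as large as observed
p_value <- mean(null_F >= obs_F)
obs_F
p_value
```

Only large values of F count as evidence against the null hypothesis, so the p-value uses the upper tail only.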
Helmets and lunches. This problem uses the helmet data set from the openintro package. Use ?helmet to learn more about this data set. We will use this data set to see how the percent of children wearing helmets is related to the percent of students in that neighborhood who receive free or reduced-fee lunches at school.
gf_point(helmet ~ lunch, data = helmet)
What are the cases in this study? How many are there?
Is this an observational study or an experiment?
What is the equation of the least squares regression line?
Interpret the slope in the context of the study.
Is the intercept meaningful in this context? If so, what does it represent? If not, why not?
What percent of helmet wearing does the model predict for a neighborhood where 20% of students receive free or reduced-fee lunch?
What is the residual for this observation?
lunch | helmet |
---|---|
73 | 5.8 |
Do the residuals appear to be approximately normally distributed?
Compute a 95% confidence interval for the slope of the regression line. Do this two ways, once creating a bootstrap distribution using do() and once letting R compute everything for you (using the mathematical model).
Find \(R^2\) for the least squares regression line. What does this value tell us about the relationship between these two variables?
What is the correlation coefficient \(R\)?
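The regression quantities asked about above (fitted line, prediction, residual, confidence interval for the slope, \(R^2\), and correlation) can all be obtained from one fitted model. The sketch below shows the pattern in base R on simulated data: the data frame `toy` with columns `x` and `y` is a hypothetical stand-in for the helmet data, and the bootstrap uses a plain `replicate()` loop in place of do().

```r
set.seed(145)
# Hypothetical data standing in for the helmet data set
toy <- data.frame(x = runif(12, min = 0, max = 50))
toy$y <- 50 - 0.5 * toy$x + rnorm(12, sd = 5)

model <- lm(y ~ x, data = toy)
coef(model)                                  # intercept and slope

# Prediction at x = 20, and residual (observed - predicted) for the first case
pred_20    <- predict(model, newdata = data.frame(x = 20))
residual_1 <- toy$y[1] - predict(model, newdata = toy[1, ])

# 95% confidence interval for the slope from the mathematical model
confint(model, "x", level = 0.95)

# Bootstrap: resample rows with replacement and refit the line each time
boot_slopes <- replicate(2000, {
  resampled <- toy[sample(nrow(toy), replace = TRUE), ]
  coef(lm(y ~ x, data = resampled))[["x"]]
})
quantile(boot_slopes, c(0.025, 0.975))       # percentile interval

# For simple linear regression, R^2 is the square of the correlation
r_squared <- summary(model)$r.squared
r <- cor(toy$x, toy$y)
c(r_squared = r_squared, r = r)
```

The sign of the correlation matches the sign of the slope, which is worth remembering when recovering it from \(R^2\).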