Last updated 2020-11-20 18:15:07
  1. For each of the following questions, (i) say what type of plot you would make to investigate the question, (ii) identify what variables you would need to make your plot and whether they are categorical or numerical, (iii) write down the R command to create that plot (you can make up reasonable names for the variables and data sets), and (iv) sketch what the plot might look like if the answer is yes.

    1. Among physicians working in Kent County, do males earn more than females?
    2. Are first-year students at Calvin more likely than returning students to have used the CampusClear app yesterday?
    3. Do larger strawberries have more seeds on them than smaller strawberries?

  1. Using the penguins data set in the palmerpenguins package, answer the following questions.

    1. Create a plot that shows how the distribution of body mass of the penguins varies by species and by sex.
    2. Does your plot make it clear whether there is a different number of penguins in each group? If not, create a second plot that does. If so, create a second plot that does not.
    3. Based on your plots, are there roughly the same number of penguins of each species in the data set, or are there more of some and fewer of others?
    4. Based on your plots, is one sex heavier on average than the other? Is the difference roughly the same across all species?
    5. The sex values are missing for some of the Adelie penguins. Based on your plot, do you think these are males, females, or some of each? How confident are you in that answer? Explain.
    6. Compute the mean body mass for each species/sex combination in the data set. (Hint: You may be able to change your plot command to df_stats(), perhaps with some minor adjustments; it depends on just which plot you made.)

  1. Using the penguins data set in the palmerpenguins package, answer the following questions. For each you will need to make a plot.

    1. Is there an association between bill length and bill depth? Positive or negative or neither?
    2. Does the answer change if you consider the three species separately?
  2. Smoking. This problem uses the NutritionStudy dataset from the Lock5withR package.

    1. Read the documentation to find out what population these subjects were sampled from. List some ways that the subjects in this sample probably differ from the US population as a whole. Such differences can make it difficult to know to whom the study generalizes.

    2. In the sample, who were more likely to be smokers (Smoke), men or women?

    3. Use these data to test whether we have evidence that the proportion of men who smoke is different from the proportion of women who smoke, or whether the observed difference could be attributed to chance.

    4. There is another variable called EverSmoke in this data set. Why can’t we use this same procedure to see if that variable differs by gender? (We will eventually learn how to do this.)

  3. Exploring Linear Regression. Use this app and take a screen shot

    1. showing a scatter plot with one outlier that changes the slope of the regression line from positive (without the outlier) to negative (with the outlier).

    2. showing a scatter plot with one outlier that doesn’t change the regression line at all.

    3. showing a scatter plot with a correlation coefficient of (approximately) 0 but a strong association between the x and y variables.

    4. showing a scatter plot that has two clusters each of which would have positive correlation on their own, but together have a negative correlation.

  4. Explain why it is always important to look at a plot when doing regression.

  5. COVID-19. It is difficult to get good sensitivity and specificity numbers for COVID-19 tests. There seems to be a great deal of variation in the estimates coming from various studies and there are also a number of different tests and different labs performing the tests under different conditions. An article at https://www.bmj.com/content/369/bmj.m1808 talks about this and suggests that good figures to use for illustrative purposes are a specificity of 95% and a sensitivity of 70%. 20% of people with Terry’s symptoms have COVID-19, so her doctor orders a test. Show your work as you answer the following questions.

    1. If the test comes back positive, what is the chance that she has COVID-19?
    2. If the test comes back negative, what is the chance that she has COVID-19?
    3. How would these two answers change if Terry’s symptoms indicated that she had a 70% chance of having COVID-19?

    When you have done that you can check your work at https://www.bmj.com/content/369/bmj.m1808/infographic.
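
The calculation behind questions like these is Bayes' rule. The following is a minimal sketch in R, not the article's method. The function name `post_test_prob` and the illustrative numbers (sensitivity 0.90, specificity 0.90, pre-test probability 0.10) are assumptions chosen for this example and are deliberately different from the values in the problem.

```r
# Probability of disease given a test result, via Bayes' rule.
# The function name and all numbers below are hypothetical.
post_test_prob <- function(prior, sensitivity, specificity, positive = TRUE) {
  if (positive) {
    true_pos  <- sensitivity * prior              # diseased and test positive
    false_pos <- (1 - specificity) * (1 - prior)  # healthy but test positive
    true_pos / (true_pos + false_pos)
  } else {
    false_neg <- (1 - sensitivity) * prior        # diseased but test negative
    true_neg  <- specificity * (1 - prior)        # healthy and test negative
    false_neg / (false_neg + true_neg)
  }
}

# Chance of disease after a positive test, then after a negative test
post_test_prob(prior = 0.10, sensitivity = 0.90, specificity = 0.90)
post_test_prob(prior = 0.10, sensitivity = 0.90, specificity = 0.90, positive = FALSE)
```

Changing `prior` while holding the test characteristics fixed shows how strongly the answer depends on the pre-test probability.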

  6. Body Temperature. Using the BodyTemp50 data in the Lock5withR package,

    1. Compute the mean body temperature for the 50 subjects in the sample.
    2. Plot the bootstrap distribution for the sample mean.
    3. Compute a 95% confidence interval for the mean body temperature.
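
A bootstrap percentile interval for a mean can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated (the vector `x` is made up, not BodyTemp50), and the course may instead expect the resampling tools from the mosaic package.

```r
set.seed(145)                        # for reproducibility
x <- rnorm(50, mean = 70, sd = 5)    # made-up sample of 50 measurements

# Resample the data with replacement many times, recording the mean each time
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))

# Plot the bootstrap distribution
hist(boot_means, main = "Bootstrap distribution of the sample mean")

# Percentile interval: the middle 95% of the bootstrap means
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```
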
  7. Pulse. Repeat the previous problem for pulse instead of body temperature.

  8. Height and BMI (men). The following code will select just the men in their 20s from the NHANES data set.

    library(NHANES)
    library(dplyr)   # provides %>% and filter()
    Men20s <- NHANES %>% filter(Age >= 20, Age <= 29, Gender == "male")
    1. How many men in their 20s are in this data set?
    2. Compute a 95% confidence interval for the mean height of American men in their 20s.
    3. Compute a 95% confidence interval for the mean BMI (body mass index) of American men in their 20s.
  9. Height and BMI (women). Repeat the previous problem for women instead of men.

  10. Smoking. The following code will create data sets with just the adult men and adult women from NHANES.

    library(NHANES)
    library(dplyr)   # provides %>% and filter()
    Men   <- NHANES %>% filter(Age >= 21, Gender == "male")
    Women <- NHANES %>% filter(Age >= 21, Gender == "female")
    1. What proportion of women in their 20s in the sample have smoked at least 100 cigarettes? (Use Smoke100.)
    2. What proportion of adult men in their 20s in the sample have smoked at least 100 cigarettes? (Use Smoke100.)
    3. Compute a 95% confidence interval for the proportion of American women in their 20s who have smoked at least 100 cigarettes.
    4. Compute a 95% confidence interval for the proportion of American men in their 20s who have smoked at least 100 cigarettes.
  11. Marijuana. The General Social Survey asked 1,578 US residents: “Do you think the use of marijuana should be made legal, or not?” 963 (61%) of the respondents said it should be made legal.

    1. Is 61% a sample statistic or a population parameter? Explain.
    2. Construct a 95% confidence interval for the proportion of US residents who think marijuana should be made legal, and interpret it in the context of the data.
    3. A news piece on this survey’s findings states, “Majority of Americans think marijuana should be legalized.” Based on your confidence interval, is this news piece’s statement justified?
  12. Marijuana again. The General Social Survey asked 1,578 US residents: “Do you think the use of marijuana should be made legal, or not?” 963 (61%) of the respondents said it should be made legal.

    1. Test the null hypothesis that the proportion of Americans who think the use of marijuana should be made legal is 1/2. Use a two-sided alternative.
    2. Do the results of your hypothesis test seem reasonable considering your results in the previous problem?
    3. For both problems, you can use rflip(), but rflip() gets used differently. What is the difference? Why?
    4. Make a histogram of your randomization distribution for this problem and your bootstrap distribution from the previous problem. How do they compare?
  13. Cuckoos. Cuckoos lay their eggs in the nests of other birds. Is the size of cuckoo eggs different in different host species' nests? You can read a data set that has the sizes of cuckoo eggs laid in robin nests and in wren nests using

    Cuckoo <- read.csv('https://rpruim.github.io/s145/data/cuckoo2.csv')
    head(Cuckoo, 3)
    ##   length species
    ## 1  19.85    wren
    ## 2  20.05    wren
    ## 3  20.25    wren

    These data were analyzed in 1902! The length of the eggs is measured in mm.

    1. Make a good plot of the data.
    2. Calculate the mean length of the eggs for each host species.
    3. What do you think? Does it look like the size differs among the different host species? Or do you think the differences observed in this sample could be attributed to random chance?
    4. Conduct a formal hypothesis test to see if the data support your conclusion.
    5. Now create a 95% confidence interval for the difference in mean lengths of the eggs laid in the different types of nests.
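
One common way to test for a difference in two group means is a randomization (permutation) test that shuffles the group labels. The sketch below uses base R and simulated data; the data frame `d`, its columns, and the helper `diff_in_means` are hypothetical names for illustration, not the Cuckoo data or the approach required by the course.

```r
set.seed(145)
# Hypothetical data: a numeric response and a two-level grouping variable
d <- data.frame(
  y     = c(rnorm(15, mean = 22), rnorm(15, mean = 21)),
  group = rep(c("a", "b"), each = 15)
)

# Test statistic: difference in group means (a minus b)
diff_in_means <- function(y, group) {
  mean(y[group == "a"]) - mean(y[group == "b"])
}

observed <- diff_in_means(d$y, d$group)

# Shuffle the group labels to simulate a world where the null hypothesis is true
null_diffs <- replicate(5000, diff_in_means(d$y, sample(d$group)))

# Two-sided p-value: proportion of shuffles at least as extreme as the data
p_value <- mean(abs(null_diffs) >= abs(observed))
p_value
```
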
  14. Death penalty. This problem uses the DeathPenalty data set in the fastR2 package. This data set was investigated in

    • Radelet, M. (1981). Racial characteristics and imposition of the death penalty. American Sociological Review, 46:918–927.

    Each row of the data set represents a trial for murder.

    1. Compute the proportions of white defendants and of black defendants who received the death penalty. In the sample, which race of defendant was more likely to receive the death penalty?
    2. If there were really no difference in the death penalty rates for black and white defendants, how likely would it be to see a difference at least that large in our data?
    3. Create a 95% confidence interval for the difference in proportions of black and white defendants who receive the death penalty.
    4. Now create a subset of the data (use filter) that contains only cases where the victim was white and repeat parts 1 to 3 for this subset.
    5. Now create a subset of the data (use filter) that contains only cases where the victim was black and repeat parts 1 to 3 for this subset.
    6. Looking back at all the results in this problem, what do you observe?
  15. Reaction time. In a biology lab, students were timed as they reacted to a stimulus either while using a cell phone or not (baseline). You can load the data with

    React <- read.csv('https://rpruim.github.io/s145/data/WilstermanReactionTime.csv')
    head(React, 3)
    ##   trial observer baseline cellphone
    ## 1     1        1    0.236     0.371
    ## 2     2        1    0.302     0.294
    ## 3     3        1    0.382     0.330
    1. Why would this be considered a paired design?
    2. Create a new variable for the difference between baseline and cell phone reaction times.
    3. On average, how much slower was the reaction time while using a cell phone in the sample data?
    4. Create a confidence interval for the mean difference in reaction times under these two conditions.
    5. Create a confidence interval for the mean ratio between the two reaction times under these two conditions. Interpret the results as a percent difference between the two reaction times.
  16. Body temperature and pulse. Use the BodyTemp50 data set in Lock5withR to answer the following questions. This data set was discussed in

    • Shoemaker, “What’s Normal: Temperature, Gender and Heartrate”, Journal of Statistics Education, Vol. 4, No. 2 (1996)

    Professor Shoemaker taught at Calvin, so I suspect that the subjects of this study are Calvin students (from the 1990s).

    1. How many men and how many women were in this study? Do you think that is a coincidence or part of the study design?
    2. Create a 95% confidence interval for the difference in mean body temperature for men and women. Should you resample within groups?
    3. Create a 95% confidence interval for the difference in mean pulse for men and women. Should you resample within groups?
    4. Create a 95% confidence interval for the correlation between pulse and body temperature. Should you resample within groups? What does this interval tell you about the association between pulse and body temperature in college-age adults?
    5. Compute a p-value for the null hypothesis that the correlation between pulse and body temperature is 0. Is the p-value consistent with what you found out from the confidence interval?
  17. In each part below a null hypothesis and a sample size is given. Use them to compute the expected cell counts.

    1. \(H_0\): \(p_1 = 0.1, p_2 = 0.2, p_3 = 0.3, p_4 = 0.4\), \(n = 100\)
    2. \(H_0\): \(p_1 = 1/3, p_2 = 1/3, p_3 = 1/3\), \(n = 200\)
    3. \(H_0\): \(p_1 = 1/4, p_2 = 1/2, p_3 = 1/4\), \(n = 50\)
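
Expected cell counts under a null hypothesis are the sample size times each hypothesized proportion. A tiny sketch in R, using made-up values for `p` and `n` that do not match any part above:

```r
p <- c(0.5, 0.3, 0.2)   # hypothetical null proportions (they must sum to 1)
n <- 60                 # hypothetical sample size

expected <- n * p       # expected count in each cell
expected
```
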
  18. Fair die? Alice is interested to know if a die is fair, so she rolls the die 100 times and records each result. Here they are:

    result    1    2    3    4    5    6
    count    11   15   24   12   21   19

    She notices that there were not very many 1’s or 4’s and lots of 3’s. Should she be concerned that the die is unfair? Conduct the appropriate randomization test and give Alice some advice about her die.

  19. Frizzled Feathers. The Frizzle fowl is a variety of chicken with curled feathers. In a 1930 experiment, Launder and Dunn crossed Frizzle fowls with the Leghorn variety, which has straight feathers. The first generation (F1) produced all slightly frizzled chicks. This made the researchers suspect a co-dominant genetic model. To test this, they interbred F1 to get F2 chicks and recorded the feather type for each chick.

    frizzled   slightly frizzled   straight
    23         50                  20
    1. The codominant model predicts a 1:2:1 ratio. Why?

    2. Use these data to assess the genetic model.

    3. Show all the arithmetic needed to calculate the Chi-squared test statistic “by hand”. (You can use R or a calculator to do the arithmetic.)
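
As an illustration of the "by hand" arithmetic, here is a short R sketch of the chi-squared goodness-of-fit statistic, the sum of (observed - expected)^2 / expected over the cells. The counts are made up for this example (they are not the Frizzle data), and the variable names are assumptions.

```r
observed <- c(30, 50, 20)            # made-up counts, n = 100
p_null   <- c(1, 2, 1) / 4           # a 1:2:1 ratio written as proportions
expected <- sum(observed) * p_null   # 25, 50, 25

# Contribution of each cell, then the total
contributions <- (observed - expected)^2 / expected
chisq_stat <- sum(contributions)
chisq_stat

# Cross-check against R's built-in computation
chisq.test(observed, p = p_null)$statistic
```

Printing `contributions` shows which cells are responsible for most of the statistic.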

  20. Two genetics models. Suppose a biologist has two potential models to explain the phenotype behavior of a plant. One predicts a 9:3:3:1 ratio of the phenotypes (let’s call them A, B, C, and D), and the other predicts a 69:21:6:4 ratio. The data from a plant breeding experiment yield the following counts:

    A    B    C    D
    67   18   10   5

    Conduct hypothesis tests for both models and interpret the results. What should the biologist conclude?

  21. Lung Cancer. National Cancer Institute statistics indicate that the 5-year survival probability for stage 1 lung cancer patients nationally is 0.60. Suppose that out of a cohort of 120 patients with stage 1 lung cancer at the Dana-Farber Cancer Institute (DFCI) treated with a new surgical approach, 80 of the patients survive at least 5 years. Do the data collected from 120 patients support the claim that the DFCI population treated with this new form of surgery has a different 5-year survival probability than the national population? Answer this question three ways:

    1. using a 1-proportion randomization test,
    2. using a 1-proportion z-test (based on the normal approximation),
    3. using a chi-squared goodness of fit randomization test

    How do the results compare? (Use 5000 simulations for your randomizations to reduce the amount of randomization variability in your results.)
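
For reference, the usual test statistic for a 1-proportion z-test is shown below. The symbols are assumptions of this note rather than notation from the problem: \(\hat{p}\) is the sample proportion, \(p_0\) is the proportion under the null hypothesis, and \(n\) is the sample size.

\[
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}
\]

Under the null hypothesis, \(z\) is approximately standard normal when \(n\) is large, so a two-sided p-value is \(2 \cdot P(Z \ge |z|)\).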

  22. Work hours and education. Exercise 5.44 of ISLBS describes two variables from the 2010 General Social Survey (GSS). The data are in the gss2010 data set in the openintro package.

    1. Recreate the table displayed in this problem using df_stats(). (Note the format won’t be exactly the same, but the contents should be.)

    2. Recreate the side-by-side boxplots that appear there.

    3. Are there any reasons to be concerned about using the theoretical ANOVA method with this data set? Be sure to say how you are making your decision.

    4. Create the ANOVA table. (You can compare it to what appears in .)

    5. What is the p-value? What is the conclusion of the test?

    6. Redo this as a randomization test and comment on how the results compare.

  23. Prison isolation. Exercises 5.35 and 5.47 of ISLBS use data from the prison data set in the openintro package. (See Exercise 5.35 for a description of the study and the variables involved.)

    But if you look at the data, it isn’t arranged the way we need it to be for ANOVA. The code below will rearrange the data set. The first part creates new variables measuring the change from pre-treatment to post-treatment. The second part rearranges things so that each row represents a case.

    library(dplyr)   # provides %>%, mutate(), and select()
    library(tidyr)   # provides pivot_longer()
    Prison2 <-
      prison %>% 
      mutate(
        treatment1 = post_trt1 - pre_trt1,
        treatment2 = post_trt2 - pre_trt2,
        treatment3 = post_trt3 - pre_trt3
      ) %>% 
      select(matches('treat')) %>%
      pivot_longer(matches('treat'), values_to = "change", names_to = "treatment")
    1. What must each row of our data set represent when we use lm() and anova()?

    2. What kinds of variables must we have in order to use ANOVA?

    3. Create an appropriate plot of the raw data.

    4. Check to see if the conditions for ANOVA are met well enough that we can feel comfortable using this method.

    5. Answer the questions from ISLBS 5.47 using our modified data set.

    6. And for good measure, compute a p-value again using randomization instead of the mathematical model.

  24. Helmets and lunches. This problem uses the helmet data set from the openintro package. Use ?helmet to learn more about this data set. We will use this data set to see how the percent of children wearing helmets is related to the percent of students in that neighborhood who receive free or reduced-fee lunches at school.

    gf_point(helmet ~ lunch, data = helmet)

    1. What are the cases in this study? How many are there?

    2. Is this an observational study or an experiment?

    3. What is the equation of the least squares regression line?

    4. Interpret the slope in the context of the study.

    5. Is the intercept meaningful in this context? If so, what does it represent? If not, why not?

    6. What percent of helmet wearing does the model predict for a neighborhood where 20% of students receive free or reduced-fee lunch?

    7. What is the residual for this observation?

      lunch   helmet
      73      5.8
    8. Do the residuals appear to be approximately normally distributed?

    9. Compute a 95% confidence interval for the slope of the regression line. Do this two ways, once creating a bootstrap distribution using do() and once letting R compute everything for you (using the mathematical model).

    10. Find \(R^2\) for the least squares regression line. What does this value tell us about the relationship between these two variables?

    11. What is the correlation coefficient \(R\)?
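
The quantities asked about in a regression problem like this one (fitted line, prediction, residual, model-based interval for the slope, \(R^2\), and correlation) can be sketched in base R as follows. The data here are simulated, and the names `d`, `x`, and `y` are placeholders, not the helmet data.

```r
set.seed(145)
# Simulated data with a negative linear trend plus noise
d <- data.frame(x = runif(30, min = 0, max = 100))
d$y <- 40 - 0.3 * d$x + rnorm(30, sd = 5)

model <- lm(y ~ x, data = d)
coef(model)                                  # intercept and slope

# Predicted response at x = 20
predict(model, newdata = data.frame(x = 20))

# Residual for the first case: observed minus predicted
res1 <- d$y[1] - predict(model, newdata = d[1, ])
res1

confint(model, level = 0.95)                 # model-based 95% intervals
r_squared <- summary(model)$r.squared
r <- cor(d$x, d$y)
c(r_squared = r_squared, r = r)
```

For simple linear regression, \(R^2\) equals the square of the correlation coefficient, and the sign of the correlation matches the sign of the slope.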