Stat 145
Intro Biostatistics
Fall 2020



Dope Sheets


Here’s where you can come to find out what’s up each day. Homework assignments are available on the homework page.

Week 1

Tue, Sept 1

  • Meet outdoors between Spoelhof and the road, near the Jonah/fish/cheese sculpture.

  • Topic of the Day: Introduction to Stat 145

  • Before tomorrow’s class

    • Read Chapter 1 of IMS. You don’t have to pick up every detail in this reading. Instead, focus on the following key terms.

      • case (also called observational unit) vs. variable

      • numerical (also called quantitative) variable vs. categorical variable

      • population vs. sample

      • experiment vs. observational study

      • explanatory variable vs. response variable

      • parameter vs. statistic

      • bias

      There is less terminology in statistics than in biology, but it must be used more carefully, so it is important that we understand these terms. Notice that most of these come in pairs. These pairs make some important distinctions. (Example of a distinction from biology: male vs. female makes an important distinction in many types of organisms.)

      Note: Most of these terms can also be found in Sections 1.1-1.3 of ISLBS. Feel free to look there for a (slightly) different explanation and different examples.

    • Do PS 0

      • Submit it via Gradescope, and

      • bring it to class tomorrow.

      This assignment is a little bit unusual. I don’t usually have things due before we discuss them, but I want to use this assignment to guide our discussion tomorrow. It will be graded for completion and effort, not for correctness. It will also give you a chance to try out Gradescope in a low stakes setting.


Wed, Sept 2

  • Before class: complete PS 0. Submit it via Gradescope, and also bring it with you.

  • Meet outdoors between Spoelhof and the road, near the Jonah/fish/cheese sculpture.

  • Handout: Data and Studies


Fri, Sept 4

  • Meet online (Teams) – video

  • Test Blackout Dates

    • We’ve decided not to have tests on advising days, so now we need to pick the dates for our tests. You can help me avoid bad days by filling out this form where you can list dates that don’t work well for you. I can’t promise to avoid every person’s blackout dates, but I’ll try to avoid dates that seem to be a problem for several people in the class.
  • Groups for the day (first person start the meeting, others join)

    • Group 1: Saul Miranda Valencia, Marian Henderson, Alyssa Dekker
    • Group 2: Nathan Haverstick, Abigail Liebetreu, Lucas Walker
    • Group 3: Bryce Reynolds, Abigail Strong, Claire Stannis
    • Group 4: Brielle De Nooyer, Brant Van Noord, Christian Swaim
    • Group 5: Alex Van Uffelen, Lyric Johnson, Elizabeth Griffen
    • Group 6: Jared Van Noord, Elijah Faith, Clinton Jackson
    • Group 7: Ana Li Warners, Robin Kollar, Cameron Massy
    • Group 8: Samuel Ydenberg, Michael Akpabey, Brandon Turcotte
    • Group 9: Hannah Brown, Sara Koenig, Eleanor Scheeres
    • Group 10: Hayden Janssen, Jeffrey Arthur, Emma Thompson
  • Topic of the day: Exploring data with plots

    • Google Presentation for your assignment.
      • Details are in part C of the tutorial below.
    • Plotting Tutorial (hosted online)
      • Part A (Births) - We went through this part in class.
      • Part B (NHANES) – This will give you a chance to try things with a different data set and lots of specific direction.
      • Part C (more practice) – This will let you explore on your own and introduce a few more “extras”.
    • Alternative (only use if hosted tutorial isn’t working for some reason): Run in RStudio console. Copy and paste one of the commands below – the one for the part you want to do.
    learnr::run_tutorial('PlottingBasics2020A', 
      package = 'StatTutor')
    learnr::run_tutorial('PlottingBasics2020B', 
      package = 'StatTutor')
    learnr::run_tutorial('PlottingBasics2020C', 
      package = 'StatTutor')

Week 2

Groups

Groups for this week (first person start the meeting, others join)

  • Group 1: Saul Miranda Valencia, Marian Henderson, Alyssa Dekker
  • Group 2: Nathan Haverstick, Abigail Liebetreu, Lucas Walker
  • Group 3: Bryce Reynolds, Abigail Strong, Claire Stannis
  • Group 4: Brielle De Nooyer, Brant Van Noord, Christian Swaim
  • Group 5: Alex Van Uffelen, Lyric Johnson, Elizabeth Griffen
  • Group 6: Jared Van Noord, Elijah Faith, Clinton Jackson
  • Group 7: Ana Li Warners, Robin Kollar, Cameron Massy
  • Group 8: Samuel Ydenberg, Michael Akpabey, Brandon Turcotte
  • Group 9: Hannah Brown, Sara Koenig, Eleanor Scheeres
  • Group 10: Hayden Janssen, Jeffrey Arthur, Emma Thompson

Mon, Sept 7

  • video recording
  • Announcements
    • PS 01 (gradescope) and PS 02 (google presentation) are due tonight.

    • The problems in IMS Chapter 2 were reordered this morning. I’ve adjusted my homework sheet to match, but let me know if you spot any errors on my part.

    • Submit your test blackout dates soon. Fill out the form multiple times to submit multiple dates.

    • Contribute to the text by submitting any errors you find. I’ll give a few bonus homework points to students whose contributions are frequent and high quality. Check the previous reports before submitting to avoid submitting duplicates.

  • Topics of the Day

Tue, Sept 8

  • Summarizing Exercise Worksheet: [HTML] [PDF]

  • Topic of the Day: Exploring Numerical Data

    • 2 variables
    • 1 variable
      • numbers: mean, median, standard deviation, IQR, symmetry, skew
      • pictures: histogram, density plot, box plot, violin plot, dot plot
    • transformations can change shape of distributions and patterns of association
      • “Power” transformations: \(x \mapsto x^p\) \[ \stackrel{-1}{\mathrm{reciprocal}} \ < \ \stackrel{0}{\log} \ < \ \stackrel{1/2}{\mathrm{square\ root}} \ < \ \stackrel{1}{\mathrm{identity}} \ < \ \stackrel{2}{\mathrm{square}} \ < \ \stackrel{3}{\mathrm{cube}} < \cdots \]
      • log transformations are natural in many situations.
    • Cool thing for exploring data: mplot(county)
      • You can replace county with any other data frame.
      • Use “Show Expression” button to get the code to make the plot. (You can copy and paste that into RMarkdown, for example.)
      • Note: there is currently a bug in RStudio that sometimes keeps this from working properly. If it doesn’t work for you, type this in your console and try again.
      library(manipulate)
      manipulate(plot(A), A = slider(1, 10))
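
The power-transformation ladder in today's topics can be tried out on a skewed variable. This is only a sketch: the data are simulated (not a course data set), and skew_gap() is a made-up helper that compares the mean to the median as a rough symmetry check.

```r
# Hypothetical right-skewed data (simulated, not from the course)
set.seed(145)
x <- rlnorm(500, meanlog = 3, sdlog = 1)

# Rough skew measure: (mean - median) / sd; values near 0 suggest symmetry
skew_gap <- function(v) (mean(v) - median(v)) / sd(v)

# Apply several rungs of the ladder and compare
ladder <- c(
  log         = skew_gap(log(x)),
  square_root = skew_gap(sqrt(x)),
  identity    = skew_gap(x),
  square      = skew_gap(x^2)
)
round(ladder, 2)
```

For data like these, the log rung should come out much closer to 0 than the identity rung. Try hist(x) and hist(log(x)) to see the change in shape.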

Wed, Sept 9

  • Announcements
    • PS 3A (RMarkdown) and PS 3B (IMS 2.1) are due tomorrow at 11:59, even though we don’t have class on Thursdays.
  • Topic of the Day: Exploring Categorical Data
    • Worksheets
      • Summarizing Exercise (from last time): [HTML] [PDF]
      • Summarizing Categorical Data (new): [HTML] [PDF]
    • computing proportions – keep your eye on the denominator!
    • bar charts – dodge or stack?
    • pie charts – should we ever use them?
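
To see why the denominator matters when computing proportions, here is a small sketch in base R. The counts are invented for illustration, and prop.table() is just one way to do this (the worksheets may use other functions).

```r
# Hypothetical counts (invented for illustration)
counts <- matrix(c(30, 10,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group   = c("treatment", "control"),
                                 outcome = c("improved", "not improved")))

overall <- prop.table(counts)              # denominator: all 100 cases
by_row  <- prop.table(counts, margin = 1)  # denominator: each group's total
by_col  <- prop.table(counts, margin = 2)  # denominator: each outcome's total

overall
by_row
by_col
```

The same cell (treatment, improved) gives three different proportions, one for each choice of denominator.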

Fri, Sept 11

  • Intro to RStudio and RMarkdown in GIFS

  • Self-Quiz 1

  • Topic of the Day: Numerical and Graphical Summary “Wrap-Up”

    • Not really a wrap-up: we will continue to get better as the semester goes along.
    • Designing and creating plots
      • Anatomy: What are the components of a plot and how do we describe them to R?
      • Physiology: What makes a plot good (for a particular purpose)?
    • Worksheets
      • Summarizing Exercise (from Tuesday): [HTML] [PDF]
      • Summarizing Categorical Data (from Wednesday): [HTML] [PDF]

Week 3

Groups for the Week

  • Group 1: Jared Van Noord, Alyssa Dekker, Alex Van Uffelen, Emma Thompson
  • Group 2: Hyeon Kim, Clinton Jackson, Christian Swaim, Marian Henderson
  • Group 3: Eleanor Scheeres, Elijah Faith, Bryce Reynolds, Cameron Massy
  • Group 4: Jeffrey Arthur, Nathan Haverstick, Brandon Turcotte, Claire Stannis
  • Group 5: Abigail Strong, Abigail Liebetreu, Samuel Ydenberg, Michael Akpabey
  • Group 6: Sara Koenig, Brielle De Nooyer, Saul Miranda Valencia, Lyric Johnson
  • Group 7: Robin Kollar, Ana Li Warners, Hayden Janssen, Brant Van Noord
  • Group 8: Elizabeth Griffen, Hannah Brown, Lucas Walker

Mon, Sep 14

  • Announcements
    • PS 4 due today at 11:59 pm
    • New Resource: R Examples
    • Watch for news about tomorrow’s class
  • Topic of the Day: Malaria Vaccine Case Study (IMS 2.4)
    • Malaria Vaccine Case Study Worksheet: [HTML] [PDF]

Tue, Sep 15

  • Announcements
    • Meet in CFAC Tent today

    • Combining PDFs for gradescope: If you need to combine multiple PDFs into a single document to submit to gradescope, you can find tools for this online. (Note: If you submit multiple times in gradescope, each submission replaces the previous submission. This lets you submit a revised version if you find an error, or complete more of the assignment. But if you have your assignment in multiple parts – perhaps some done by hand and some done with R – you need to combine them before submitting.)

      Here’s one PDF combiner that I found:

      Let me know if you find something better and I’ll add it to the list.

  • Topics of the Day
    • Evaluating the evidence in the Malaria Vaccine Case Study
      • Understanding the logic of this simulation is very important
      • We will learn how to get R to do this much more quickly tomorrow.
    • Fitting lines to data: [HTML] [PDF]
      • Key idea: How does SSE measure how well a line fits a data set?

Wed, Sep 16

  • Announcements
    • If you tried to submit multiple documents for PS 4, please combine them into a single PDF and resubmit. (Gradescope replaces previous submissions each time you submit.)
    • Next PS due on Thursday
    • Test 1 on Friday, Oct 2
  • Topics of the Day
    • Using simulations to evaluate evidence
    • Least Squares Regression Lines
      • Using SSE to find the least squares line
      • lm() – R’s function for finding this line
      • Residuals and residual plots
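
Here is a sketch of how SSE measures fit and how lm() finds the least squares line. It uses R's built-in cars data as a stand-in (not a class data set), and sse() is a helper written just for this example.

```r
# Sum of squared errors for the line y = intercept + slope * x
sse <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

x <- cars$speed
y <- cars$dist

# SSE for an eyeballed guess at a line
guess_sse <- sse(-10, 3, x, y)

# lm() finds the intercept and slope that make SSE as small as possible
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)
fit_sse <- sse(b[["(Intercept)"]], b[["speed"]], x, y)

c(guess = guess_sse, least_squares = fit_sse)

# Residuals are observed minus fitted values; their squares add up to SSE
sum(resid(fit)^2)
```

No matter what intercept and slope you try in place of the guess, the SSE should never come out smaller than the one from lm().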

Fri, Sep 18

  • Announcements
  • Topics of the Day
    • Testing for a difference in proportions – [HTML] [PDF]

Week 4

Mon, Sep 21

  • Announcements

    • Test 1 next Friday
    • PS due tonight at 11:59pm
      • PS 1 and PS 3A have been graded. More grading coming soon.
      • Be sure to correctly indicate which problems are on which pages
  • Topics of the Day:

    • Follow-up on Testing for a difference in proportions – [HTML] [PDF]
      • generalizability
      • does it matter that there are more right-handers? [case-control study designs]
      • expressing statistical hypotheses with mathematical notation
      • drawing conclusions by comparing test statistic to null distribution
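
The logic of comparing a test statistic to a null distribution can be sketched with a shuffling simulation. The data below are invented (two groups "A" and "B" with a yes/no outcome), and diff_prop() is a helper made up for this example. The worksheets may do the same thing with different functions.

```r
set.seed(145)

# Hypothetical data: 1 = "yes", 0 = "no"
group   <- rep(c("A", "B"), times = c(40, 60))
outcome <- c(rep(c(1, 0), times = c(28, 12)),   # group A: 28 of 40 say yes
             rep(c(1, 0), times = c(27, 33)))   # group B: 27 of 60 say yes

# Test statistic: difference in sample proportions (A minus B)
diff_prop <- function(outcome, group) {
  mean(outcome[group == "A"]) - mean(outcome[group == "B"])
}
observed <- diff_prop(outcome, group)

# Null distribution: if group doesn't matter, shuffling labels changes nothing
null_diffs <- replicate(5000, diff_prop(outcome, sample(group)))

# 2-sided p-value: how often is a shuffled difference at least this extreme?
p_value <- mean(abs(null_diffs) >= abs(observed))

observed
p_value
```

A histogram of null_diffs with a vertical line at observed shows the same comparison as a picture.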

Tue, Sep 22

Wed, Sep 23

  • New Groups!

    • Group 1: Hannah Brown, Michael Akpabey, Clinton Jackson, Sara Koenig
    • Group 2: Elizabeth Griffen, Lucas Walker, Hayden Janssen, Robin Kollar
    • Group 3: Alex Van Uffelen, Bryce Reynolds, Emma Thompson, Brandon Turcotte
    • Group 4: Hyeon Kim, Alyssa Dekker, Brant Van Noord, Claire Stannis
    • Group 5: Cameron Massy, Elijah Faith, Samuel Ydenberg, Christian Swaim
    • Group 6: Saul Miranda Valencia, Ana Li Warners, Marian Henderson, Abigail Strong
    • Group 7: Jeffrey Arthur, Jared Van Noord, Nathan Haverstick, Lyric Johnson
    • Group 8: Abigail Liebetreu, Eleanor Scheeres, Brielle De Nooyer
  • Topic of the Day: The Hypothesis Testing Framework (4 steps, terminology, etc.)

    • Worksheet: [HTML] [PDF]
    • Four Steps of Hypothesis Testing [HTML] [PDF]

Fri, Sep 25

  • Announcements
    • PS 8 will be a bit different
      • Due next Tuesday at 11:59pm
      • PS 8 will be done in Moodle (so it can be auto-graded in time for your test preparation).
      • The assignment should be posted later today and will include some material that we won’t cover until Monday.
      • No assignment due next Thursday (you will be preparing for your test).
    • I expect you will see graded homework trickling in to Gradescope over the next few days. The grader is trying hard to get it all graded before your test.
    • Test next Friday
      • Study guide has been posted
      • Details regarding logistics next week
    • Next Tuesday and Wednesday will be mainly review and practice. I’ve scheduled the CFAC tent again, and we will meet there unless the weather doesn’t cooperate.
  • Topic of the Day: Null Distributions and Normal Distributions

Week 5

Groups for the Week

  • Group 1: Lucas Walker, Samuel Ydenberg, Abigail Liebetreu, Jeffrey Arthur
  • Group 2: Brant Van Noord, Brielle De Nooyer, Ana Li Warners, Claire Stannis
  • Group 3: Hayden Janssen, Hyeon Kim, Bryce Reynolds, Nathan Haverstick
  • Group 4: Cameron Massy, Brandon Turcotte, Eleanor Scheeres, Lyric Johnson
  • Group 5: Jared Van Noord, Christian Swaim, Marian Henderson, Michael Akpabey
  • Group 6: Elizabeth Griffen, Clinton Jackson, Emma Thompson, Elijah Faith
  • Group 7: Robin Kollar, Saul Miranda Valencia, Alyssa Dekker, Abigail Strong
  • Group 8: Sara Koenig, Alex Van Uffelen, Hannah Brown

Mon, 9/28

  • Announcements
    • Test on Friday will use Moodle/Respondus/Gradescope combo
      • I’ll create a practice item so you can see how the system works
    • Meet in CFAC Tent tomorrow – bring your laptop if you have one
  • Topics of the Day

Tue, 9/29

  • Meet in CFAC Tent

  • Practice Quiz for Test 1 is available in Moodle. Read and follow these directions to get things all set up to use Respondus for our test.

  • More Practice with Hypothesis Tests: [HTML] [PDF]

Wed, 9/30

Fri, 10/2

  • Test 1 today

Week 6

Mon, 10/5

  • New groups:
    • Group 1: Cameron Massy, Jeffrey Arthur, Abigail Liebetreu, Saul Miranda Valencia
    • Group 2: Michael Akpabey, Christian Swaim, Eleanor Scheeres, Ana Li Warners
    • Group 3: Elijah Faith, Lyric Johnson, Bryce Reynolds, Nathan Haverstick
    • Group 4: Brandon Turcotte, Robin Kollar, Sara Koenig, Alex Van Uffelen
    • Group 5: Hyeon Kim, Marian Henderson, Abigail Strong, Hayden Janssen
    • Group 6: Samuel Ydenberg, Emma Thompson, Alyssa Dekker, Hannah Brown
    • Group 7: Clinton Jackson, Brant Van Noord, Jared Van Noord, Elizabeth Griffen
    • Group 8: Claire Stannis, Lucas Walker, Brielle De Nooyer
  • Announcements:
    • Test 1 has been graded.
    • Meet in the tent tomorrow (pending reservation confirmation)
    • Next problem set due Thursday night (will be posted later today once tent reservation is confirmed)
  • Topic of the Day: Probability
    • What is probability?
    • Some probability rules
    • Note: IMS does not have a probability section but ISLBS does. You can download ISLBS for free as a PDF.

Tue, 10/6

  • Meet in the CFAC Tent today

  • Topic of the Day: Probability Calculations

Wed, 10/7

  • Topic of the Day: Probability and Biology
    • Worksheet: [HTML] [PDF]

    • Focus your attention on the first 3 problems

      • Bob’s boxes
      • Breast cancer screening
      • Disease testing
    • If you finish those three sections, work on some of the remaining problems which reinforce probability tools we have already learned but show ways they come up in biological settings.

    • There are some comments/hints/solutions to these below. Take a look at them after each section before moving on to the next.

Comments/Hints/Solutions

Box A has 9 balls; Box B has 7 balls. The boxes are equally likely to be selected, but since Box A has more balls than Box B, each individual ball in Box A is less likely to be chosen than each ball in Box B.

If it helps, imagine an extreme case: Suppose Box A had 1 million balls and Box B had 1. The balls in Box A would be very unlikely to be selected – even if Box A were chosen. The ball in Box B would be chosen half the time – every time Box B is chosen.

Note 1: The color of the balls is not the issue here. We are asking about the individual balls, not their color.

What’s the point? We cannot use the equally likely rule by counting balls since the balls are not equally likely.

  • First Row:
    • Selecting boxes: \(\operatorname{P}(A) = 1/2\); \(\operatorname{P}(B) = 1/2\)
  • Second Row:
    • \(\operatorname{P}(A \operatorname{and}B) = 0\). (They are mutually exclusive.)
    • \(A \operatorname{or}B\) is the sample space (we only have those two boxes), so \(\operatorname{P}(A \operatorname{or}B) = 1\).
    • We will come back to the last two items in this row shortly.
  • Third Row: Two are easy, two are more challenging
    • Selecting colors from within a box:
      • \(\operatorname{P}(R \mid A) = \frac{7}{9}\) since 7 of the 9 balls in that box are red (and they are all equally likely if Box A is chosen).
      • \(\operatorname{P}(R \mid B) = \frac{3}{7}\) since 3 of the 7 balls in that box are red (and they are all equally likely if Box B is chosen).
    • Our question: \(\operatorname{P}(A \mid R)\). If we know the ball was red, then what is the probability that it was drawn from Box A?

We can get some more probabilities if we use the product rule: \(\operatorname{P}(E \operatorname{and}F) = \operatorname{P}(E) \cdot \operatorname{P}(F \mid E)\)

  • \(\operatorname{P}(A \operatorname{and}R) = \operatorname{P}(A) \cdot \operatorname{P}(R \mid A) =\) ___________
  • \(\operatorname{P}(B \operatorname{and}R) = \operatorname{P}(B) \cdot \operatorname{P}(R \mid B) =\) ___________

Now we can get even more probabilities using \(\operatorname{P}(E \operatorname{or}F) = \operatorname{P}(E) + \operatorname{P}(F) - \operatorname{P}(E\operatorname{and}F)\). Do you see how? (Note: this equation involves 4 probabilities. If you ever know 3 of them, you can solve for the fourth.)

We want \(\displaystyle \operatorname{P}(A \mid R) = \frac{\operatorname{P}(A \operatorname{and}R)}{\operatorname{P}(R)}\). The top part of that fraction is in our inventory. But how do we get the bottom part?

A red ball could have come from either box, so let’s use \(R = (A \operatorname{and}R) \operatorname{or}(B \operatorname{and}R)\) and our rule for unions.

\(\operatorname{P}(R) = \operatorname{P}((A \operatorname{and}R) \operatorname{or}(B \operatorname{and}R)) =\) ________ + __________ - _________

Combining everything we get

\(\operatorname{P}(A \mid R) = \displaystyle \frac{\operatorname{P}(A \operatorname{and}R)}{\operatorname{P}(R)}\) \(= \displaystyle \frac{\operatorname{P}(A) \cdot \operatorname{P}(R \mid A)}{\operatorname{P}(A) \cdot \operatorname{P}(R \mid A) + \operatorname{P}(B) \cdot \operatorname{P}(R \mid B)}\) \(= \displaystyle \frac{\frac12 \cdot \frac79}{\frac12 \cdot \frac79 + \frac12 \cdot \frac37}\) \(=\) 0.6447

Notice how the question has the color in the condition and the solution has the box in the condition. This is our first example of a problem that flips those around. The general approach is similar in all of these problems and is usually credited to Rev. Thomas Bayes. (The formal theorem involved is called Bayes’ Theorem, and there is an entire branch of statistics called Bayesian statistics based on this idea. You can learn about that in Stats 341.)
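
If you want to check the Bob's boxes arithmetic in R, here is one way to do it. The variable names are made up for this sketch, and the simulation at the end is just a sanity check, not part of the worksheet.

```r
# Probabilities given in the problem
p_A <- 1/2;          p_B <- 1/2
p_R_given_A <- 7/9;  p_R_given_B <- 3/7

# Product rule
p_A_and_R <- p_A * p_R_given_A
p_B_and_R <- p_B * p_R_given_B

# (A and R) and (B and R) are mutually exclusive, so the "and" term is 0
p_R <- p_A_and_R + p_B_and_R

p_A_given_R <- p_A_and_R / p_R
round(p_A_given_R, 4)

# Simulation check: pick a box, then a ball, many times
set.seed(145)
n   <- 100000
box <- sample(c("A", "B"), n, replace = TRUE)
red <- runif(n) < ifelse(box == "A", 7/9, 3/7)
sim_estimate <- mean(box[red] == "A")   # proportion of red draws from Box A
sim_estimate
```

The simulated proportion should land close to the exact answer, which is 49/76.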

The approach used above is very algebraic. We can organize the same arithmetic visually using the method of probability trees.

First we set up the tree in this video.

Note: This video uses \(\cap\) to mean “and”. (That’s a common mathematical notation.)

Now that we have our tree, we can use the tree to get the probability we want in this video.

This problem has a similar structure to Bob’s boxes. Let \(C\) be the event that a woman has breast cancer. Let \(+\) mean there is a positive test result. (A positive test result means the test thinks the person has cancer.) Start by identifying the three numbers we are given. Make sure you use good notation.

  • 0.017 = \(P(________)\)
  • 0.780 = \(P(________)\)
  • 0.10 = \(P(________)\) Be careful with this one.

Now see if you can get from there to the probability we want, which is \(P(________)\)


Note: There is a little bit of ambiguity here about whether all the figures (which come from different sources) align to the same people.

Let’s assume that the probabilities given are the probabilities among women who are tested. That might not actually be true of the first figure (which might apply to the whole population). But since we don’t have any information about who does/does not get tested, that’s the best we can do with the data at hand.

Our calculations could be misleading if the rate of cancer is much higher among those who are tested. It probably is at least somewhat higher – we test people more when they are more likely to have cancer.

Since we know \(\operatorname{P}(C)\) and \(\operatorname{P}(+ \mid C)\), it makes sense to set up our tree like this:

[Figure: probability tree]

Note: We are using \(\overline{C}\) instead of \(C^c\) because \(C^c\) looks a little strange.

This time you just get an answer to check your work. If you didn’t get this answer, review the solutions above (and make sure you have the right notation for each of your events and probabilities).

\(\operatorname{P}(D \mid P) = \frac{\frac1{1000}(0.99)}{ \frac1{1000}(0.99) + \frac{999}{1000} (0.02)} =\) \(\frac{0.00099}{0.00099 + 0.01998} =\) 0.0472

Note 1: That probability is probably smaller than you expected, but it is much larger than \(\operatorname{P}(D)\) – it’s about 50 times as big. So a positive test result increases your chances of having the disease by more than an order of magnitude.

Note 2: This is why we don’t test everyone for everything – the result would be lots of false positives. But if we only test people with symptoms, then our pre-test probability of having the disease is much higher. To see for yourself, redo this problem assuming \(\operatorname{P}(D)\) is \(1/10\), as it might be for someone who has symptoms that could indicate the disease, but could also be something else.

Note 3: In order to accurately interpret the results of a medical test, we need to know more than just the quality of the test. We also need to know the chances that the person taking the test might have the condition being tested for.
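
The effect of the pre-test probability is easy to explore with a small function. This is a sketch: prob_disease_given_positive() is a made-up name, and the inputs assume \(\operatorname{P}(+ \mid D) = 0.99\) and \(\operatorname{P}(+ \mid \mbox{no } D) = 0.02\), the values that appear in the calculation above.

```r
# prior     = P(D), the chance of disease before the test
# sens      = P(+ | D), chance of a positive test if diseased
# false_pos = P(+ | no D), chance of a positive test if not diseased
prob_disease_given_positive <- function(prior, sens, false_pos) {
  (prior * sens) / (prior * sens + (1 - prior) * false_pos)
}

# Testing everyone (disease is rare) vs. testing people with symptoms
rare        <- prob_disease_given_positive(1/1000, sens = 0.99, false_pos = 0.02)
symptomatic <- prob_disease_given_positive(1/10,   sens = 0.99, false_pos = 0.02)

round(c(rare = rare, symptomatic = symptomatic), 4)
```

The test is the same in both calls. Only the prior changes, and that alone moves the answer from under 5% to about 85%.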

Note: You can ask very similar questions about the coronavirus, and most of the interesting questions are conditional probabilities like

  • If I get this and I’m young and healthy, what is the probability that I will need to be hospitalized? (Or to make it less personal, what proportion of young, healthy individuals who contract Covid-19 will need hospitalization?)

  • What proportion of people who are extreme social distancers will contract Covid-19?

  • What proportion of mild social distancers will contract Covid-19?

  • What proportion of our health care workers will contract Covid-19?

  • Eventually: What proportion of people who are vaccinated will be immune?

Right now, people are working hard to get estimates of these numbers, but we probably won’t know for sure until after things have settled down and epidemiologists have had some time to go over the data.

Fri, 10/9

  • Meet in the CFAC tent (The weather is supposed to be really nice, but perhaps a little bit cooler first thing in the morning.)

  • Topic of the Day: Probability Wrap Up

    • Worksheet: [HTML] [PDF]

    • Begin with problems 1 and 2 in the “More Medical Tests” section.

    • Then return to the “Plants” and “People” sections.

    • Finally, if there is time, look at the white cats problem.

Week 7

Mon, 10/12

  • New Groups
    • Group 1: Christian Swaim, Michael Akpabey, Brandon Turcotte, Sara Koenig
    • Group 2: Ana Li Warners, Hannah Brown, Robin Kollar, Lyric Johnson
    • Group 3: Eleanor Scheeres, Brielle De Nooyer, Brant Van Noord, Lucas Walker
    • Group 4: Elijah Faith, Alyssa Dekker, Abigail Liebetreu, Hyeon Kim
    • Group 5: Marian Henderson, Emma Thompson, Claire Stannis, Jared Van Noord
    • Group 6: Jeffrey Arthur, Bryce Reynolds, Clinton Jackson, Hayden Janssen
    • Group 7: Samuel Ydenberg, Abigail Strong, Nathan Haverstick, Cameron Massy
    • Group 8: Saul Miranda Valencia, Elizabeth Griffen, Alex Van Uffelen
  • Topic of the Day: Estimation

Tue, 10/13

Wed, 10/14

  • Test 2 date: Tue or Wed, Oct 27 or 28

  • Topic of the Day: Confidence Intervals from Bootstrap Distributions

Fri, 10/16

  • Test 2 date: Tue, Oct 27

  • Topic of the Day: Confidence Intervals from Bootstrap Distributions

Week 8

Mon, 10/19

  • New Groups

    • Group 1: Alex Van Uffelen, Elizabeth Griffen, Bryce Reynolds, Ana Li Warners
    • Group 2: Hyeon Kim, Abigail Strong, Nathan Haverstick, Sara Koenig
    • Group 3: Saul Miranda Valencia, Samuel Ydenberg, Christian Swaim, Hannah Brown
    • Group 4: Claire Stannis, Alyssa Dekker, Brandon Turcotte, Eleanor Scheeres
    • Group 5: Brant Van Noord, Michael Akpabey, Emma Thompson, Jared Van Noord
    • Group 6: Marian Henderson, Cameron Massy, Lucas Walker, Jeffrey Arthur
    • Group 7: Lyric Johnson, Elijah Faith, Abigail Liebetreu
    • Group 8: Clinton Jackson, Robin Kollar, Hayden Janssen
  • Topic of the Day: Exploring Differences

Tue, 10/20

  • No class tomorrow (advising day).

  • Test 2 next week Tuesday.

  • Little Survey – observational study or experiment?

    • Feel free to get some of your friends or roommates to take the survey to increase our sample size.
  • Resampling within Groups

  • Worksheet [HTML] [PDF]

Fri, 10/23

  • Meet in the CFAC Tent

  • Topic of the Day: Confidence intervals and p-values

  • Worksheets

    • Confidence Intervals and p-values: [HTML] [PDF]
    • Hypothesis tests for 1 mean: [HTML] [PDF]

The bootstrap distribution is centered at 42% = 0.42. The null distribution is centered at 0.5 (because that’s the value specified in the null hypothesis). The data could be summarized with the bar chart. The taller bar is for the 17 correct responses and the shorter bar for the 13 incorrect responses in the sample.

The null distribution is centered at 0 and the bootstrap distribution at -0.19 (because the difference in the sample proportions is 0.40 - 0.59 = -0.19 – if we did the subtraction the other way around, things would be centered at 0.19). The data can be summarized using the multi-colored bar chart. It looks like light blue is not favoring a ban and dark blue is favoring a ban. The left group is 2000 and the right group is 2010.

The bootstrap distribution is centered at 98.26, our estimate from the sample – the lower right plot. The data are also centered at 98.26, but are more spread out – the upper right plot. The null distribution is centered at 98.6, the usual claim for what normal body temperature is.

The bootstrap distribution should be centered at 42.7 - 38.5 = 4.2, and the middle 95% should stretch from 1.04 to 7.36, with 2.5% on either end of that. As we have seen above, the bootstrap distributions and null distributions have very similar shapes but appear to be shifted. So the null distribution should be centered at 0 and the middle 95% should stretch to \(\pm 3.16\) (half the width of the interval from 1.04 to 7.36).

Note that 0 is not in the confidence interval. This means that it is not in the middle 95% of the bootstrap distribution centered at 4.2. This means that 4.2 is not in the middle 95% of the null distribution centered at 0, so our p-value will be less than 0.05. This makes sense. If 0 is not a plausible value for the estimand, we should reject it. The 95% confidence level for the interval corresponds to the \(\alpha = 0.05\) significance level for a 2-sided test.

Notice that although the data provide fairly strong evidence that there is a difference in mean improvement with and without the preparation, they also provide evidence that on average the difference is not that large since the confidence interval stretches from roughly 1 to roughly 7.5.
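
The link between a 95% confidence interval and a 2-sided test at \(\alpha = 0.05\) can be seen in a simulation. This is only a sketch: the scores are simulated (not the worksheet data), the groups prep and no_prep are hypothetical, and treating the shifted bootstrap distribution as the null distribution is an approximation based on the two having very similar shapes.

```r
set.seed(145)

# Hypothetical improvement scores (simulated; not the worksheet data)
prep    <- rnorm(30, mean = 45, sd = 5)
no_prep <- rnorm(30, mean = 38, sd = 5)
observed <- mean(prep) - mean(no_prep)

# Bootstrap distribution: resample within each group
boot_diffs <- replicate(5000,
  mean(sample(prep, replace = TRUE)) - mean(sample(no_prep, replace = TRUE)))
ci <- quantile(boot_diffs, c(0.025, 0.975))   # middle 95%

# Approximate null distribution: same shape, shifted to be centered at 0
null_diffs <- boot_diffs - mean(boot_diffs)
p_value <- mean(abs(null_diffs) >= abs(observed))

ci
p_value
```

With a difference this large, 0 falls outside the interval and the p-value falls below 0.05. Try making the two means closer together and watch both conclusions change together.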

The original version of this section left off one of our situations: the paired design. You should add that to your list.

Remember that you can find examples of all or most of these in the R Example document.

Week 9

Mon, 10/26: Review Day

  • Use this Google Doc to post questions you have.

  • Test 2 tomorrow

    • Like on the first test, you will not need to use RStudio during the test, but you may be asked to provide small amounts of R code or to interpret R output.
    • You will have 65 minutes for the test. The test will open at 8:30am and close at 10:30 am.

Tue, 10/27: Test 2

Wed, 10/28

  • Problem set due tomorrow night (will be posted later this morning)

  • New Groups

    • Group 1: Abigail Liebetreu, Brant Van Noord, Abigail Strong, Elijah Faith
    • Group 2: Marian Henderson, Lyric Johnson, Clinton Jackson, Ana Li Warners
    • Group 3: Cameron Massy, Emma Thompson, Hayden Janssen, Alyssa Dekker
    • Group 4: Samuel Ydenberg, Eleanor Scheeres, Jeffrey Arthur, Alex Van Uffelen
    • Group 5: Jared Van Noord, Robin Kollar, Michael Akpabey, Christian Swaim
    • Group 6: Brandon Turcotte, Bryce Reynolds, Nathan Haverstick, Saul Miranda Valencia
    • Group 7: Hyeon Kim, Hannah Brown, Elizabeth Griffen, Lucas Walker
  • Topic of the Day: Inference for 1 proportion using Normal Distributions and SE

Fri, 10/30

  • Don’t forget to turn your clocks back this weekend!

  • Topic of the Day: Inference for 2 proportions using Normal Distributions and SE

Week 10

Tue, 11/3

  • Advising (and Election) Day, No Class

Wed, 11/4

  • New Groups:

    • Group 1: Cameron Massy, Nathan Haverstick, Saul Miranda Valencia, Brant Van Noord
    • Group 2: Robin Kollar, Christian Swaim, Hayden Janssen, Bryce Reynolds
    • Group 3: Alex Van Uffelen, Marian Henderson, Lyric Johnson, Hyeon Kim
    • Group 4: Elijah Faith, Michael Akpabey, Elizabeth Griffen, Brandon Turcotte
    • Group 5: Eleanor Scheeres, Jared Van Noord, Ana Li Warners
    • Group 6: Lucas Walker, Samuel Ydenberg, Alyssa Dekker
    • Group 7: Abigail Strong, Emma Thompson, Abigail Liebetreu
  • How Wet is the World?

  • Topic of the Day: Goodness of Fit Testing (Golf balls in the yard)

Fri, 11/6

  • Topic of the Day: Goodness of Fit Testing
    • Randomization when the null hypothesis is something other than all proportions are the same.
    • Worksheet: [HTML] [PDF]
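
Here is a sketch of randomization for a goodness of fit test when the null hypothesis is not "all proportions are the same". The counts and null proportions are invented for illustration, and chisq_stat() is a helper written for this example. The last line compares the simulation to base R's chisq.test().

```r
set.seed(145)

observed   <- c(45, 30, 25)          # hypothetical counts in 3 categories
null_probs <- c(0.50, 0.25, 0.25)    # hypothetical null proportions
n <- sum(observed)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chisq_stat <- function(counts, probs) {
  expected <- sum(counts) * probs
  sum((counts - expected)^2 / expected)
}
obs_stat <- chisq_stat(observed, null_probs)

# Simulate samples of size n from the null proportions (one column each)
sims <- rmultinom(5000, size = n, prob = null_probs)
null_stats <- apply(sims, 2, chisq_stat, probs = null_probs)

# p-value: how often is a simulated statistic at least as large as ours?
p_sim <- mean(null_stats >= obs_stat)
p_sim

# Chi-squared approximation for comparison
p_theory <- chisq.test(observed, p = null_probs)$p.value
p_theory
```

The two p-values will not match exactly, since one comes from simulation and the other from a chi-squared distribution, but they should be close.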

Week 11

Mon, 11/9

  • Topics of the Day:
    • Chi-squared distributions for Chi-squared goodness of fit tests
    • chisq.test() and xchisq.test()
  • Worksheet [HTML] [PDF]

Tue, 11/10

  • Test 3 date: Tue, Dec 1 (Tuesday after Thanksgiving)

  • Topic of the Day: Chi-squared tests for two-way tables (2 categorical variables)

  • Worksheet [HTML] [PDF]

Wed, 11/11

  • Test 3 date: Tue, Dec 1 (Tuesday after Thanksgiving)

  • Topics of the Day:

    • Overview of where we have been and where we are going (a.k.a. how do I know what to do with my data?)
    • Theoretical methods for the 1-mean situation
  • Worksheet [HTML] [PDF]

Fri, 11/13

  • Test 3 date: Tue, Dec 1 (Tuesday after Thanksgiving)

  • Topic of the Day: “2-sample t”

    • The method from last time is often called “1-sample t” because we are looking at one mean and using a t distribution.
    • Today we will look at the difference between 2 means, traditionally called “2-sample t”.
      • But we don’t have two data sets, we have one data set with two variables: a quantitative response and a Cat2 explanatory.
      • “2-groups t” might be a more accurate name.
    • Two study designs – both analyzed the same way:
      • Sample two groups separately (and control the size of each).
      • One sample with two “questions” for each case – won’t know in advance how many will be in each group.
    • t.test() can automate both types of t test if we have the raw data.
  • Worksheets
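
Here is a sketch of both kinds of t test using t.test() with raw data. It uses R's built-in sleep data as a stand-in, only to show the syntax. (Those data actually come from a paired design, so treating the two groups as separate is for illustration only.)

```r
# 1-sample t: one quantitative variable, null hypothesis mean = 0
one_sample <- t.test(sleep$extra, mu = 0)

# 2-sample ("2-groups") t: one data frame with a quantitative response
# (extra) and a two-level categorical explanatory variable (group)
two_sample <- t.test(extra ~ group, data = sleep)

one_sample$p.value
two_sample$p.value
two_sample$conf.int    # interval for the difference in means
```

Notice that the 2-sample version takes a single data frame and a formula of the form response ~ explanatory, not two separate data sets.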

Week 12

Mon, 11/16

  • New Groups

    • Group 1: Lucas Walker, Jared Van Noord, Lyric Johnson, Alex Van Uffelen, Eleanor Scheeres
    • Group 2: Hyeon Kim, Christian Swaim, Brant Van Noord, Marian Henderson
    • Group 3: Cameron Massy, Robin Kollar, Hayden Janssen, Elijah Faith
    • Group 4: Abigail Liebetreu, Nathan Haverstick, Michael Akpabey, Samuel Ydenberg
    • Group 5: Saul Miranda Valencia, Bryce Reynolds, Abigail Strong, Emma Thompson
    • Group 6: Ana Li Warners, Alyssa Dekker, Elizabeth Griffen, Brandon Turcotte
  • Topic of the Day: ANOVA (Analysis of Variance)

  • ANOVA Worksheet: [HTML] [PDF]
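
A minimal ANOVA sketch, assuming a made-up data set with three groups (the names `Plants`, `fertilizer`, and `growth` are hypothetical). It fits the model with `lm()`, prints the ANOVA table with `anova()`, and then rebuilds the F statistic and p-value from the table.

```r
# Hypothetical data: three groups of four (made up for illustration)
Plants <- data.frame(
  fertilizer = rep(c("A", "B", "C"), each = 4),
  growth     = c(4, 5, 6, 5,  7, 8, 6, 7,  5, 6, 5, 4)
)

model <- lm(growth ~ fertilizer, data = Plants)
tab <- anova(model)    # ANOVA table: df, SS, MS, F, p-value
tab

# The F statistic is the ratio of the two mean squares
mean_sq <- tab[["Mean Sq"]]
F_stat  <- mean_sq[1] / mean_sq[2]
p_value <- 1 - pf(F_stat, df1 = tab$Df[1], df2 = tab$Df[2])
c(F = F_stat, p_value = p_value)
```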

Tue, 11/17

  • Don’t start the video recording if I forget – remind me to do it instead.

    • If students start the recording, the recording isn’t available to others and I don’t have access to it in the usual way either.
  • Topic of the Day: More ANOVA (Analysis of Variance)

  • ANOVA Worksheet: [HTML] [PDF]

Wed, 11/18

  • Office hours tonight at 8pm. No office hours tomorrow night.

  • Topic of the Day: ANOVA and factor()

  • ANOVA Worksheets:
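
One reason `factor()` matters for ANOVA: if the groups are recorded as numbers, `lm()` treats them as a quantitative variable unless told otherwise. The sketch below uses a made-up data set (`Diets` is a hypothetical name) to show the difference in degrees of freedom.

```r
# Hypothetical data with groups recorded as numeric codes (made up)
Diets <- data.frame(
  diet   = rep(c(1, 2, 3), each = 4),
  weight = c(4, 5, 6, 5,  7, 8, 6, 7,  5, 6, 5, 4)
)

# With a numeric diet, lm() fits a regression line: 1 df for diet
as_number <- anova(lm(weight ~ diet, data = Diets))

# With factor(diet), lm() compares the three group means: 2 df for diet
as_factor <- anova(lm(weight ~ factor(diet), data = Diets))

as_number$Df   # 1 and 10
as_factor$Df   # 2 and 9
```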

Fri, 11/20

  • Topic of the Day: Inference for Regression

  • Worksheet: [HTML] [PDF]
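
A short sketch of inference for the slope of a regression line, using made-up data. The coefficient table from `summary()` already contains the t statistic and p-value. The by-hand lines show where they come from, assuming the usual null hypothesis that the slope is 0.

```r
# Hypothetical data (made up for illustration)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(3, 2, 6, 5, 9, 7, 8, 12)

model <- lm(y ~ x)
coefs <- summary(model)$coefficients   # estimate, SE, t value, p-value
coefs

# t statistic for the slope: (estimate - 0) / SE, with n - 2 df
t_slope <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]
p_slope <- 2 * (1 - pt(abs(t_slope), df = length(x) - 2))

ci <- confint(model, level = 0.95)     # intervals for intercept and slope
ci
```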

Week 13

Mon, 11/23

  • Topic of the Day: More Regression

    • The mathematical model for linear regression is based on 4 conditions
      • Linear association between explanatory and response variables
      • Independent errors/residuals
      • Normal errors/residuals
      • Equal standard deviation of errors/residuals
    • We can remember these by remembering the word LINE.
    • We can check to see how well the conditions are met by inspecting residuals.
    • Florida Lakes example
  • Worksheet: [HTML] [PDF]
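
One way to inspect residuals, sketched with made-up data (this is not the Florida Lakes data). A residuals vs. fitted plot helps with the Linear and Equal standard deviation conditions, and a normal quantile plot helps with the Normal condition. Independence usually has to be judged from how the data were collected rather than from a plot.

```r
# Hypothetical data (made up for illustration; not the Florida Lakes data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(3, 2, 6, 5, 9, 7, 8, 12)

model <- lm(y ~ x)
res   <- resid(model)     # residuals: observed - fitted
fit   <- fitted(model)    # fitted values on the line

# Linear + Equal SD: look for no curve and roughly constant vertical spread
plot(fit, res, xlab = "fitted value", ylab = "residual")
abline(h = 0, lty = 2)

# Normal: points should fall close to the reference line
qqnorm(res)
qqline(res)
```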

Tue, 11/24

  • Test 3 is next Tuesday

    • Test Info
    • I’ll say more about some logistics (RStudio, etc.) next Monday
  • Topic of the Day: Review

Week 14

Mon, 11/30

  • Test 3 – when should we have it?

    • Fill out this Google Form to let me know your preferences.

    • Result: Test 3 will be on Wednesday

1 - pt(1.91, df = 10)
## [1] 0.04260244
pt(-3.45, df = 16)
## [1] 0.001646786
2 * (1 - pt(0.83, df = 6))
## [1] 0.4383084
1 - pt(2.13, df = 27)
## [1] 0.02121769

The mean is at the midpoint of the interval: 71

The margin of error is the distance from the edges of the interval to the center: 6

\(t_*\):

t_star <- qt(0.95, df = 24); t_star
## [1] 1.710882

Since \(ME = t_* SE\), we can now solve for SE:

SE <- 6 / t_star; SE
## [1] 3.506963

Finally, \(SE = \frac{s}{\sqrt{n}}\), so

s <- SE * sqrt(25); s
## [1] 17.53481
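
As a check on the back-calculation, the sketch below goes in the other direction: starting from s, it rebuilds the margin of error and the interval. It assumes n = 25 and the `qt(0.95, df = 24)` critical value used above, which corresponds to a 90% confidence level.

```r
# Rebuild the interval from s to confirm the back-calculation
n      <- 25
t_star <- qt(0.95, df = n - 1)   # same critical value as above
SE     <- 6 / t_star
s      <- SE * sqrt(n)

ME <- t_star * s / sqrt(n)       # margin of error recovered from s
interval <- 71 + c(-1, 1) * ME   # should give back 65 and 77
interval
```
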
  1. Lung Cancer. National Cancer Institute statistics indicate that the 5-year survival probability for stage 1 lung cancer patients nationally is 0.60. Suppose that out of a cohort of 120 patients with stage 1 lung cancer at the Dana-Farber Cancer Institute (DFCI) treated with a new surgical approach, 80 of the patients survive at least 5 years. Do the data collected from 120 patients support the claim that the DFCI population treated with this new form of surgery has a different 5-year survival probability than the national population? Answer this question three ways:

    1. using a 1-proportion randomization test,
    2. using a 1-proportion z-test (based on the normal approximation),
    3. using a chi-squared goodness of fit randomization test

    How do the results compare? (Use 5000 simulations for your randomizations to reduce the amount of randomization variability in your results.)

1-proportion via randomization:

Rand1 <- do(5000) * rflip(120, prob = 0.6)
prop( ~(prop >= 80/120), data = Rand1) * 2
## prop_TRUE 
##    0.1532

1-proportion via z-test:

\[\begin{align*} \hat{p} &= \frac{80}{120} = 0.667 \\ SE &= \sqrt{ \frac{ p_0 \cdot (1 - p_0) }{ n } } = \sqrt{ \frac{ 0.6 \cdot 0.4 }{ 120 } } = 0.0447 \\ z &= \frac{ \hat{p} - p_0 }{ SE } \\ &= \frac{ 0.667 - 0.6}{ 0.0447} \\ &= 1.49 \\ \mbox{p-value} &= 0.14 \end{align*}\]
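
The same z-test arithmetic can be done in R. This is a sketch of the calculation using `pnorm()` for the two-sided p-value. The variable names are just for illustration.

```r
p_hat <- 80 / 120
p0    <- 0.6
n     <- 120

SE <- sqrt(p0 * (1 - p0) / n)        # standard error under the null hypothesis
z  <- (p_hat - p0) / SE
p_value <- 2 * (1 - pnorm(abs(z)))   # two-sided p-value

c(SE = SE, z = z, p_value = p_value)

# With one degree of freedom, z^2 matches the X-squared value that
# chisq.test() reports for the same counts
z^2
```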

Chi-squared:

chisq.test( c(80, 40), p = c(0.6, 0.4))
## 
##  Chi-squared test for given probabilities
## 
## data:  c(80, 40)
## X-squared = 2.2222, df = 1, p-value = 0.136
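
The call above gets its p-value from the chi-squared distribution. For a randomization version, one option (a sketch, not necessarily the approach used in class) is the `simulate.p.value` argument of `chisq.test()`, with `B` setting the number of simulations. The seed is arbitrary.

```r
set.seed(145)   # arbitrary seed so the simulation is reproducible
sim <- chisq.test(c(80, 40), p = c(0.6, 0.4),
                  simulate.p.value = TRUE, B = 5000)

sim$statistic   # same X-squared as before
sim$p.value     # estimated from 5000 simulated samples
```

The simulated p-value will vary a little from run to run, and it may differ slightly from the theoretical one because the counts are discrete.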

Tue, 12/1

Wed, 12/2

  • Test 3 today

Fri 12/4 & Mon 12/7

  • Almost done.
    • Monday is our last class.
    • Final Exam: Fri 12/11 at 1:30 pm
    • I expect to get you Test 3 results later today.
    • Last Groups
      • Group 1: Brant Van Noord, Lucas Walker, Alyssa Dekker, Saul Miranda Valencia
      • Group 2: Lyric Johnson, Abigail Strong, Hyeon Kim, Brandon Turcotte
      • Group 3: Samuel Ydenberg, Alex Van Uffelen, Hayden Janssen, Robin Kollar
      • Group 4: Christian Swaim, Ana Li Warners, Jared Van Noord, Emma Thompson
      • Group 5: Bryce Reynolds, Marian Henderson, Nathan Haverstick, Elijah Faith
      • Group 6: Michael Akpabey, Abigail Liebetreu, Elizabeth Griffen, Eleanor Scheeres
  • Topic of the Day: Statistics in publications
    • Reading Statistics Worksheet
      • We will work on this Friday and Monday.
      • For Wednesday next week, you will turn in a report on 4 of the articles (your choice). You may do this on your own or in a group of at most 3. (If your group has 4 people, you could choose to divide into two groups of two.) Although this will be mostly text, use R Markdown for your write-up.

  1. Definitions selected from Webster’s online dictionary↩︎

  2. One advantage of an online text is that you can search it. You may find it handy to search for these terms.↩︎