Reading Published Statistics

Seven Critical Components of a Statistical Report

The list below is modified from a list on page 16 of Seeing Through Statistics, 2nd ed. by Jessica Utts. It enumerates seven critical things to look for when reading a report (and seven things to include if you are writing a report). The level of detail will vary depending on the audience. Much of this information might be missing in a news report, but it should be present in a good scientific paper and in the better news reports.

Source of funding (Why was the research done? Who paid for it?)

Knowing if the author’s had a stake in a particular outcome can be important information. Many scientific journals now require conflict of interest declarations.
Researcher contact (How did researchers interact with subjects?)

This is most important for studies with human subjects, but could apply to animal studies as well.
Individuals studied and how they were selected

Typically there is an implicit or explicit claim that the results of the study generalize more boradly. Knowing if that claim is warranted and to what population is a key to knowing what the study really says. Look for ways that the cases in the study might be different from the population the authors want to know about.
Measurements made (questions asked)

Good public opinion surveys will tell you exactly how they phrased their question.
Good scientific articles will have a methods section that explains exactly how measurements were made. (If you are not familiar with the area, the description description might be a bit hard to follow, but it should be clear to anyone working in that area.)

Summary reports are often very vague about this. They might refer to “size”, for example, without saying whether they are measuring height or weight or volume or soemthing else.
Setting in which the data were collected

Were the data collected in a highly controlled laboratory environment or were they collected “in the wild”? Experiment or observational study?
Extraneous differences and other explanations

Many studies purport to show that two or more things are different in some way. But it is possible that the cause of the difference is something other than the explanatory variable. This is especially true of observational studies, but it can effect experiments as well. Often studies will include a section showing all the ways their treatment groups do not differ as a way to address this issue.

But no study can include everything, and you want to be sure that there isn’t some obvious unconsidered difference that could explain the results.
Magnitude of claimed effect

“Statistical significance” simply means that some p-value is below a specified threshold. But that doesn’t mean that the result is important or meaninful. A good report will include a measure of the effect size. Many researchers prefer to use confidence intervals rather than (or in addition to) p-values for this reason.

Sometimes modifiers are used: “highly significant”, “marginally significant”, etc. Some authors also use “suggestive” for p-values that are just a bit larger than their cut-off (usually 0.05, but sometimes not).

The following questions are different and may have different answers:
- Can I discern that two things are different (on average)? [statistical significance]
- Is the difference large enough to matter? [magnitude of effect]

Figuring out the Statistical Methods

Scientific papers are often very terse in the description of the statistical methods that they use. And they may use methods that are unfamiliar to you. But here are some things you can look for.

P-values (often just labeled p, or perhaps just roughly indicated by some sort of symbol reflecting the size of the p-value) indicate that a hypothesis test was done.
- If the test statistic is included the letter used (\(z\), \(t\), \(X^2\), \(F\), etc.) can be a clue to the type of test that was done.
- Tests are often named after people. If you see “we used so-and-so’s test”, you can usually do an internet search to find out what that test is. Wikipedia is pretty good for this sort of thing.
- Watch for degrees of freedom. Degrees of freedom occur in many tests. We’ve seen this in Chi-squared, ANOVA, and regression. If you think you know what method is being used, check to see if the degrees of freedom match what you are thinking.
- For randomization tests, you are likely to see words like “randomization” or “Monte Carlo” (in reference to the casinos there).
Confidence intervals may be expressed in interval notation

\[(a, b)\] or in plus-minus notation \[\mbox{estimate} \pm \mbox{margin of error}\]
- But be careful, sometimes plus-minus notation is used for estimate \(\pm\) standard error or estimate \(\pm\) standard deviation. Always check to be sure you know which way \(\pm\) is being used.
- Some types of intervals cannot be expressed as estimate \(\pm\) margin of error, especially if the underlying distribution is not symmetric.
Identify the explanatory and response variables and whether they are categorical or quantitative. That will often narrow down the options considerably.
Many (most?) scientific papers include more than a single explanatory variable.
- We haven’t covered these methods. (Take Stat 245 to learn about them!) But you can often extrapolate from what you know to have a pretty good sense for what is going on.
- Some variables may be included not because the researchers are interested in them directly but because they know those variables impact the response and they want to know whether some other variable also affects the response. Such variables are sometimes called covariates. You will also sometimes see the language “after controlling for \(x\)” or “after adjusting for \(x\)”. That usually means that \(x\) was a covariate that needed to be included in the study to more appropriately understand the effect of the explanatory variable of interest.

Some Practice

Let’s try reading some articles/reports. Keep the things above in mind as you answer the questions below.

Note: You won’t need to read the entire article to answer these questions.

Smiling in Yearbook Photos

This isn’t a very earthshaking article, but it is short, easy to understand, and includes the major sections you will find in many scientific papers.

Read the abstract. Based on the abstract alone, what variables do you think the researchers will be collecting for this study?
Now read the Introduction and Method sections. What variables did they actually use? How is that different from what you thought after reading the abstract?
Near the end of the Method section it says “The observers were blind about the hypothesis”. What does this mean and why is it important?
What does \(\chi(2, N = 1183) = 85.25, p <.001\) mean? (This appears in the Results section.)
Convert Table 1 into a table of counts and use it do a Chi-squared test. How does your p-value compare with what is reported in the Results section?
There are two other Chi-squared tests reported. How do they differ from the one using Table 1? How do they differ from each other?
How well does this paper stack up against the seven critical components? Which are fully discussed? partly discussed? missing?

Malaria

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5347610/pdf/pnas.201615324.pdf

This paper is the source of one of our first examples of using a randomization test. (Remember those cards with smiley and sad faces?)

Read the abstract of this paper and answer the following questions

There are two confidence intervals listed in the abstract. Why are they so wide?
Would it be appropriate to use our SE-formula methods to create these confidence intervals? Why or why not?
We could do this by randomization – that’s how we did it when we first encountered this example. But the situation for 1-proportion is so simple that the mathematics of the randomization distribution can be worked our exactly. The distribution involved is called the binomial distribution – bi because there are two options (get malaria or do not get malaria). R has a function called binom.test() that can test hypotheses about 1 proportion and create a confidence interval.
1. Give it a try to see if the results match those in the paper.
```
binom.test(9, 14)
```
2. What command would you use for the other interval? Check that it matches.
There is also a p-value in the abstract, but it isn’t clear to me just what test that is from.
1. What is the null hypothesis for this test?
2. What is the conclusion? (Interpret the p-value.)
3. Why can’t we do a 2-proportion test here?
4. Why can’t we use a Chi-squared test here?
5. The obvious thing (for us) to do here is randomization. As with the binomial test, this situation is simple enough that the null distributon can be determined mathematically without doing randomization. The resulting test is called Fisher’s test or Fisher’s Exact Test. You can do that in R, too. But the p-value doesn’t quite match. (It is quite close.) Give it a try.
```
fisher.test(rbind(c(9,5), c(0,6)))
```
  Note: if two proportions are the same, then the odds are the same, so the ratio of the odds will be 1. That explains how R is describing the alternative hypothesis here.

Alzheimers and \(A \beta\).

https://doi.org/10.1073/pnas.90.20.9649 [There is a link to the full paper as a PDF]

Read the abstract.
1. What are the cases in this study?
2. Why did the researchers choose to look at the APOE gene?
3. What are the most common alleles for this gene? What are the most common genotypes?
Table 1 summarizes information from many hypothesis tests.
1. What is the “most significant” result in the table (the one with the smallest p-value)?
2. What variables were used for this test? Categorical or quantitative?
3. What type of test do you think the researchers did?
4. Find the paragraph(s) in the Results section that correspond to this result. Do we learn anything more about this test from the text than was summarized in the table?
Note:
- The results section is filled with short paragraphs describing sections of Table 1.
This paper uses the older style of indicating (approximate) p-values with symbols rather than just presenting the p-values themselves. The hunt for these symbols among many tests is sometimes called “star gazing” because stars are commonly used markers (more stars for more significance). This can be problematic if you don’t adjust for the number of tests involved.

A paper includes 20 tests. Any test with a p-value below 0.05 is marked with two stars. Suppose that the null hypothesis is true for each one of these tests, what is the probability that none of them will be marked with a star? What is the probability that at least one of them will be marked (even though all the null hypotheses are true)?
How is the previous question related to this xkcd comic? https://xkcd.com/882/
How is this related to Tukey’s Honest Significant Differences (TukeyHSD() in R)?

Gratitude

“Counting Blessings Versus Burdens: An Experimental Investigation of Gratitude and Subjective Well-Being in Daily Life” (R.A. Emmons and M.E. McCullough)

This paper describes 3 separate Experiments. Pick ONE to look at.

Summarize the methods and results of your experiment as best you can.
This paper uses statistical methods that we have not covered in class. How can you assess the validity of the findings, and what the results mean, even without understanding the methods in detail? What do you focus on, and how do you judge whether you agree with the authors’ interpretations? This is an important skill, because there will always be statistical methods you aren’t familiar with in some of the papers you need to read!
There are few figures in the paper, and the tables are quite simple. Is this good or bad? Are there things that could have been added or improved?
Why, in your faith life, do you think you should practice gratitude? Do you have particular scripture passages or other references that guide you on this point?
Why, according to this article, should a person should practice gratitude?
How do you think the two answers above might work together, or contradict each other, or illuminate different aspects of human experience?

Seabirds and an oil spill

Please read the scientific paper below, and also a short corresponding news piece that covered it:

“Long-term reproductive impairment in a seabird after the Prestige oil spill” (Barros et al., https://doi.org/10.1098/rsbl.2013.1041)
“Bird reproduction collapsed after oil spill” (https://www.nature.com/news/bird-reproduction-collapsed-after-oil-spill-1.15130)

As you read, you may come across statistical methods and details that you do not yet understand. Don’t worry about figuring out every term and abbreviation. Do keep your eyes peeled for things you do already know about. Here, you will focus on randomization tests and standardized test statistics of the form \[\frac{\mbox{data value} \; - \; \mbox{hypothesis value}}{\mbox{SE}}\]

In the article, there are 3 examples of “randomization tests” – that is, hypothesis tests that were carried out by creating randomization distributions, then computing p-values (like we have done many times).

For ONE of them, do the following.
1. Report what you think are the null and alternate hypotheses.
2. Say whether \(H_0\) is rejected.
3. State in words, in terms of birds and oil, what that means.
The Barros et al. article reports,

Indeed, in the oiled area the number of chicks fledged per pair was reduced by 40% after the spill (randomization test, p = 0.0019; figure 1d).

Try to think of one or more lurking variables that might affect this result. That is, what else might have changed between those two time periods, other than just the presence/absence of the oil. (See Discussion for ideas if needed.)
The news article says that

Reproductive success was 45% lower in oiled populations compared with unoiled colonies, whereas it had been much the same before the spill. The researchers measured reproductive success by counting how many fully grown young emerged from each nest. This number averaged 1.6 for both oiled and control colonies before the spill. Afterwards, while the control colonies maintained the 1.6 figure, the number for the birds in the oiled colonies dropped to 1.0.

Choose 2 claims from this statement and try to match them up with the corresponding tests/p-values/confidence intervals from the research article.

Covid-19 Vaccines

Getting good information about the testing of the Covid-19 vaccines is a bit challenging. You’ve probably heard numbers like 90% or 95% effective, but what does that mean, how is it calculated, and how accurately do we know that number. (You should know enough now to expect that there should be a confidence interval to quantify the precision of this estimate.)

Here are some press releases – information put out by the companies are released to news agencies.

And here are some interesting background articles

The ACHI article says

Strictly speaking, “efficacy” and “effectiveness” have different meanings in the context of vaccine development, despite often being used interchangeably in news stories.

What is the difference between “efficacy” and “effectiveness”?
The information you just read still doesn’t tell you what number is calculated for efficacy. This paper about flu vaccines gives us enough information to figure it out. It says:

In 1997-1998, 3 (2.2%) of 138 vaccine recipients and 6 (4.4%) of 137 placebo recipients had laboratory-confirmed influenza illness (vaccine efficacy, 50%; P = .33). In 1998-1999, 2 (1%) of 141 vaccine recipients and 14 (10%) of 137 placebo recipients had influenza illness (vaccine efficacy, 86%; P = .001). Vaccine efficacy was 89% (P = .001) against influenza A/Sydney/5/97 and 60% (P = .06) against influenza B/Beijing/184/93.
1. Show how to calculate the first two vaccine efficacies listed (50% and 86%).
  - Hint: How are the proportions related to efficacy?
  - Hint: There is a Wikipedia page for “Vaccine Efficcy” if you get stuck.
  - Note: Use the raw counts in your calculation to avoid round-off errors.
2. What happens to efficacy if there are more people in the vaccine group that get the flu than in the placebo group?
3. By each of those two efficacies is a p-value. What is the null hypothesis for this p-value? (What would the efficacy be if the vaccine had no effect?)

Let’s get R to calculate that efficacy for us. First, let’s create a data frame with the data.

Flu1999 <-
  bind_rows(
    do(2)   * tibble(vaccine = "yes", flu = "flu"),
    do(139) * tibble(vaccine = "yes", flu = "no flu"),
    do(14)  * tibble(vaccine = "no",  flu = "flu"),
    do(123) * tibble(vaccine = "no", flu = "no flu")
  )
tally(flu ~ vaccine, data = Flu1999)

##         vaccine
## flu       no yes
##   flu     14   2
##   no flu 123 139

Create Flu1998 in a similar way. Use tally() to make sure you are getting the correct table of values.

Next, let’s write a function to compute efficacy from a 2x2 table of the form \(\begin{matrix}a & c \\ b & d\end{matrix}\). You just need to edit the last line below. (Currently it is computing the difference in proportions.)

efficacy <- function(x) {
  a <- x[1,1]
  b <- x[2,1]
  c <- x[1,2]
  d <- x[2,2]
  # change the next line so it compute efficacy instead of diffprop
  (c / (c + d)) - (a / (a + b))
}

Check that it computes the correct values on Flu1998 and Flu1999

tally(flu ~ vaccine, data = Flu1999) %>% efficacy()

## [1] 0.8611955

tally(flu ~ vaccine, data = Flu1998) %>% efficacy()

## [1] 0.4929078

Now you have everything you need to create randomization and bootstrap distributions. Use them to compute the p-value and confidence interval for each year. (Don’t forget to use tally().)
Now let’s do the same for one of the Covid-19 vaccines. To do that, you will need to look in one of the press releases to find out how many cases there were in each group (vaccinated and unvaccinated) and how many people there were in each group. If the press release only gives the total number of people, you may assume that the two groups had the same size. (That’s probably not exactly correct, but it is likely very close.)
Based on your results, do you think it is fair to advertise efficacy the way the vaccine producers have been doing? Why or why not?

Notes:

You can create new methods for p-values and confidence intervals for all sorts of situations in this same way. You just need a function to calculate the appropriate test statistic or estimate and a way to do the appropriate simulation. The general approach is very flexible. (The formula method would require you to find or derive a new formula, something that is typically much harder.)
The R package epitools can automate most of what we just did, but it works with relative risk (the ratio of two proportions) rather than efficacy. Difference in proportions, efficacy, relative risk, and odds ratio (the ratio of two odds) are four different ways to compare two proportions.