Set Up

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggformula)
theme_set(theme_bw())

JHU COVID-19 data

As you might imagine, keeping a comprehensive list of all COVID-19 cases worldwide involves pulling data from numerous sources (and frequent updating). Fortunately, some folks at Johns Hopkins have been doing that work and putting the resulting data into a GitHub repository that anyone can access.

Finding some data

You could clone the repository, but you can also just pull the data directly from their repository. They’ve split the data into “daily reports” (one CSV per day, all measures) and “time series” (one CSV per measure, all days). Here is an example page showing one of the data sets available to you.

We want raw data

GitHub renders CSVs in a fancy way, but you can get the plain old CSV if you click the Raw button. We’re mostly interested in the URL for this file, since that will let us pull the data into R. This repo is well organized, and the file names follow a systematic (if long) pattern. We can create URLs for several files in one go by taking advantage of the naming scheme.

base_url <-           # it's long, so I'm pasting it together to avoid ugly long lines
  paste0('https://raw.githubusercontent.com/',
         'CSSEGISandData/COVID-19/master/',
         'csse_covid_19_data/csse_covid_19_time_series/')

# not done yet, now we need to add the file name from that folder
filename <- 
  paste0('time_series_covid19_', c('confirmed', 'deaths', 'recovered'), '_global.csv')

# OK. Let's put it all together
url <- paste0(base_url, filename)
url
## [1] "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
## [2] "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"   
## [3] "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"

Let’s read in the first one. (We could be fancier and read all three at once.)

Confirmed <- url[1] %>% 
  read_csv(col_types = cols(
    .default = col_double(),
    `Province/State` = col_character(),
    `Country/Region` = col_character()
  )) %>%
  rename(country_or_region = `Country/Region`) # renaming avoids needing backticks
                                               # also avoids a bug in ggformula
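
As a sketch of the "fancier" option mentioned above, the same read-and-rename step can be wrapped in a function and mapped over all three URLs. The helper names `read_series()` and `read_all_series()` are made up for this illustration, and the sketch assumes all three files share the column layout of the confirmed file.

```r
library(readr)
library(purrr)
library(dplyr)

# Hypothetical helper (not part of the original code): read one time-series
# CSV using the same column types and rename as for Confirmed above.
read_series <- function(path) {
  read_csv(path, col_types = cols(
    .default = col_double(),
    `Province/State` = col_character(),
    `Country/Region` = col_character()
  )) %>%
    rename(country_or_region = `Country/Region`)
}

# Hypothetical helper: read several files into a named list of data frames,
# one element per measure.
read_all_series <- function(paths, measures) {
  paths %>%
    set_names(measures) %>%
    map(read_series)
}

# Usage with the `url` vector built earlier (needs network access):
# AllSeries <- read_all_series(url, c("confirmed", "deaths", "recovered"))
# AllSeries$deaths
```

Keeping the result as a named list means each measure stays in its own data frame until you decide how (or whether) to combine them.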

Exploring the data

Let’s take a look at what we have:

reactable::reactable(Confirmed, searchable = TRUE)

Notice that each day’s count of confirmed cases is in a separate column. For some purposes, it would be nicer to have a column for date and a column for count instead. That will make for a much longer (and narrower) data set. We can get that using tidyr::pivot_longer().

Confirmed_long <-
  Confirmed %>% 
  pivot_longer(
    -(1:4),                        # the first 4 columns are not part of the pivot
    names_to = "date",             # names of the remaining columns will be put into a date column
    values_to = "confirmed") %>%   # values will be put into a column called confirmed
  mutate(date = lubridate::mdy(date))  # convert strings like "1/22/20" to Date objects

reactable::reactable(Confirmed_long)
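
To see what the pivot does without downloading anything, here is a minimal sketch on a tiny made-up table (the counts are invented, not JHU data). It uses `lubridate::mdy()` as one way to turn the month/day/year column names into dates.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Made-up counts for two places over three days (not real JHU data)
toy_wide <- tribble(
  ~country_or_region, ~`1/22/20`, ~`1/23/20`, ~`1/24/20`,
  "A",                1,          3,          4,
  "B",                0,          2,          5
)

toy_long <- toy_wide %>%
  pivot_longer(
    -country_or_region,            # every column except this one gets pivoted
    names_to = "date",             # old column names go into `date`
    values_to = "confirmed") %>%   # old cell values go into `confirmed`
  mutate(date = lubridate::mdy(date))

toy_long   # one row per place per day
```

The wide table has 2 rows and 4 columns; the long one has 6 rows (2 places times 3 days) and 3 columns.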

Plotting the data

Here are two relatively simple plots.

Worldwide cases

To see the worldwide number of confirmed cases over time, we can group the data by date and add up all the confirmed counts.

Confirmed_long %>%
  group_by(date) %>% 
  summarise(confirmed = sum(confirmed)) %>% 
  gf_point(confirmed ~ date)

Cases by Country

Confirmed_long %>%
  filter(country_or_region %in% 
           c("US", "China", "Japan", "Korea, South", "Italy", "Germany", "Spain")) %>%
  group_by(country_or_region, date) %>% 
  summarise(confirmed = sum(confirmed)) %>% 
  gf_point(confirmed ~ date) %>%
  gf_facet_grid(country_or_region ~ ., scales = "free")

Your Turn

Note: This assignment is not meant to increase your stress level. If working with COVID-19 is stressful for you, contact the instructors about using some other data.

  1. Create a plot or plots using data from the JHU github repository or some other source. Do something more interesting than the two basic plots above.

    1. If you need inspiration, there are lots of plots (both good and bad) online. Find a good one and see if you can replicate it. Find a bad one and see if you can do something better. (Share the sites you find in our MS Team chat.)

    2. You don’t need to use the JHU data. Feel free to use data you locate in other places. Post those places in the MS Team as well.

    3. You are encouraged to use data from multiple files/sources. dplyr::left_join() and its cousins are useful for combining data, but be careful if you use data from multiple sites and they code things (like country names) in different ways.

    4. You may use ggformula, ggplot2, or some other tool to create your plot. If you’re already comfortable with R’s plotting tools, you could take this opportunity to try a new tool like Charticulator or plotly, a JavaScript plotting library with wrappers in R and Python.

  2. Copy and paste your favorite plot(s) into a new page on this Google Doc.

  3. Write a little bit about your plot. At minimum, include:

    1. Your name.
    2. Sources for your data.
    3. What interesting things you can learn from your data visualization. (Tell a story.)
    4. Ways you wish you could visualize the data even better, but can’t (either because you don’t know how to do it in R or some other tool, or because you can’t find some data that you would need).
  4. Add your plot to our shared repository. Make a folder for yourself under vis and add your rendered plot there, with source code if applicable.

  5. Read this article about responsible visualization of Covid-19 data. Think about how this relates to our earlier conversations about data ethics and respond to the following questions:

    1. What were the most interesting or thought-provoking things for you from this article?

    2. How does your plot for HW 9 stack up against the 10 considerations?

      Where do you do well and not so well (according to the author)? Are there any weaknesses that would be within your power to improve on, or did you do the best you could in your situation? [You don’t need to do a blow-by-blow of all 10 points; pick the most important/interesting ones.]

    3. People matter. In what ways does the author say that two kinds of people matter: the audience and collaborators?

    4. Take another look at the ACM Code of Ethics (poster version here: http://www.acm.org/binaries/content/assets/membership/images2/fac-stu-poster-code.pdf). Find one or two points of alignment between the two (places where the article reflects aspects of the code).
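
On the warning in item 1.3 about sources that code country names differently: a quick way to catch mismatches before they silently drop rows is to run `anti_join()` first. The sketch below uses two tiny made-up tables (all values are invented for illustration); the column names `country` and `population_millions` are assumptions, not anything from the JHU files.

```r
library(tibble)
library(dplyr)

# Made-up values; the two sources spell one country differently
cases <- tribble(
  ~country,       ~confirmed,
  "Korea, South", 100,
  "Italy",        200
)
population <- tribble(
  ~country,      ~population_millions,
  "South Korea", 52,
  "Italy",       60
)

# Rows of `cases` with no partner in `population`: inspect these first
unmatched <- anti_join(cases, population, by = "country")
unmatched

# Recode one source to match the other, then join
population_fixed <- population %>%
  mutate(country = if_else(country == "South Korea", "Korea, South", country))

combined <- left_join(cases, population_fixed, by = "country")
combined
```

Without the recoding step, the left join would still run, but the row for "Korea, South" would get a missing population value.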