knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggformula)
theme_set(theme_bw())
As you might imagine, keeping a comprehensive list of all COVID-19 cases worldwide involves pulling data from numerous sources (and frequent updating). Fortunately, some folks at Johns Hopkins have been doing that work and putting the resulting data into a GitHub repository that anyone can access.
You could clone the repository, but you can also just pull the data directly from it. They’ve split the data into “daily reports” (one CSV per day, all measures) and “time series” (one CSV per measure, all days). Here is an example page showing one of the data sets available to you.
GitHub renders CSVs in a fancy way, but you can get the plain old CSV by clicking the Raw button. We’re mostly interested in the URL for this file, since that will let us pull the data into R. This repo is well organized, and the file names have been chosen systematically (if somewhat long). We can create URLs for several different files all in one go by taking advantage of the naming scheme.
base_url <-   # it's long, so I'm pasting it together to avoid ugly long lines
  paste0('https://raw.githubusercontent.com/',
         'CSSEGISandData/COVID-19/master/',
         'csse_covid_19_data/csse_covid_19_time_series/')
# not done yet; now we need to add the file names from that folder
filename <-
  paste0('time_series_covid19_', c('confirmed', 'deaths', 'recovered'), '_global.csv')
# OK. Let's put it all together
url <- paste0(base_url, filename)
url
## [1] "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
## [2] "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
## [3] "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv"
Let’s read in the first one. (We could be fancier and read all three at once; a sketch of that follows after this code.)
Confirmed <- url[1] %>%
  read_csv(col_types = cols(
    .default = col_double(),
    `Province/State` = col_character(),
    `Country/Region` = col_character()
  )) %>%
  rename(country_or_region = `Country/Region`)  # renaming avoids needing backticks
                                                # (and works around a bug in ggformula)
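If we did want to read all three at once, here is a minimal sketch using purrr (loaded with the tidyverse). The labels confirmed, deaths, and recovered are my own names for the list elements, not something in the files.
all_series <- url %>%
  purrr::set_names(c("confirmed", "deaths", "recovered")) %>%
  purrr::map(read_csv, col_types = cols(
    .default = col_double(),
    `Province/State` = col_character(),
    `Country/Region` = col_character()
  ))
# all_series is now a named list of three data frames;
# all_series$confirmed holds the same data as Confirmed (before renaming)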
Let’s take a look at what we have:
reactable::reactable(Confirmed, searchable = TRUE)
Notice that each day’s count of confirmed cases is in a separate column. For some purposes, it would be nicer to have a column for date and a column for count instead. That will make for a much longer (and narrower) data set. We can get that using tidyr::pivot_longer().
Confirmed_long <-
  Confirmed %>%
  pivot_longer(
    -(1:4),                   # the first 4 columns are not part of the pivot
    names_to = "date",        # names of the remaining columns go into a date column
    values_to = "confirmed"   # values go into a column called confirmed
  ) %>%
  mutate(date = lubridate::mdy(date))   # convert strings like "1/22/20" into date objects
reactable::reactable(Confirmed_long)
Here are two relatively simple plots.
To see the worldwide number of confirmed cases over time, we can group the data by date and add up all the confirmed counts.
Confirmed_long %>%
  group_by(date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  gf_point(confirmed ~ date)
To compare a handful of countries, we can also group by country and use facets (with free scales, since counts differ by orders of magnitude across countries).
Confirmed_long %>%
  filter(country_or_region %in%
           c("US", "China", "Japan", "Korea, South", "Italy", "Germany", "Spain")) %>%
  group_by(country_or_region, date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  gf_point(confirmed ~ date) %>%
  gf_facet_grid(country_or_region ~ ., scales = "free")
Note: This assignment is not meant to increase your stress level. If working with COVID-19 is stressful for you, contact the instructors about using some other data.
Create a plot or plots using data from the JHU GitHub repository or some other source. Do something more interesting than the two basic plots above.
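For example, here is a sketch of one small step beyond cumulative totals: daily new cases for a single country, computed with lag(). The choice of the US and of a bar-style plot is just for illustration.
Confirmed_long %>%
  filter(country_or_region == "US") %>%
  group_by(date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  mutate(new_cases = confirmed - lag(confirmed, default = 0)) %>%  # day-over-day change
  gf_col(new_cases ~ date)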
If you need inspiration, there are lots of plots (both good and bad) online. Find a good one and see if you can replicate it. Find a bad one and see if you can do something better. (Share the sites you find in our MS Team chat.)
You don’t need to use the JHU data. Feel free to use data you locate in other places. Post those places in the MS Team as well.
You are encouraged to use data from multiple files/sources. dplyr’s join functions (left_join(), inner_join(), and their cousins) are useful for combining data, but be careful if you use data from multiple sites: they may code things (like country names) in different ways.
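To illustrate that pitfall, here is a sketch. Suppose we had a populations data frame that spells country names differently than JHU does; we would need to reconcile the names before joining. The lookup table below is hypothetical and its values are only roughly accurate, purely for illustration.
# hypothetical lookup table; population values are approximate
populations <- tibble::tribble(
  ~country,        ~population,
  "United States", 331e6,
  "South Korea",   51.7e6
)
Confirmed_long %>%
  mutate(country_or_region = recode(country_or_region,
                                    "US" = "United States",
                                    "Korea, South" = "South Korea")) %>%
  left_join(populations, by = c("country_or_region" = "country"))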
You may use ggformula, ggplot2, or some other tool to create your plot. If you’re already comfortable with R’s plotting tools, you could take this opportunity to try a new tool like Charticulator or plotly, a JavaScript plotting library with wrappers in R and Python.
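For instance (assuming the plotly package is installed), an existing ggformula/ggplot2 plot can be made interactive with a single extra call:
p <- Confirmed_long %>%
  group_by(date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  gf_point(confirmed ~ date)
plotly::ggplotly(p)   # interactive version of the same plot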
Copy and paste your favorite plot(s) into a new page on this Google Doc.
Write a little bit about your plot. At a minimum, include:
Add your plot to our shared repository. Make a folder for yourself under vis and add your rendered plot there, with source code if applicable.
Read this article about responsible visualization of COVID-19 data. Think about how it relates to our earlier conversations about data ethics and respond to the following questions:
What were the most interesting or thought provoking things for you from this article?
How does your plot for HW 9 stack up against the 10 considerations?
Where do you do well and not so well (according to the author)? Are there any weaknesses that would be within your power to improve on, or did you do the best you could in your situation? [You don’t need to do a blow-by-blow of all 10 points; pick the most important/interesting ones.]
People matter. In what ways does the author say that two kinds of people matter: the audience and collaborators?
Take another look at the ACM Code of Ethics (poster version here: http://www.acm.org/binaries/content/assets/membership/images2/fac-stu-poster-code.pdf). Find one or two points of alignment between the two (places where the article reflects aspects of the code).