This page provides a number of mini-examples demonstrating how to do specific tasks. Use the menu to select the topic you are interested in to view an example related to that topic.
Note: Depending on your browser, you may need to scroll to see all of the menu items.
To use a data set in a package, you must either load the entire package or select just the data set you need.
library(mosaicData) # load a package (and all its datasets)
data(KidsFeet, package = "mosaicData") # grab just one data set from a package
find("KidsFeet") # see if there is a loaded package w/ KidsFeet
[1] ".GlobalEnv" "package:mosaicData"
To use a dataset in an R Markdown document, you must load the package in the R Markdown document. A good place for this is in the setup chunk.
Some packages, including oibiostat, require you to use data() any time you want to load a data set, but most will make all their data sets available as soon as you load the package.
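If your data live in a CSV file, read.csv() will import them. A sketch, assuming a CSV version of the campus crime data at a path like the one used for the Excel example below:

Crime <- read.csv("../../data/CampusCrime.csv")  # path is an assumption
names(Crime)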
[1] "region" "type" "enrollment"
[4] "property_crimes" "violent_crimes" "property_crime_rate"
[7] "violent_crime_rate" "overall_crime"
In RStudio you can also use “Import Dataset” in the Environment pane.
library(readxl) # or require(readxl)
Crime2 <- read_excel("../../data/CampusCrime.xlsx", sheet = 1)
names(Crime2)
[1] "region" "type" "enrollment"
[4] "property_crimes" "violent_crimes" "property_crime_rate"
[7] "violent_crime_rate" "overall_crime"
In RStudio you can also use “Import Dataset” in the Environment pane.
If you have data in a Google Spreadsheet, you have some options.
You can export the file as a CSV or Excel file from Google and then import that into R.
You can also load data directly from Google. First, in the File menu on Google Sheets, choose “Publish to the Web” and publish as a CSV file. Read in the file by providing the resulting URL (in quotes) to read.csv().
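A minimal sketch, assuming you have published the sheet as CSV and copied the resulting URL (the URL below is a placeholder, not a real address):

MyData <- read.csv("https://docs.google.com/spreadsheets/d/e/.../pub?output=csv")
head(MyData)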
There is also a googlesheets4 package which provides direct access to Google spreadsheets. The details for using Google Sheets change from time to time, so it’s best to consult the googlesheets4 package documentation.
There are a number of different ways you can inspect your data to learn about it.
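glimpse() (from dplyr) shows one row per variable; a sketch assuming the campus crime data is named Crime:

library(dplyr)
glimpse(Crime)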
Rows: 81
Columns: 8
$ region <chr> "SE", "SE", "W", "W", "SW", "SW", "W", "W", "C", …
$ type <chr> "U", "C", "U", "C", "U", "U", "U", "C", "U", "C",…
$ enrollment <dbl> 5590, 540, 35747, 28176, 10568, 3127, 20675, 1254…
$ property_crimes <dbl> 266, 10, 1233, 210, 116, 29, 284, 43, 660, 94, 93…
$ violent_crimes <dbl> 30, 0, 23, 1, 1, 0, 7, 0, 19, 4, 3, 6, 26, 23, 7,…
$ property_crime_rate <dbl> 475.84973, 185.18519, 344.92405, 74.53152, 109.76…
$ violent_crime_rate <dbl> 53.6672630, 0.0000000, 6.4341064, 0.3549120, 0.94…
$ overall_crime <dbl> 296, 10, 1256, 211, 117, 29, 291, 43, 679, 98, 96…
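inspect() (from the mosaic package) gives separate summaries for categorical and quantitative variables; a sketch of a call that produces output like the display below:

library(mosaic)
inspect(Crime)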
categorical variables:
name class levels n missing
1 region character 6 81 0
2 type character 2 81 0
distribution
1 NE (25.9%), C (21%), SE (18.5%) ...
2 U (65.4%), C (34.6%)
quantitative variables:
name class min Q1 median
...1 enrollment numeric 540.000000 4638.0000000 11321.0000
...2 property_crimes numeric 8.000000 64.0000000 120.0000
...3 violent_crimes numeric 0.000000 1.0000000 3.0000
...4 property_crime_rate numeric 8.639309 94.2211055 141.9499
...5 violent_crime_rate numeric 0.000000 0.7491759 2.4888
...6 overall_crime numeric 9.000000 65.0000000 127.0000
Q3 max mean sd n missing
...1 22396.000000 46597.00000 13899.308642 10501.866624 81 0
...2 293.000000 1293.00000 240.790123 270.200653 81 0
...3 8.000000 30.00000 5.938272 7.404974 81 0
...4 208.581416 475.84973 160.981380 97.971006 81 0
...5 6.434106 53.66726 4.978590 7.906354 81 0
...6 301.000000 1311.00000 246.728395 275.486389 81 0
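Base R's summary() gives a quick overview as well; a sketch, again assuming the data frame is named Crime:

summary(Crime)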
region type enrollment property_crimes
Length:81 Length:81 Min. : 540 Min. : 8.0
Class :character Class :character 1st Qu.: 4638 1st Qu.: 64.0
Mode :character Mode :character Median :11321 Median : 120.0
Mean :13899 Mean : 240.8
3rd Qu.:22396 3rd Qu.: 293.0
Max. :46597 Max. :1293.0
violent_crimes property_crime_rate violent_crime_rate overall_crime
Min. : 0.000 Min. : 8.639 Min. : 0.0000 Min. : 9.0
1st Qu.: 1.000 1st Qu.: 94.221 1st Qu.: 0.7492 1st Qu.: 65.0
Median : 3.000 Median :141.950 Median : 2.4888 Median : 127.0
Mean : 5.938 Mean :160.981 Mean : 4.9786 Mean : 246.7
3rd Qu.: 8.000 3rd Qu.:208.581 3rd Qu.: 6.4341 3rd Qu.: 301.0
Max. :30.000 Max. :475.850 Max. :53.6673 Max. :1311.0
If you are working interactively (not in an R Markdown document), you might try using View(). This gives you a spreadsheet-like view that you can sort and filter (but cannot edit).
The result will look something (but not quite exactly) like this in RStudio:
pander() prints a “pretty” version of your data set and works in both PDF and HTML formats.
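A sketch, assuming the KidsFeet data from mosaicData:

library(pander)
library(mosaicData)
pander(KidsFeet)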
name | birthmonth | birthyear | length | width | sex | biggerfoot | domhand |
---|---|---|---|---|---|---|---|
David | 5 | 88 | 24.4 | 8.4 | B | L | R |
Lars | 10 | 87 | 25.4 | 8.8 | B | L | L |
Zach | 12 | 87 | 24.5 | 9.7 | B | R | R |
Josh | 1 | 88 | 25.2 | 9.8 | B | L | R |
Lang | 2 | 88 | 25.1 | 8.9 | B | L | R |
Scotty | 3 | 88 | 25.7 | 9.7 | B | R | R |
Edward | 2 | 88 | 26.1 | 9.6 | B | L | R |
Caitlin | 6 | 88 | 23 | 8.8 | G | L | R |
Eleanor | 5 | 88 | 23.6 | 9.3 | G | R | R |
Damon | 9 | 88 | 22.9 | 8.8 | B | R | L |
Mark | 9 | 87 | 27.5 | 9.8 | B | R | R |
Ray | 3 | 88 | 24.8 | 8.9 | B | L | R |
Cal | 8 | 87 | 26.1 | 9.1 | B | L | R |
Cam | 3 | 88 | 27 | 9.8 | B | L | R |
Julie | 11 | 87 | 26 | 9.3 | G | L | R |
Kate | 4 | 88 | 23.7 | 7.9 | G | R | R |
Caroline | 12 | 87 | 24 | 8.7 | G | R | L |
Maggie | 3 | 88 | 24.7 | 8.8 | G | R | R |
Lee | 6 | 88 | 26.7 | 9 | G | L | L |
Heather | 3 | 88 | 25.5 | 9.5 | G | R | R |
Andy | 6 | 88 | 24 | 9.2 | B | R | R |
Josh | 7 | 88 | 24.4 | 8.6 | B | L | R |
Laura | 9 | 88 | 24 | 8.3 | G | R | L |
Erica | 9 | 88 | 24.5 | 9 | G | L | R |
Peggy | 10 | 88 | 24.2 | 8.1 | G | L | R |
Glen | 7 | 88 | 27.1 | 9.4 | B | L | R |
Abby | 2 | 88 | 26.1 | 9.5 | G | L | R |
David | 12 | 87 | 25.5 | 9.5 | B | R | R |
Mike | 11 | 88 | 24.2 | 8.9 | B | L | R |
Dwayne | 8 | 88 | 23.9 | 9.3 | B | R | L |
Danielle | 6 | 88 | 24 | 9.3 | G | L | R |
Caitlin | 7 | 88 | 22.5 | 8.6 | G | R | R |
Leigh | 3 | 88 | 24.5 | 8.6 | G | L | R |
Dylan | 4 | 88 | 23.6 | 9 | B | R | L |
Peter | 4 | 88 | 24.7 | 8.6 | B | R | L |
Hannah | 3 | 88 | 22.9 | 8.5 | G | L | R |
Teshanna | 3 | 88 | 26 | 9 | G | L | R |
Hayley | 1 | 88 | 21.6 | 7.9 | G | R | R |
Alisha | 9 | 88 | 24.6 | 8.8 | G | L | R |
Don’t forget na.rm = TRUE if you have missing data.
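A sketch using the palmerpenguins data (which has a few missing bill lengths); the same formula works for sd():

library(mosaic)
library(palmerpenguins)
mean(bill_length_mm ~ species, data = penguins, na.rm = TRUE)
sd(bill_length_mm ~ species, data = penguins, na.rm = TRUE)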
Adelie Chinstrap Gentoo
38.79139 48.83382 47.50488
Adelie Chinstrap Gentoo
38.79139 48.83382 47.50488
Adelie Chinstrap Gentoo
38.79139 48.83382 47.50488
Adelie Chinstrap Gentoo
2.663405 3.339256 3.081857
Adelie Chinstrap Gentoo
2.663405 3.339256 3.081857
Adelie Chinstrap Gentoo
2.663405 3.339256 3.081857
You have three options:
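A hedged sketch of what the three options might look like (using the iris data; all three give the same result):

sd(Sepal.Length ~ Species, data = iris)
sd( ~ Sepal.Length | Species, data = iris)
sd( ~ Sepal.Length, groups = Species, data = iris)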
setosa versicolor virginica
0.3524897 0.5161711 0.6358796
setosa versicolor virginica
0.3524897 0.5161711 0.6358796
setosa versicolor virginica
0.3524897 0.5161711 0.6358796
var() works just like sd() and computes the variance (the square of the standard deviation).
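For example:

var( ~ Sepal.Length, data = iris)          # overall variance
var(Sepal.Length ~ Species, data = iris)   # variance within each species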
[1] 0.6856935
setosa versicolor virginica
0.1242490 0.2664327 0.4043429
setosa versicolor virginica
0.1242490 0.2664327 0.4043429
setosa versicolor virginica
0.1242490 0.2664327 0.4043429
Other statistics work just like mean and standard deviation.
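A sketch (using iris again; these mosaic functions all accept the same formula interface and na.rm argument):

median( ~ Sepal.Length, data = iris)
iqr( ~ Sepal.Length, data = iris)
max( ~ Sepal.Length, data = iris)
min( ~ Sepal.Length, data = iris)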
[1] 5.8
[1] 1.3
[1] 7.9
[1] 4.3
df_stats() provides a convenient way to produce numerical summaries in a tabular format. After the formula and data, just list the summary statistics you want.
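A sketch that produces a table like the one below (bill depths from palmerpenguins):

library(palmerpenguins)
df_stats( ~ bill_depth_mm, data = penguins, mean, sd, min, max)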
response mean sd min max
1 bill_depth_mm 17.15117 1.974793 13.1 21.5
You can choose custom names if you like.
library(palmerpenguins)
df_stats( ~ bill_length_mm, data = penguins,
x.bar = mean, sd = sd, shortest = min, longest = max)
response x.bar sd shortest longest
1 bill_length_mm 43.92193 5.459584 32.1 59.6
We can compute our statistics within groups.
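A sketch with the grouping variable on the right of the formula (the column names are assumptions chosen to match the output below):

df_stats(bill_length_mm ~ species, data = penguins, x_bar = mean, sd = sd)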
response species x_bar sd
1 bill_length_mm Adelie 38.79139 2.663405
2 bill_length_mm Chinstrap 48.83382 3.339256
3 bill_length_mm Gentoo 47.50488 3.081857
response species x_bar sd
1 bill_length_mm Adelie 38.79139 2.663405
2 bill_length_mm Chinstrap 48.83382 3.339256
3 bill_length_mm Gentoo 47.50488 3.081857
For categorical data, we probably want to compute counts or proportions.
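A hedged sketch using the counts and props helpers with df_stats(); the second and third calls assume the HELPrct data from mosaicData:

df_stats( ~ species, data = penguins, counts, props)
df_stats( ~ substance, data = HELPrct, counts, props)
df_stats(substance ~ sex, data = HELPrct, counts, props)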
response n_Adelie n_Chinstrap n_Gentoo prop_Adelie prop_Chinstrap prop_Gentoo
1 species 152 68 124 0.4418605 0.1976744 0.3604651
response n_alcohol n_cocaine n_heroin prop_alcohol prop_cocaine prop_heroin
1 substance 177 152 124 0.3907285 0.3355408 0.2737307
response sex n_alcohol n_cocaine n_heroin prop_alcohol prop_cocaine
1 substance female 36 41 30 0.3364486 0.3831776
2 substance male 141 111 94 0.4075145 0.3208092
prop_heroin
1 0.2803738
2 0.2716763
We can compute summary statistics for multiple response variables at once.
library(palmerpenguins)
df_stats( bill_length_mm + bill_depth_mm ~ species, data = penguins, mean, sd, median, iqr)
response species mean sd median iqr
1 bill_length_mm Adelie 38.79139 2.6634048 38.80 4.000
2 bill_length_mm Chinstrap 48.83382 3.3392559 49.55 4.725
3 bill_length_mm Gentoo 47.50488 3.0818574 47.30 4.250
4 bill_depth_mm Adelie 18.34636 1.2166498 18.40 1.500
5 bill_depth_mm Chinstrap 18.42059 1.1353951 18.45 1.900
6 bill_depth_mm Gentoo 14.98211 0.9812198 15.00 1.500
If you don’t provide any statistics, you get some favorites:
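For example:

df_stats( ~ bill_length_mm, data = penguins)
df_stats(bill_length_mm ~ species, data = penguins)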
response min Q1 median Q3 max mean sd n missing
1 bill_length_mm 32.1 39.225 44.45 48.5 59.6 43.92193 5.459584 342 2
response species min Q1 median Q3 max mean sd n
1 bill_length_mm Adelie 32.1 36.75 38.80 40.750 46.0 38.79139 2.663405 151
2 bill_length_mm Chinstrap 40.9 46.35 49.55 51.075 58.0 48.83382 3.339256 68
3 bill_length_mm Gentoo 40.9 45.30 47.30 49.550 59.6 47.50488 3.081857 123
missing
1 1
2 0
3 1
favstats() is similar to df_stats() without any statistics specified.
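For example:

favstats( ~ bill_length_mm, data = penguins)
favstats(bill_length_mm ~ species, data = penguins)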
min Q1 median Q3 max mean sd n missing
32.1 39.225 44.45 48.5 59.6 43.92193 5.459584 342 2
species min Q1 median Q3 max mean sd n missing
1 Adelie 32.1 36.75 38.80 40.750 46.0 38.79139 2.663405 151 1
2 Chinstrap 40.9 46.35 49.55 51.075 58.0 48.83382 3.339256 68 0
3 Gentoo 40.9 45.30 47.30 49.550 59.6 47.50488 3.081857 123 1
The correlation coefficient is a bit different because it requires two quantitative variables, so our basic formula shape is y ~ x.
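A sketch that reproduces the value below (KidsFeet foot width vs. length):

library(mosaicData)
cor(width ~ length, data = KidsFeet)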
[1] 0.6410961
[1] 0.6410961
Dealing with missing data when computing a correlation coefficient is a bit different from how we did it for our other numerical summaries because (a) there are two variables involved instead of just one, and (b) the cor() function is set up to compute many correlation coefficients all at once, so there are several different rules that could be used for handling missing data.
In short: instead of na.rm = TRUE, we use use = "complete.obs" or use = "pairwise.complete.obs".
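A hedged sketch using the Gestation data from mosaicData (mother’s and father’s heights, which have missing values):

cor(dht ~ ht, data = Gestation)                        # NA because of missing values
cor(dht ~ ht, data = Gestation, use = "complete.obs")  # drop incomplete pairs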
[1] NA
[1] 0.3408627
If a data frame has only numeric variables (or if we select only numeric variables), then cor() will compute the correlation coefficients of all pairs of variables at once and return them in a table.
Now 'complete.obs' and 'pairwise.complete.obs' might be different.
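A hedged sketch of the two versions, selecting five numeric variables from Gestation:

library(dplyr)
Gestation %>% select(gestation, wt, age, ht, dht) %>% cor(use = "complete.obs")
Gestation %>% select(gestation, wt, age, ht, dht) %>% cor(use = "pairwise.complete.obs")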
gestation wt age ht dht
gestation 1.00000000 0.39189771 -0.06264927 0.02960305 0.01128574
wt 0.39189771 1.00000000 0.03250378 0.21198453 0.10433742
age -0.06264927 0.03250378 1.00000000 0.01821807 -0.05911042
ht 0.02960305 0.21198453 0.01821807 1.00000000 0.34239424
dht 0.01128574 0.10433742 -0.05911042 0.34239424 1.00000000
gestation wt age ht dht
gestation 1.00000000 0.40785397 -0.056269348 0.063630776 0.00572266
wt 0.40785397 1.00000000 0.033648749 0.198518671 0.10771129
age -0.05626935 0.03364875 1.000000000 0.003050816 -0.05177558
ht 0.06363078 0.19851867 0.003050816 1.000000000 0.34086270
dht 0.00572266 0.10771129 -0.051775577 0.340862699 1.00000000
If we want to get fancy, we can turn this table into a colorized plot. The ggcorrplot package includes a function to do all the work for us. Here are some examples.
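A sketch, assuming the correlation matrix from above is saved as cor_matrix:

library(ggcorrplot)
cor_matrix <- Gestation %>% select(gestation, wt, age, ht, dht) %>% cor(use = "pairwise.complete.obs")
ggcorrplot(cor_matrix)
ggcorrplot(cor_matrix, type = "lower", lab = TRUE)   # lower triangle with labels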
Logical operators are often useful for creating proportions.
* == (note double equal for checking if things are equal)
* !=
* > (and <)
* >= (and <=)
prop_TRUE
0.3333333
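As a hedged illustration (not necessarily the code behind the output above), wrapping a logical expression in prop() computes the proportion of TRUE values:

prop( ~ (Species == "setosa"), data = iris)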
We can compute proportions within groups:
prop_TRUE.alcohol prop_TRUE.cocaine prop_TRUE.heroin
0.4180791 0.6118421 0.6209677
prop_TRUE.alcohol prop_TRUE.cocaine prop_TRUE.heroin
0.8587571 0.7302632 0.5967742
Often it is useful to summarise a categorical variable with a summary table listing the counts or proportions in each category.
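A sketch using the HELPrct data (counts, then proportions):

tally( ~ homeless, data = HELPrct)
tally( ~ homeless, data = HELPrct, format = "prop")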
homeless
homeless housed
209 244
homeless
homeless housed
0.4613687 0.5386313
Cross tables are summary tables for two or more variables at once. (Cross tables with 2 variables are often called 2-way tables.)
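For example:

tally(homeless ~ substance, data = HELPrct)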
substance
homeless alcohol cocaine heroin
homeless 103 59 47
housed 74 93 77
We can also express the values as proportions or percents rather than counts.
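For example:

tally(homeless ~ substance, data = HELPrct, format = "prop")
tally(homeless ~ substance, data = HELPrct, format = "percent")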
substance
homeless alcohol cocaine heroin
homeless 0.5819209 0.3881579 0.3790323
housed 0.4180791 0.6118421 0.6209677
substance
homeless alcohol cocaine heroin
homeless 58.19209 38.81579 37.90323
housed 41.80791 61.18421 62.09677
Cross tables are summary tables for two or more variables at once. We can also add in the category total counts.
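A hedged sketch of calls that produce tables like the ones below (margins = TRUE adds the totals):

tally(homeless ~ substance, data = HELPrct, margins = TRUE)
tally( ~ homeless + substance, data = HELPrct, margins = TRUE)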
substance
homeless alcohol cocaine heroin
homeless 103 59 47
housed 74 93 77
Total 177 152 124
substance
homeless alcohol cocaine heroin
homeless 103 59 47
housed 74 93 77
Total 177 152 124
substance
homeless alcohol cocaine heroin Total
homeless 103 59 47 209
housed 74 93 77 244
Total 177 152 124 453
The function tally() flexibly creates many simple tables and cross tables, but when included in R Markdown files, they are not rendered as the best-looking tables.
A simple solution is to use a pipe (%>%) to “send” the results from tally() to the function pander() (from the pander package) for formatting. Compare:
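For example:

library(pander)
tally(homeless ~ substance, data = HELPrct, margins = TRUE)
tally(homeless ~ substance, data = HELPrct, margins = TRUE) %>% pander()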
substance
homeless alcohol cocaine heroin
homeless 103 59 47
housed 74 93 77
Total 177 152 124
alcohol | cocaine | heroin | |
---|---|---|---|
homeless | 103 | 59 | 47 |
housed | 74 | 93 | 77 |
Total | 177 | 152 | 124 |
Data visualizations are created using a palette of attributes or properties such as position (the x- and y-axes), color, shape, and faceting.
The key to designing a visualization is deciding which of these attributes will be used to represent which variables in our data set.
Since every mark on our visualization must go somewhere, it is often good to start out by deciding what will be mapped to the x-axis and what will be mapped to the y-axis. Then we can bring in additional attributes to represent additional variables as needed.
We can take that approach as an outline for an anatomy of plots.

Q v Q:
* scatterplots (possibly jittered) are the go-to plot for looking at two quantitative variables.

Q v C:
* scatterplots – don't work well except for small data sets
* jittered scatterplots – see sina plots below for another option
* boxplots – simple but crude plots based on a five-number summary
* violin plots – mirrored density plots that show more of the shape
* sina plots – combine well with violin plots, showing a dot for each observation

C v C:
* stacked bar plots
* dodged bar plots
* mosaic plots
For some plots one axis (typically y) is calculated based on the values of the variable on the other axis.
Q:
* histogram (or density histogram)
* density plot (filled)
* density plot (open)
* frequency polygon
* dot plot – advantage: there is a dot representing each observation; disadvantage: it can be tricky to get all the sizing just right to make these plots look good.

C (counts or proportions):
* bar graphs
Each of these plots can be extended to show additional variables using color, shape, faceting, etc.
Scatter plots can be made with categorical variables as well as quantitative variables, but often there is a lot of overplotting, making it difficult to interpret.
This can often be improved by moving the dots left and right a bit (with no loss of information, since we can still tell what level each dot belongs to).
library(palmerpenguins)
gf_jitter(bill_length_mm ~ species, data = penguins, height = 0, width = 0.3)
gf_sina() moves things randomly left and right, but moves them farther when there is more data nearby. This helps show the shape of the distribution.
A smooth version of the outline is called a violin plot.
library(palmerpenguins)
gf_violin (bill_length_mm ~ species, data = penguins) %>%
gf_sina(bill_length_mm ~ species, data = penguins)
We can increase or decrease adjust from 1 to get a smoother or wigglier version of these plots. This is similar to adjusting the binwidth for a histogram.
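A sketch (larger adjust gives a smoother plot, smaller gives a wigglier one):

gf_violin(bill_length_mm ~ species, data = penguins, adjust = 2)
gf_violin(bill_length_mm ~ species, data = penguins, adjust = 0.5)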
You typically need to experiment a bit with the number of bins (or the binwidth) to get a good view of the data.
You typically need to experiment a bit with binwidth to get a good view of the data.
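A sketch of two ways to control the bins:

gf_histogram( ~ bill_length_mm, data = penguins, bins = 15)
gf_histogram( ~ bill_length_mm, data = penguins, binwidth = 2)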
A density histogram is a histogram that has been rescaled so that the total area is 1.
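A sketch using gf_dhistogram():

gf_dhistogram( ~ bill_length_mm, data = penguins, binwidth = 2)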
Two ways:
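A hedged guess at the two ways (assuming they refer to the filled and open density plots from the list above):

gf_density( ~ bill_length_mm, data = penguins)   # filled
gf_dens( ~ bill_length_mm, data = penguins)      # open (line only)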
Dot plots are like histograms that stack up circles instead of rectangles. This makes it easier to see a representation of each observation, but it can take some fiddling to get the dot sizing just right for your plot. Also, the y-axis scale is not labeled very well for these plots.
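A sketch (the binwidth value is an assumption you will likely need to tune):

gf_dotplot( ~ bill_length_mm, data = penguins, binwidth = 0.5)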
Boxplots require two variables in ggformula.
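A sketch:

gf_boxplot(bill_length_mm ~ species, data = penguins)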
Facets can be added to any of the plots. These examples use histograms.
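A sketch of two equivalent ways to facet:

gf_histogram( ~ bill_length_mm | species, data = penguins)
gf_histogram( ~ bill_length_mm, data = penguins) %>% gf_facet_wrap( ~ species)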
Layers can be stacked on top of each other to build up interesting plots.
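A sketch layering a density curve over a density histogram:

gf_dhistogram( ~ bill_length_mm, data = penguins, binwidth = 2) %>%
  gf_dens( ~ bill_length_mm, data = penguins, color = "red")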
Map the fill property to create stacked bars. Specify fill = ~ followed by the name of your categorical variable:
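A sketch using the HELPrct data:

gf_bar( ~ substance, fill = ~ homeless, data = HELPrct)
gf_bar( ~ substance, fill = ~ homeless, data = HELPrct, position = "dodge")  # dodged instead of stacked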
What if you have pre-summarized data (instead of one row per case, you have a variable that gives the categories and a variable that tells how many observations are in each category)? For example, like the dataset below (which is called NHANESsummary):
PregnantNow | Education | Number_of_Cases |
---|---|---|
Yes | 9 - 11th Grade | 1 |
Yes | High School | 1 |
Yes | Some College | 10 |
Yes | College Grad | 7 |
No | 8th Grade | 16 |
No | 9 - 11th Grade | 49 |
No | High School | 101 |
No | Some College | 240 |
No | College Grad | 214 |
Unknown | High School | 1 |
Unknown | Some College | 8 |
Unknown | College Grad | 5 |
gf_col() can make a bar graph from this summary data.
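A sketch (NHANESsummary is the summary data frame shown above):

gf_col(Number_of_Cases ~ Education, fill = ~ PregnantNow, data = NHANESsummary)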
Two ways:
In ggformula, a pie chart is just a bar chart that uses “polar coordinates”. They look much better if we turn off all the axis labeling with theme_void().
Remember, only use a pie chart if it visually delivers the message you want to send. A pie chart usually works best when there are only a few pie pieces (or one or two big ones and the rest very small). If the pie pieces are similar in size, it will be easier to tell which is bigger with a bar chart rather than a pie. One thing a pie chart can be good for is comparing a proportion to 0.5, since a half-circle has a straight line, which our eyes can detect well.
Regression is an old term that comes from an observation of an effect known as “regression toward the mean”. A more modern term for regression is a linear model, which is much more descriptive of what it is.
In a Simple Linear Model (also called Simple Linear Regression), there is one explanatory variable (x) and one response variable (y). The simple linear model has the form
\[ y_i = b_0 + b_1 x_i + \mbox{residual}_i \] The intercept (\(b_0\)) and slope (\(b_1\)) are chosen so that the sum of the squares of the residuals (abbreviated SSR or SSE) is as small as possible. So another name for this line is the least squares regression line.
A linear model is not always appropriate. Various plots of the data and the model can help us determine whether a linear model is a good idea.
The linear model is fit by changing gf_point() to lm().
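A sketch using the KidsFeet data (the model object name is an assumption):

feet_model <- lm(width ~ length, data = KidsFeet)
feet_model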
Call:
lm(formula = width ~ length, data = KidsFeet)
Coefficients:
(Intercept) length
2.8623 0.2479
We can get all the fitted values using fitted():
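Continuing with the feet_model object created above (an assumed name):

fitted(feet_model)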
1 2 3 4 5 6 7 8
8.912201 9.160149 8.936996 9.110560 9.085765 9.234534 9.333713 8.565075
9 10 11 12 13 14 15 16
8.713843 8.540280 9.680840 9.011381 9.333713 9.556866 9.308918 8.738638
17 18 19 20 21 22 23 24
8.813022 8.986586 9.482481 9.184944 8.813022 8.912201 8.813022 8.936996
25 26 27 28 29 30 31 32
8.862612 9.581660 9.333713 9.184944 8.862612 8.788228 8.813022 8.441101
33 34 35 36 37 38 39
8.936996 8.713843 8.986586 8.540280 9.308918 8.217948 8.961791
To make predictions for explanatory variables of our own choosing, we can use makeFun().
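A sketch, continuing with feet_model; the lengths 22, 23, and 25.5 cm are chosen for illustration:

model_fun <- makeFun(feet_model)
model_fun(length = 22)
model_fun(length = 23)
model_fun(length = 25.5)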
1
8.317127
1
8.565075
1
9.184944
We can get the residuals for each observation using resid():
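Continuing the sketch:

resid(feet_model)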
1 2 3 4 5 6
-0.51220149 -0.36014925 0.76300373 0.68944030 -0.18576493 0.46546642
7 8 9 10 11 12
0.26628731 0.23492537 0.58615672 0.25972015 0.11916045 -0.11138060
13 14 15 16 17 18
-0.23371269 0.24313433 -0.00891791 -0.83863806 -0.11302239 -0.18658582
19 20 21 22 23 24
-0.48248134 0.31505597 0.38697761 -0.31220149 -0.51302239 0.06300373
25 26 27 28 29 30
-0.76261194 -0.18166045 0.16628731 0.31505597 0.03738806 0.51177239
31 32 33 34 35 36
0.48697761 0.15889925 -0.33699627 0.28615672 -0.38658582 -0.04027985
37 38 39
-0.30891791 -0.31794776 -0.16179104
There are several ways to produce different types of summary tables for a linear model.
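A hedged sketch of three common options (matching the three displays below):

summary(feet_model)        # full summary
msummary(feet_model)       # mosaic's more compact summary
broom::tidy(feet_model)    # coefficient table as a tibble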
Call:
lm(formula = width ~ length, data = KidsFeet)
Residuals:
Min 1Q Median 3Q Max
-0.83864 -0.31056 -0.00892 0.27622 0.76300
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8623 1.2081 2.369 0.0232 *
length 0.2480 0.0488 5.081 1.1e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3963 on 37 degrees of freedom
Multiple R-squared: 0.411, Adjusted R-squared: 0.3951
F-statistic: 25.82 on 1 and 37 DF, p-value: 1.097e-05
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8623 1.2081 2.369 0.0232 *
length 0.2480 0.0488 5.081 1.1e-05 ***
Residual standard error: 0.3963 on 37 degrees of freedom
Multiple R-squared: 0.411, Adjusted R-squared: 0.3951
F-statistic: 25.82 on 1 and 37 DF, p-value: 1.097e-05
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.86 1.21 2.37 0.0232
2 length 0.248 0.0488 5.08 0.0000110
The Lady Tasting Tea experiment, in which a lady claims she can tell whether the milk or the tea was poured into the cup first, is a classic example.
Let’s suppose she gets 9 out of 10 correct (\(\hat{p} = 0.90\)). We can use rflip() to generate random guesses and see how often a random guesser gets at least 9 out of 10 correct.
set.seed(12345)
Lady_null <- do(1000) * rflip(10)
gf_histogram(~heads, data = Lady_null, binwidth = 1)
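A sketch of tallying how often a random guesser does at least as well:

tally( ~ (heads >= 9), data = Lady_null)
tally( ~ (prop >= 0.9), data = Lady_null)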
(heads >= 9)
TRUE FALSE
17 983
(prop >= 0.9)
TRUE FALSE
17 983
So our p-value is 17/1000 = 0.017. Since this is quite small, we reject the null hypothesis. We have evidence that she is doing better than just guessing.
Here’s an example based on the game Mitternachtsparty which uses a die with a ghost (Hugo) on it. Let’s test whether Hugo comes up more often than the 1-in-6 chance we would expect for a single face of a fair die.
Data: 16 Hugos in 50 tosses of the die.
To test other proportions, we can use an additional argument to rflip(), namely prob.
Flipping 50 coins [ Prob(Heads) = 0.166666666666667 ] ...
T T T T T T T T T T T H T T T T T T T H T T T T T H T T T T T T T T T T
H T T H T T H T H T T T T T
Number of Heads: 7 [Proportion Heads: 0.14]
set.seed(12345)
Hugo_null <- do(1000) * rflip(50, prob = 1/6)
gf_histogram(~heads, data = Hugo_null, binwidth = 1, fill = ~(heads >= 16))
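A sketch of computing the p-value from the simulation:

tally( ~ (heads >= 16), data = Hugo_null)
prop( ~ (heads >= 16), data = Hugo_null)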
(heads >= 16)
TRUE FALSE
2 998
(prop >= 16/50)
TRUE FALSE
2 998
prop_TRUE
0.002
Malaria <- read.csv('https://rpruim.github.io/s145/data/malaria.csv')
head(Malaria, 3) # see what the first few rows look like
group malaria
1 placebo yes
2 placebo yes
3 treatment yes
group
malaria placebo treatment
no 0 9
yes 6 5
group
malaria placebo treatment
no 0.0000000 0.6428571
yes 1.0000000 0.3571429
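The observed difference in proportions can be computed with diffprop(); a sketch:

diffprop(malaria ~ group, data = Malaria)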
diffprop
0.6428571
response Group min Q1 median Q3 max mean sd n missing
1 Taps Caffeine 245 246.5 248.0 250.00 252 248.3 2.213594 10 0
2 Taps No Caffeine 242 242.5 244.5 246.75 248 244.8 2.394438 10 0
For a larger data set, gf_violin(), gf_density(), or gf_histogram() might be useful.
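A sketch of the observed difference in means, saved as the test statistic (the name test_stat matches the code further below):

test_stat <- diffmean(Taps ~ Group, data = CaffeineTaps); test_stat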
diffmean
-3.5
Taps_null <- do(2000) * diffmean(Taps ~ shuffle(Group), data = CaffeineTaps)
gf_histogram( ~ diffmean, data = Taps_null) %>%
gf_vline(xintercept = ~test_stat)
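A hedged sketch of the p-value computation (the observed difference is negative, so we count randomizations at least that extreme in the same direction):

prop( ~ (diffmean <= test_stat), data = Taps_null)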
prop_TRUE
0.004
To create a confidence interval, we would like to know the shape, center, and standard deviation (the standard error) of the sampling distribution.
The bootstrap allows us to approximate the sampling distribution by resampling (with replacement) from our sample many times and computing our estimate for each resample.
The basic steps are the same each time.
* Create some graphical or numerical summaries.
* Look for anything unusual or unexpected (and figure out what's up before proceeding).
`estimate()` is a function like `mean()`, `median()`, `prop()`, `diffmean()`, `df_stats()`, ...,
that computes your sample statistic of interest -- your estimate.
glimpse(Bootstrap) # see what it looks like and what the variables are called.
gf_histogram( ~ statistic_of_interest, data = Bootstrap) # histogram of bootstrap distribution
The bootstrap distribution in `Bootstrap` can now be used to estimate standard error or to compute
confidence intervals. See the other tabs for specific examples.
A bootstrap distribution is the distribution of some estimate over many resamples from our data. Most often we will compute the bootstrap distribution using the following fill-in-the-blanks template:
# Compute estimate from the data
estimate(____, data = _____)
# Create a bootstrap distribution
Bootstrap <- do(____) * estimate(____, data = resample(_____))
estimate() should be replaced with a function that computes our estimate. Examples: mean(), sd(), prop(), df_stats(), etc.

To get a 95% percentile confidence interval, we take the central 95% of the bootstrap distribution using cdata():
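Continuing the fill-in-the-blanks template (statistic_of_interest is whatever column do() created):

cdata( ~ statistic_of_interest, data = Bootstrap, p = 0.95)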
We can estimate the standard error by taking the standard deviation of our bootstrap distribution. If the bootstrap distribution is approximately normal, then our interval has the form
\[ \mathrm{estimate} \pm 2 SE \]
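The “2” is really the 97.5th percentile of a standard normal distribution:

qnorm(0.975)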
[1] 1.959964
The most common thing to estimate for a quantitative variable is the mean, but we could estimate other things like the standard deviation, the median, or some other quantile.
library(mosaic)
library(NHANES)
Men50s <- NHANES %>% filter(Age >= 50, Age <= 59, Gender == "male")
estimate <- mean( ~Pulse, data = Men50s, na.rm = TRUE); estimate
[1] 69.85106
response mean_pulse
1 Pulse 69.85106
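A hedged sketch of the bootstrap and the percentile interval (Men_boot is an assumed name):

Men_boot <- do(1000) * mean( ~ Pulse, data = resample(Men50s), na.rm = TRUE)
gf_histogram( ~ mean, data = Men_boot)
cdata( ~ mean, data = Men_boot, p = 0.95)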
lower upper central.p
2.5% 68.97708 70.75522 0.95
Smoke100
No Yes
299 382
prop_No
0.4390602
prop_Yes
0.5609398
Smoke_boot <- do(1000) * prop( ~Smoke100, data = resample(Men50s), success = "Yes")
gf_histogram( ~prop_Yes, data = Smoke_boot)
lower upper central.p
2.5% 0.5256975 0.5991189 0.95
The SE interval is similar:
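A sketch, using the Smoke_boot object created above:

SE <- sd( ~ prop_Yes, data = Smoke_boot); SE
prop( ~ Smoke100, data = Men50s, success = "Yes") - 2 * SE
prop( ~ Smoke100, data = Men50s, success = "Yes") + 2 * SE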
[1] 0.01855384
prop_Yes
0.5238321
prop_Yes
0.5980475
Categorical data is easy to summarise. We can use rflip() to work with summarized data instead of raw data.
Smoke100
No Yes
299 382
[1] 681
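A hedged sketch: with only the summary counts (382 Yes out of 681), rflip() can generate the resamples (the object name is an assumption):

Smoke_boot2 <- do(1000) * rflip(681, prob = 382/681)
cdata( ~ prop, data = Smoke_boot2, p = 0.95)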
lower upper central.p
2.5% 0.5212922 0.5962188 0.95
[1] 0.01932551
[1] 0.4004092
[1] 0.4777112
How different are the mean pulse for men and for women in their 20s?
In our sample, the mean pulse for women is about 4.5 beats per minute faster.
NHANES20s <- NHANES %>% filter(Age >= 20, Age <= 29)
df_stats(Pulse ~ Gender, data = NHANES20s, mean)
response Gender mean
1 Pulse female 77.00307
2 Pulse male 72.56012
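A sketch of the observed difference in means:

diffmean(Pulse ~ Gender, data = NHANES20s, na.rm = TRUE)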
diffmean
-4.442946
Now let’s create a bootstrap distribution to see how precise that estimate is.
Pulse_boot <- do(1000) * diffmean(Pulse ~ Gender, data = resample(NHANES20s), na.rm = TRUE)
gf_histogram( ~ diffmean, data = Pulse_boot)
lower upper central.p
2.5% -5.659382 -3.246555 0.95
The SE interval is similar:
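A sketch (estimate and SE are assumed names):

SE <- sd( ~ diffmean, data = Pulse_boot); SE
estimate <- diffmean(Pulse ~ Gender, data = NHANES20s, na.rm = TRUE); estimate
estimate + c(-2, 2) * SE   # estimate plus or minus 2 SE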
[1] 0.6119071
diffmean
-4.442946
Since this interval is entirely below 0, we have evidence that the men’s mean pulse is slower than the women’s (but not by a lot since the interval doesn’t stretch very far from 0).
How different are the smoking rates for men and for women in their 20s?
In our sample, men are a bit more than 11% more likely to smoke than women are.
NHANES20s <- NHANES %>% filter(Age >= 20, Age <= 29)
tally(Smoke100 ~ Gender, data = NHANES20s, margins = TRUE, format = 'prop')
Gender
Smoke100 female male
No 0.6725404 0.5600000
Yes 0.3274596 0.4400000
Total 1.0000000 1.0000000
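A sketch of the observed difference in proportions:

diffprop(Smoke100 ~ Gender, data = NHANES20s)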
diffprop
-0.1125404
Now let’s create a bootstrap distribution to see how precise that estimate is.
Smoke_boot <- do(1000) * diffprop(Smoke100 ~ Gender, data = resample(NHANES20s))
gf_histogram( ~ diffprop, data = Smoke_boot)
lower upper central.p
2.5% -0.164021 -0.05981467 0.95
A standard error interval is similar:
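A sketch:

SE <- sd( ~ diffprop, data = Smoke_boot); SE
diffprop(Smoke100 ~ Gender, data = NHANES20s) - 2 * SE
diffprop(Smoke100 ~ Gender, data = NHANES20s) + 2 * SE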
[1] 0.02662732
diffprop
-0.165795
diffprop
-0.05928574
If we do our resampling as in the previous example, we are likely to get a different number of cases in each group in our resamples compared to the original data (and to each other).
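A sketch illustrating this (each resample gives different group sizes):

tally( ~ Gender, data = NHANES20s)
tally( ~ Gender, data = resample(NHANES20s))
tally( ~ Gender, data = resample(NHANES20s))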
Gender
female male
681 675
Gender
female male
652 704
Gender
female male
687 669
We might prefer to do our resampling separately for each of the two groups. There are at least two reasons why we might prefer to resample within groups:
* The study design may have fixed the number of cases in each group in advance, and our resamples should reflect that design.
* If one group is much smaller than the other, an ordinary resample could contain very few cases from that group.
Some people resample within groups as their primary resampling method. (Our text describes resampling for differences in means and proportions that way, for example.) So you could do it that way all the time. But it is important to resample within groups if either of the two conditions above applies to our situation.
We will illustrate resampling within groups using our comparison of smoking rates for men and women, even though in this study no attempt was made to predetermine how many men and women were in the study. All we need to do is add a groups argument to our resample() command to tell R what variable determines the groups.
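A sketch (now every resample has the same number of men and women as the original data):

tally( ~ Gender, data = resample(NHANES20s, groups = Gender))
tally( ~ Gender, data = resample(NHANES20s, groups = Gender))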
Gender
female male
681 675
Gender
female male
681 675
Gender
female male
681 675
Smoke_boot <- do(1000) * diffprop(Smoke100 ~ Gender, data = resample(NHANES20s, groups = Gender))
gf_histogram( ~ diffprop, data = Smoke_boot)
lower upper central.p
2.5% -0.1612916 -0.06231702 0.95
We can also resample within groups when looking at a difference in means.
Sometimes we have two measurements for each case in our data and we want to compare them. Pre- and post- measurements (before and after some intervention or some amount of time) are a common situation where this occurs. If we are interested in the amount of change, we can compute the difference or ratio between these two measurements to get a single variable measuring the change. At that point, we are back to having one quantitative variable and everything proceeds as before.
A 2004 paper by Gettelfinger and Cussler investigated whether people can swim faster or slower in syrup than in water. They timed 18 swimmers of various abilities in regular water and in “syrup” (water thickened with guar gum). They reported their data as a ratio of times in syrup over times in water (so a number greater than 1 means slower in syrup and a number less than 1 means faster in syrup).
Pairing each swimmer with himself/herself was designed to compensate for different swimming abilities. On average, swimmers were 1.2% slower in syrup.
relative.speed
1 1.08
2 0.94
3 1.07
4 1.04
5 1.02
6 1.01
response mean_rel.speed
1 relative.speed 1.011667
Let’s get a 95% confidence interval for that estimate.
Syrup_boot <-
do(1000) * df_stats(~relative.speed, data = resample(SyrupSwimming), mean_rel.speed = mean)
gf_histogram(~mean_rel.speed, data = Syrup_boot)
lower upper central.p
2.5% 0.9922222 1.031111 0.95
The SE interval is similar:
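A sketch:

SE <- sd( ~ mean_rel.speed, data = Syrup_boot); SE
estimate <- mean( ~ relative.speed, data = SyrupSwimming)
estimate - 2 * SE
estimate + 2 * SE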
[1] 0.009617349
[1] 0.992432
[1] 1.030901
Our interval suggests that it is plausible that the syrup times are between 1% faster and 3% slower than the water times. If there is a difference, it doesn’t seem to be very large, and it isn’t certain which is faster, swimming in syrup or in water.
This example goes back to an old conversation between physicists Isaac Newton and Christiaan Huygens about how the viscosity of a liquid would affect an object’s speed through the liquid. The paper by Gettelfinger and Cussler sparked renewed interest in this topic.
If your raw data has the two measurements not yet combined, the first step is to create a new variable. We can do that with the mutate() function.
The ucla_textbooks_f18 data set has prices for textbooks at the campus bookstore and at Amazon. Let’s see how different the prices are on average. Since different books cost different amounts, we want to compare each book’s price in the bookstore to that book’s price on Amazon. That’s a paired design.
We could use subtraction for absolute differences or division for percentage differences. Which makes more sense depends a bit on what you believe about pricing: Do you expect roughly the same difference for books in all price ranges, or do you expect more difference when books are more expensive? We can also use the data to investigate which of these seems to be the case.
Rows: 201
Columns: 20
$ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018…
$ term <fct> Fall, Fall, Fall, Fall, Fall, Fall, Fall, Fall, Fall…
$ subject <fct> American Indian Studies, Anthropology, Art, Arts and…
$ subject_abbr <fct> AM IND, ANTHRO, NA, ART&ARC, NA, ASTR, BMD RES, CHEM…
$ course <fct> "Introduction to American Indian Studies", "Archaeol…
$ course_num <fct> M10, 2, 11D, 10, M60W, 4, 5HA, 14A, 14B, M5A, 10, 1E…
$ course_numeric <int> 10, 2, 11, 10, 60, 4, 5, 14, 14, 5, 10, 1, 2, 9, 10,…
$ seminar <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ ind_study <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ apprenticeship <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ internship <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ honors_contracts <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ laboratory <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ special_topic <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ textbook_isbn <chr> "9781138477858", "9780307741806", NA, "9780979757549…
$ bookstore_new <dbl> 47.97, 14.26, NA, 13.50, 49.26, 119.97, NA, 188.10, …
$ bookstore_used <dbl> 44.97, 10.96, NA, 11.00, 43.26, 67.00, NA, 149.00, 1…
$ amazon_new <dbl> 47.45, 13.55, NA, 12.53, 54.95, 124.80, NA, NA, NA, …
$ amazon_used <dbl> 51.20, 7.10, NA, NA, 24.83, 47.75, NA, NA, 90.00, NA…
$ notes <fct> , , NA, , NA, NA, NA, , New version only from resell…
It does look like the price swings are larger for more expensive books, so let’s use a ratio. (Note also that 133 books are missing at least one price. We will remove those below.)
Books <-
ucla_textbooks_f18 %>%
mutate(price_ratio = bookstore_new / amazon_new) %>%
filter(!is.na(price_ratio))
df_stats(~price_ratio, data = Books, mean_ratio = mean)
response mean_ratio
1 price_ratio 1.065878
Looks like bookstore prices are 6% higher on average. Let’s create a confidence interval for that ratio.
Books_boot <-
do(1000) * df_stats(~price_ratio, data = resample(Books), mean_ratio = mean)
gf_histogram(~ mean_ratio, data = Books_boot)
lower upper central.p
2.5% 1.02792 1.111962 0.95
Our 95% percentile confidence interval suggests that the average bookstore price is between 2.8% and 11.2% higher than the Amazon price for the same book.
Because the bootstrap distribution is a bit skewed, we won’t compute the SE confidence interval for this example.
Using df_stats():
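A sketch, assuming the BodyTemp50 data (from the Lock5Data package):

library(Lock5Data)
df_stats( ~ BodyTemp, data = BodyTemp50, mean)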
response mean
1 BodyTemp 98.26
# mean body temp in 1 bootstrap sample
# * use resample(BodyTemp50) instead of BodyTemp50
df_stats( ~ BodyTemp, data = resample(BodyTemp50), mean)
response mean
1 BodyTemp 98.222
# do that 3 times (so we can see what the result looks like)
do(3) * df_stats( ~ BodyTemp, data = resample(BodyTemp50), mean)
response mean .row .index
1 BodyTemp 98.170 1 1
2 BodyTemp 98.344 1 2
3 BodyTemp 98.306 1 3
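A sketch of the full bootstrap distribution (the names BodyTemp.boot and mean_BodyTemp match the code further below):

BodyTemp.boot <- do(1000) * df_stats( ~ BodyTemp, data = resample(BodyTemp50), mean_BodyTemp = mean)
head(BodyTemp.boot)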
response mean_BodyTemp .row .index
1 BodyTemp 98.196 1 1
2 BodyTemp 98.158 1 2
3 BodyTemp 98.114 1 3
4 BodyTemp 98.394 1 4
5 BodyTemp 98.266 1 5
6 BodyTemp 98.304 1 6
This method is only accurate if the bootstrap distribution is approximately normal and our confidence level is 95%.
# could use df_stats() here, but is easier to do arithmetic this way
SE <- sd( ~ mean_BodyTemp, data = BodyTemp.boot); SE
[1] 0.1065861
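A sketch of the interval (98.26 is the sample mean computed above):

98.26 + c(-2, 2) * SE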
[1] 98.04683 98.47317
This method will work better if the bootstrap distribution is skewed or has heavier or lighter tails than a normal distribution. It is also easier to adjust this method for use with other confidence levels.
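A sketch at the 95% and 99% levels:

cdata( ~ mean_BodyTemp, data = BodyTemp.boot, p = 0.95)
cdata( ~ mean_BodyTemp, data = BodyTemp.boot, p = 0.99)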
lower upper central.p
2.5% 98.05195 98.47 0.95
lower upper central.p
0.5% 98.00595 98.52005 0.99
Using mean():
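A sketch:

mean( ~ BodyTemp, data = BodyTemp50)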
[1] 98.26
# mean body temp in 1 bootstrap sample
# * use resample(BodyTemp50) instead of BodyTemp50
mean( ~ BodyTemp, data = resample(BodyTemp50))
[1] 98.432
# do that 3 times (so we can see what the result looks like)
do(3) * mean( ~ BodyTemp, data = resample(BodyTemp50))
mean
1 98.342
2 98.116
3 98.296
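A sketch of the full bootstrap distribution (the object name is an assumption):

BodyTemp.boot <- do(1000) * mean( ~ BodyTemp, data = resample(BodyTemp50))
head(BodyTemp.boot)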
mean
1 98.044
2 98.204
3 98.376
4 98.290
5 98.184
6 98.200
This method will work better if the bootstrap distribution is skewed or has heavier or lighter tails than a normal distribution. It is also easier to adjust this method for use with other confidence levels.
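A sketch at the 95% and 99% levels:

cdata( ~ mean, data = BodyTemp.boot, p = 0.95)
cdata( ~ mean, data = BodyTemp.boot, p = 0.99)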
lower upper central.p
2.5% 98.05595 98.48405 0.95
lower upper central.p
0.5% 98.006 98.57203 0.99
What percentage of Americans have BMI at least 30?
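A hedged sketch using the NHANES data (which has missing BMI values):

prop( ~ (BMI >= 30), data = NHANES, na.rm = TRUE)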
prop_TRUE
0.2871082
prop_TRUE
1 0.2909921
2 0.2877496
3 0.2884436
prop_TRUE
1 0.2909921
2 0.2877496
3 0.2884436
4 0.2898641
5 0.2822505
6 0.2888173
We can use this method if the bootstrap distribution is pretty unimodal, symmetric, and bell-shaped.
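A hedged sketch (BMI_boot is an assumed name for the bootstrap distribution):

BMI_boot <- do(1000) * prop( ~ (BMI >= 30), data = resample(NHANES), na.rm = TRUE)
estimate <- prop( ~ (BMI >= 30), data = NHANES, na.rm = TRUE); estimate
SE <- sd( ~ prop_TRUE, data = BMI_boot); SE
estimate + c(-2, 2) * SE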
prop_TRUE
0.2871082
[1] 0.00464423
[1] 0.2778197 0.2963966
To compute a probability from an x-value, we can use pnorm() (or, for output including a figure as well as the numeric value, xpnorm()). These functions report the area to the left of the specified x-value under a Normal curve with a certain mean and sd:
For example, find the probability of getting a value less than 0 from a normal distribution with mean 3 and standard deviation 5:
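In R:

pnorm(0, mean = 3, sd = 5)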
[1] 0.2742531
If we add the input lower.tail = FALSE, then the area above the stated x-value will be computed instead:
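For example:

pnorm(0, mean = 3, sd = 5, lower.tail = FALSE)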
[1] 0.7257469
We can use the function qnorm() (or xqnorm()) to find x-values corresponding to probabilities.
For example, if we want to find the lower bound of a 95% CI, knowing that the relevant sampling distribution is well modelled by a normal distribution centered at 14.5 with a standard error of 2.7:
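In R:

qnorm(0.025, mean = 14.5, sd = 2.7)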
[1] 9.208097
The upper bound of the same CI, two ways:
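A hedged sketch of two ways that give the same value:

qnorm(0.975, mean = 14.5, sd = 2.7)
14.5 + qnorm(0.975) * 2.7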
[1] 19.7919
[1] 19.7919
In a scenario where we are considering a random variable \(X\) that has a binomial distribution with number of trials \(n\) and probability of success \(p\), we can compute the probability density at \(k\) (the chance of getting exactly \(k\) successes) as:
\[ P(X=k) = {n \choose k} p^k (1-p)^{(n-k)}\] Or, in R, with dbinom(k, size = n, prob = p).
For example, imagine an airline flies planes with 120 seats, but they sell 125 tickets for each flight, overbooking them all! Each passenger has a 0.88 chance of showing up and flying (and a 0.12 chance of missing the flight or being a no-show). The situation can be modelled with a binomial distribution (the passengers can be considered independent trials). We want to find the probability that 124 people show up for a flight. Here, \(n=125\) and \(p=0.88\), so we want to find
\[ P(X=124) = {125 \choose 124} 0.88^{124} (1-0.88)^{(125-124)}\]
In R,
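A sketch of the call:

dbinom(124, size = 125, prob = 0.88)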
[1] 1.958586e-06
If we wanted to figure out the probability that more than 120 people would show up, we could find the (mutually exclusive) probabilities that \(k\) is 121, 122, 123, 124, and 125 and add them up. Or… (see the “Probabilities” tab for a shortcut…)
In a scenario where we are considering a random variable \(X\) that has a binomial distribution, with number of trials \(n\) and probability of success \(p\), we can compute the probability density at \(k\) (the chance of getting exactly \(k\) successes) as:
\[ P(X=k) = {n \choose k} p^k (1-p)^{(n-k)}\]
It follows that the probability of X being \(k\) or less is:
\[ P(X \leq K) = \sum_{k=0}^K {n \choose k} p^k (1-p)^{(n-k)}\]
This is the discrete version of finding “the area under the distribution curve” for \(X \leq k\). We can use the R function pbinom() to compute it:
For example, imagine an airline flies planes with 120 seats, but they sell 125 tickets for each flight, over-booking them all! Each passenger has a 0.88 chance of showing up and flying (and a 0.12 chance of missing the flight or being a no-show). The situation can be modelled with a binomial distribution (the passengers can be considered independent trials). We want to find the probability that more than 120 show up for a flight. Here, \(n=125\) and \(p=0.88\), and we need to compute
\[ 1 - P(X \leq 120) = 1 - \sum_{k=0}^{120} {n \choose k} p^k (1-p)^{(n-k)}\]
(The “1-” is because we are computing the probability of getting no more than 120 passengers on the 120-seat flight, and we were asked to compute the probability of getting more than 120 passengers.)
In R, we would compute:
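A sketch of the call:

1 - pbinom(120, size = 125, prob = 0.88)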
[1] 0.000496259
Or alternatively, we could find \(P(X > K)\) instead of \(P(X \leq K)\) by adding the input lower.tail = FALSE, and find the same answer via:
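A sketch of the call:

pbinom(120, size = 125, prob = 0.88, lower.tail = FALSE)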
[1] 0.000496259