This page provides a number of mini-examples demonstrating how to do specific tasks. Use the menu to select the topic you are interested in to view an example related to that topic.
Note: Depending on your browser, you may need to scroll to see all of the menu items.
To use a data set in a package, you must either load the entire package or select just the data set you need.
library(mosaicData) # load a package (and all its datasets)
data(KidsFeet, package = "mosaicData") # grab just one data set from a package
find("KidsFeet") # see if there is a loaded package w/ KidsFeet
[1] ".GlobalEnv" "package:mosaicData"
To use a dataset in an R Markdown document, you must load the package in the R Markdown document. A good place for this is in the setup chunk.
Some packages, including oibiostat, require you to use data() any time you want to load a data set, but most will make all their data sets available as soon as you load the package.
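If your data live in a CSV file, read.csv() will import them. A sketch, assuming a CSV version of the campus crime data at a path like the one used for the Excel example below:

Crime <- read.csv("../../data/CampusCrime.csv")  # path is an assumption
names(Crime)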
[1] "region" "type" "enrollment"
[4] "property_crimes" "violent_crimes" "property_crime_rate"
[7] "violent_crime_rate" "overall_crime"
In RStudio you can also use “Import Dataset” in the Environment pane.
library(readxl) # or require(readxl)
Crime2 <- read_excel("../../data/CampusCrime.xlsx", sheet = 1)
names(Crime2)
[1] "region" "type" "enrollment"
[4] "property_crimes" "violent_crimes" "property_crime_rate"
[7] "violent_crime_rate" "overall_crime"
In RStudio you can also use “Import Dataset” in the Environment pane.
If you have data in a Google Spreadsheet, you have some options.
You can export the file as a CSV or Excel file from Google and then import that into R.
You can also load data directly from Google. First, in the File menu on Google Sheets, choose “Publish to the Web” and publish as a CSV file. Read in the file by providing the resulting URL (in quotes) to read.csv().
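A minimal sketch, assuming you have published the sheet as CSV and copied the resulting URL (the URL below is a placeholder, not a real address):

MyData <- read.csv("https://docs.google.com/spreadsheets/d/e/.../pub?output=csv")
head(MyData)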
There is also a googlesheets4 package which provides direct access to Google spreadsheets. The details for using Google Sheets change from time to time, so it’s best to consult the googlesheets4 package documentation.
There are a number of different ways you can inspect your data to learn about it.
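glimpse() (from dplyr) shows one row per variable; a sketch assuming the campus crime data is named Crime:

library(dplyr)
glimpse(Crime)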
Rows: 81
Columns: 8
$ region <chr> "SE", "SE", "W", "W", "SW", "SW", "W", "W", "C", …
$ type <chr> "U", "C", "U", "C", "U", "U", "U", "C", "U", "C",…
$ enrollment <dbl> 5590, 540, 35747, 28176, 10568, 3127, 20675, 1254…
$ property_crimes <dbl> 266, 10, 1233, 210, 116, 29, 284, 43, 660, 94, 93…
$ violent_crimes <dbl> 30, 0, 23, 1, 1, 0, 7, 0, 19, 4, 3, 6, 26, 23, 7,…
$ property_crime_rate <dbl> 475.84973, 185.18519, 344.92405, 74.53152, 109.76…
$ violent_crime_rate <dbl> 53.6672630, 0.0000000, 6.4341064, 0.3549120, 0.94…
$ overall_crime <dbl> 296, 10, 1256, 211, 117, 29, 291, 43, 679, 98, 96…
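inspect() (from the mosaic package) gives separate summaries for categorical and quantitative variables; a sketch of a call that produces output like the display below:

library(mosaic)
inspect(Crime)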
categorical variables:
name class levels n missing
1 region character 6 81 0
2 type character 2 81 0
distribution
1 NE (25.9%), C (21%), SE (18.5%) ...
2 U (65.4%), C (34.6%)
quantitative variables:
name class min Q1 median
...1 enrollment numeric 540.000000 4638.0000000 11321.0000
...2 property_crimes numeric 8.000000 64.0000000 120.0000
...3 violent_crimes numeric 0.000000 1.0000000 3.0000
...4 property_crime_rate numeric 8.639309 94.2211055 141.9499
...5 violent_crime_rate numeric 0.000000 0.7491759 2.4888
...6 overall_crime numeric 9.000000 65.0000000 127.0000
Q3 max mean sd n missing
...1 22396.000000 46597.00000 13899.308642 10501.866624 81 0
...2 293.000000 1293.00000 240.790123 270.200653 81 0
...3 8.000000 30.00000 5.938272 7.404974 81 0
...4 208.581416 475.84973 160.981380 97.971006 81 0
...5 6.434106 53.66726 4.978590 7.906354 81 0
...6 301.000000 1311.00000 246.728395 275.486389 81 0
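Base R's summary() gives a quick overview as well; a sketch, again assuming the data frame is named Crime:

summary(Crime)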
region type enrollment property_crimes
Length:81 Length:81 Min. : 540 Min. : 8.0
Class :character Class :character 1st Qu.: 4638 1st Qu.: 64.0
Mode :character Mode :character Median :11321 Median : 120.0
Mean :13899 Mean : 240.8
3rd Qu.:22396 3rd Qu.: 293.0
Max. :46597 Max. :1293.0
violent_crimes property_crime_rate violent_crime_rate overall_crime
Min. : 0.000 Min. : 8.639 Min. : 0.0000 Min. : 9.0
1st Qu.: 1.000 1st Qu.: 94.221 1st Qu.: 0.7492 1st Qu.: 65.0
Median : 3.000 Median :141.950 Median : 2.4888 Median : 127.0
Mean : 5.938 Mean :160.981 Mean : 4.9786 Mean : 246.7
3rd Qu.: 8.000 3rd Qu.:208.581 3rd Qu.: 6.4341 3rd Qu.: 301.0
Max. :30.000 Max. :475.850 Max. :53.6673 Max. :1311.0
If you are working interactively (not in an R Markdown document), you might try using View(). This gives you a spreadsheet-like view that you can sort and filter (but cannot edit).
The result will look something (but not quite exactly) like this in RStudio:
pander() prints a “pretty” version of your data set and works in both PDF and HTML formats.
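A sketch, assuming the KidsFeet data from mosaicData:

library(pander)
library(mosaicData)
pander(KidsFeet)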
name | birthmonth | birthyear | length | width | sex | biggerfoot | domhand |
---|---|---|---|---|---|---|---|
David | 5 | 88 | 24.4 | 8.4 | B | L | R |
Lars | 10 | 87 | 25.4 | 8.8 | B | L | L |
Zach | 12 | 87 | 24.5 | 9.7 | B | R | R |
Josh | 1 | 88 | 25.2 | 9.8 | B | L | R |
Lang | 2 | 88 | 25.1 | 8.9 | B | L | R |
Scotty | 3 | 88 | 25.7 | 9.7 | B | R | R |
Edward | 2 | 88 | 26.1 | 9.6 | B | L | R |
Caitlin | 6 | 88 | 23 | 8.8 | G | L | R |
Eleanor | 5 | 88 | 23.6 | 9.3 | G | R | R |
Damon | 9 | 88 | 22.9 | 8.8 | B | R | L |
Mark | 9 | 87 | 27.5 | 9.8 | B | R | R |
Ray | 3 | 88 | 24.8 | 8.9 | B | L | R |
Cal | 8 | 87 | 26.1 | 9.1 | B | L | R |
Cam | 3 | 88 | 27 | 9.8 | B | L | R |
Julie | 11 | 87 | 26 | 9.3 | G | L | R |
Kate | 4 | 88 | 23.7 | 7.9 | G | R | R |
Caroline | 12 | 87 | 24 | 8.7 | G | R | L |
Maggie | 3 | 88 | 24.7 | 8.8 | G | R | R |
Lee | 6 | 88 | 26.7 | 9 | G | L | L |
Heather | 3 | 88 | 25.5 | 9.5 | G | R | R |
Andy | 6 | 88 | 24 | 9.2 | B | R | R |
Josh | 7 | 88 | 24.4 | 8.6 | B | L | R |
Laura | 9 | 88 | 24 | 8.3 | G | R | L |
Erica | 9 | 88 | 24.5 | 9 | G | L | R |
Peggy | 10 | 88 | 24.2 | 8.1 | G | L | R |
Glen | 7 | 88 | 27.1 | 9.4 | B | L | R |
Abby | 2 | 88 | 26.1 | 9.5 | G | L | R |
David | 12 | 87 | 25.5 | 9.5 | B | R | R |
Mike | 11 | 88 | 24.2 | 8.9 | B | L | R |
Dwayne | 8 | 88 | 23.9 | 9.3 | B | R | L |
Danielle | 6 | 88 | 24 | 9.3 | G | L | R |
Caitlin | 7 | 88 | 22.5 | 8.6 | G | R | R |
Leigh | 3 | 88 | 24.5 | 8.6 | G | L | R |
Dylan | 4 | 88 | 23.6 | 9 | B | R | L |
Peter | 4 | 88 | 24.7 | 8.6 | B | R | L |
Hannah | 3 | 88 | 22.9 | 8.5 | G | L | R |
Teshanna | 3 | 88 | 26 | 9 | G | L | R |
Hayley | 1 | 88 | 21.6 | 7.9 | G | R | R |
Alisha | 9 | 88 | 24.6 | 8.8 | G | L | R |
Don’t forget na.rm = TRUE if you have missing data.
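A sketch using the palmerpenguins data (which has a few missing bill lengths); the same formula works for sd():

library(mosaic)
library(palmerpenguins)
mean(bill_length_mm ~ species, data = penguins, na.rm = TRUE)
sd(bill_length_mm ~ species, data = penguins, na.rm = TRUE)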
Adelie Chinstrap Gentoo
38.79139 48.83382 47.50488
Adelie Chinstrap Gentoo
38.79139 48.83382 47.50488
Adelie Chinstrap Gentoo
38.79139 48.83382 47.50488
Adelie Chinstrap Gentoo
2.663405 3.339256 3.081857
Adelie Chinstrap Gentoo
2.663405 3.339256 3.081857
Adelie Chinstrap Gentoo
2.663405 3.339256 3.081857
You have three options:
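A hedged sketch of what the three options might look like (using the iris data; all three give the same result):

sd(Sepal.Length ~ Species, data = iris)
sd( ~ Sepal.Length | Species, data = iris)
sd( ~ Sepal.Length, groups = Species, data = iris)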
setosa versicolor virginica
0.3524897 0.5161711 0.6358796
setosa versicolor virginica
0.3524897 0.5161711 0.6358796
setosa versicolor virginica
0.3524897 0.5161711 0.6358796
var() works just like sd() and computes the variance (the square of the standard deviation).
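For example:

var( ~ Sepal.Length, data = iris)          # overall variance
var(Sepal.Length ~ Species, data = iris)   # variance within each species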
[1] 0.6856935
setosa versicolor virginica
0.1242490 0.2664327 0.4043429
setosa versicolor virginica
0.1242490 0.2664327 0.4043429
setosa versicolor virginica
0.1242490 0.2664327 0.4043429
Other statistics work just like mean and standard deviation.
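A sketch (using iris again; these mosaic functions all accept the same formula interface and na.rm argument):

median( ~ Sepal.Length, data = iris)
iqr( ~ Sepal.Length, data = iris)
max( ~ Sepal.Length, data = iris)
min( ~ Sepal.Length, data = iris)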
[1] 5.8
[1] 1.3
[1] 7.9
[1] 4.3
df_stats() provides a convenient way to produce numerical summaries in a tabular format. After the formula and data, just list the summary statistics you want.
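A sketch that produces a table like the one below (bill depths from palmerpenguins):

library(palmerpenguins)
df_stats( ~ bill_depth_mm, data = penguins, mean, sd, min, max)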
response mean sd min max
1 bill_depth_mm 17.15117 1.974793 13.1 21.5
You can choose custom names if you like.
library(palmerpenguins)
df_stats( ~ bill_length_mm, data = penguins,
x.bar = mean, sd = sd, shortest = min, longest = max)
response x.bar sd shortest longest
1 bill_length_mm 43.92193 5.459584 32.1 59.6
We can compute our statistics within groups.
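A sketch with the grouping variable on the right of the formula (the column names are assumptions chosen to match the output below):

df_stats(bill_length_mm ~ species, data = penguins, x_bar = mean, sd = sd)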
response species x_bar sd
1 bill_length_mm Adelie 38.79139 2.663405
2 bill_length_mm Chinstrap 48.83382 3.339256
3 bill_length_mm Gentoo 47.50488 3.081857
response species x_bar sd
1 bill_length_mm Adelie 38.79139 2.663405
2 bill_length_mm Chinstrap 48.83382 3.339256
3 bill_length_mm Gentoo 47.50488 3.081857
For categorical data, we probably want to compute counts or proportions.
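A hedged sketch using the counts and props helpers with df_stats(); the second and third calls assume the HELPrct data from mosaicData:

df_stats( ~ species, data = penguins, counts, props)
df_stats( ~ substance, data = HELPrct, counts, props)
df_stats(substance ~ sex, data = HELPrct, counts, props)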
response n_Adelie n_Chinstrap n_Gentoo prop_Adelie prop_Chinstrap prop_Gentoo
1 species 152 68 124 0.4418605 0.1976744 0.3604651
response n_alcohol n_cocaine n_heroin prop_alcohol prop_cocaine prop_heroin
1 substance 177 152 124 0.3907285 0.3355408 0.2737307
response sex n_alcohol n_cocaine n_heroin prop_alcohol prop_cocaine
1 substance female 36 41 30 0.3364486 0.3831776
2 substance male 141 111 94 0.4075145 0.3208092
prop_heroin
1 0.2803738
2 0.2716763
We can compute summary statistics for multiple response variables at once.
library(palmerpenguins)
df_stats( bill_length_mm + bill_depth_mm ~ species, data = penguins, mean, sd, median, iqr)
response species mean sd median iqr
1 bill_length_mm Adelie 38.79139 2.6634048 38.80 4.000
2 bill_length_mm Chinstrap 48.83382 3.3392559 49.55 4.725
3 bill_length_mm Gentoo 47.50488 3.0818574 47.30 4.250
4 bill_depth_mm Adelie 18.34636 1.2166498 18.40 1.500
5 bill_depth_mm Chinstrap 18.42059 1.1353951 18.45 1.900
6 bill_depth_mm Gentoo 14.98211 0.9812198 15.00 1.500
If you don’t provide any statistics, you get some favorites:
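For example:

df_stats( ~ bill_length_mm, data = penguins)
df_stats(bill_length_mm ~ species, data = penguins)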
response min Q1 median Q3 max mean sd n missing
1 bill_length_mm 32.1 39.225 44.45 48.5 59.6 43.92193 5.459584 342 2
response species min Q1 median Q3 max mean sd n
1 bill_length_mm Adelie 32.1 36.75 38.80 40.750 46.0 38.79139 2.663405 151
2 bill_length_mm Chinstrap 40.9 46.35 49.55 51.075 58.0 48.83382 3.339256 68
3 bill_length_mm Gentoo 40.9 45.30 47.30 49.550 59.6 47.50488 3.081857 123
missing
1 1
2 0
3 1
favstats() is similar to df_stats() without any statistics specified.
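For example:

favstats( ~ bill_length_mm, data = penguins)
favstats(bill_length_mm ~ species, data = penguins)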
min Q1 median Q3 max mean sd n missing
32.1 39.225 44.45 48.5 59.6 43.92193 5.459584 342 2
species min Q1 median Q3 max mean sd n missing
1 Adelie 32.1 36.75 38.80 40.750 46.0 38.79139 2.663405 151 1
2 Chinstrap 40.9 46.35 49.55 51.075 58.0 48.83382 3.339256 68 0
3 Gentoo 40.9 45.30 47.30 49.550 59.6 47.50488 3.081857 123 1
The correlation coefficient is a bit different because it requires two quantitative variables, so our basic formula shape is y ~ x.
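A sketch that reproduces the value below (KidsFeet foot width vs. length):

library(mosaicData)
cor(width ~ length, data = KidsFeet)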
[1] 0.6410961
[1] 0.6410961
Dealing with missing data when computing a correlation coefficient is a bit different from how we did it for our other numerical summaries because (a) there are two variables involved instead of just one, and (b) the cor() function is set up to compute many correlation coefficients all at once, so there are several different rules that could be used for handling missing data.
In short: instead of na.rm = TRUE, we use use = "complete.obs" or use = "pairwise.complete.obs".
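A hedged sketch using the Gestation data from mosaicData (mother’s and father’s heights, which have missing values):

cor(dht ~ ht, data = Gestation)                        # NA because of missing values
cor(dht ~ ht, data = Gestation, use = "complete.obs")  # drop incomplete pairs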
[1] NA
[1] 0.3408627
If a data frame has only numeric variables (or if we select only numeric variables), then cor() will compute the correlation coefficients of all pairs of variables at once and return them in a table.
Now 'complete.obs' and 'pairwise.complete.obs' might be different.
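A hedged sketch of the two versions, selecting five numeric variables from Gestation:

library(dplyr)
Gestation %>% select(gestation, wt, age, ht, dht) %>% cor(use = "complete.obs")
Gestation %>% select(gestation, wt, age, ht, dht) %>% cor(use = "pairwise.complete.obs")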
gestation wt age ht dht
gestation 1.00000000 0.39189771 -0.06264927 0.02960305 0.01128574
wt 0.39189771 1.00000000 0.03250378 0.21198453 0.10433742
age -0.06264927 0.03250378 1.00000000 0.01821807 -0.05911042
ht 0.02960305 0.21198453 0.01821807 1.00000000 0.34239424
dht 0.01128574 0.10433742 -0.05911042 0.34239424 1.00000000
gestation wt age ht dht
gestation 1.00000000 0.40785397 -0.056269348 0.063630776 0.00572266
wt 0.40785397 1.00000000 0.033648749 0.198518671 0.10771129
age -0.05626935 0.03364875 1.000000000 0.003050816 -0.05177558
ht 0.06363078 0.19851867 0.003050816 1.000000000 0.34086270
dht 0.00572266 0.10771129 -0.051775577 0.340862699 1.00000000
If we want to get fancy, we can turn this table into a colorized plot. The ggcorrplot package includes a function to do all the work for us. Here are some examples.
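A sketch, assuming the correlation matrix from above is saved as cor_matrix:

library(ggcorrplot)
cor_matrix <- Gestation %>% select(gestation, wt, age, ht, dht) %>% cor(use = "pairwise.complete.obs")
ggcorrplot(cor_matrix)
ggcorrplot(cor_matrix, type = "lower", lab = TRUE)   # lower triangle with labels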
Logical operators are often useful for creating proportions.
* == (note double equal for checking if things are equal)
* !=
* > (and <)
* >= (and <=)
prop_TRUE
0.3333333
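As a hedged illustration (not necessarily the code behind the output above), wrapping a logical expression in prop() computes the proportion of TRUE values:

prop( ~ (Species == "setosa"), data = iris)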
We can compute proportions within groups:
prop_TRUE.alcohol prop_TRUE.cocaine prop_TRUE.heroin
0.4180791 0.6118421 0.6209677
prop_TRUE.alcohol prop_TRUE.cocaine prop_TRUE.heroin
0.8587571 0.7302632 0.5967742
Often it is useful to summarise a categorical variable with a summary table listing the counts or proportions in each category.
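A sketch using the HELPrct data (counts, then proportions):

tally( ~ homeless, data = HELPrct)
tally( ~ homeless, data = HELPrct, format = "prop")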
homeless
homeless housed
209 244
homeless
homeless housed
0.4613687 0.5386313
Cross tables are summary tables for two or more variables at once. (Cross tables with 2 variables are often called 2-way tables.)
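For example:

tally(homeless ~ substance, data = HELPrct)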
substance
homeless alcohol cocaine heroin
homeless 103 59 47
housed 74 93 77
We can also express the values as proportions or percents rather than counts.
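For example:

tally(homeless ~ substance, data = HELPrct, format = "prop")
tally(homeless ~ substance, data = HELPrct, format = "percent")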
substance
homeless alcohol cocaine heroin
homeless 0.5819209 0.3881579 0.3790323
housed 0.4180791 0.6118421 0.6209677
substance
homeless alcohol cocaine heroin
homeless 58.19209 38.81579 37.90323
housed 41.80791 61.18421 62.09677
Cross tables are summary tables for two or more variables at once. We can also add in the category total counts.
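A hedged sketch of calls that produce tables like the ones below (margins = TRUE adds the totals):

tally(homeless ~ substance, data = HELPrct, margins = TRUE)
tally( ~ homeless + substance, data = HELPrct, margins = TRUE)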
substance
homeless alcohol cocaine heroin
homeless 103 59 47
housed 74 93 77
Total 177 152 124
substance
homeless alcohol cocaine heroin
homeless 103 59 47
housed 74 93 77
Total 177 152 124
substance
homeless alcohol cocaine heroin Total
homeless 103 59 47 209
housed 74 93 77 244
Total 177 152 124 453
The function tally() flexibly creates many simple tables and cross tables, but when included in R Markdown files, they are not rendered as the best-looking tables.
A simple solution is to use a pipe (%>%) to “send” the results from tally() to the function pander() (from the pander package) for formatting. Compare:
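For example:

library(pander)
tally(homeless ~ substance, data = HELPrct, margins = TRUE)
tally(homeless ~ substance, data = HELPrct, margins = TRUE) %>% pander()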
substance
homeless alcohol cocaine heroin
homeless 103 59 47
housed 74 93 77
Total 177 152 124
alcohol | cocaine | heroin | |
---|---|---|---|
homeless | 103 | 59 | 47 |
housed | 74 | 93 | 77 |
Total | 177 | 152 | 124 |
Data visualizations are created using a palette of attributes or properties such as position (the x- and y-axes), color, shape, and faceting.
The key to designing a visualization is deciding which of these attributes will be used to represent which variables in our data set.
Since every mark on our visualization must go somewhere, it is often good to start out by deciding what will be mapped to the x-axis and what will be mapped to the y-axis. Then we can bring in additional attributes to represent additional variables as needed.
We can take that approach as an outline for an anatomy of plots.

Q v Q:
* scatterplots (possibly jittered) are the go-to plot for looking at two quantitative variables.

Q v C:
* scatterplots – don't work well except for small data sets
* jittered scatterplots – see sina plots below for another option
* boxplots – simple but crude plots based on a five-number summary
* violin plots – mirrored density plots that show more of the shape
* sina plots – combine well with violin plots, showing a dot for each observation

C v C:
* stacked bar plots
* dodged bar plots
* mosaic plots
For some plots one axis (typically y) is calculated based on the values of the variable on the other axis.
Q:
* histogram (or density histogram)
* density plot (filled)
* density plot (open)
* frequency polygon
* dot plot – advantage: there is a dot representing each observation; disadvantage: it can be tricky to get all the sizing just right to make these plots look good.

C (counts or proportions):
* bar graphs
Each of these plots can be extended to show additional variables using color, shape, faceting, etc.
Scatter plots can be made with categorical variables as well as quantitative variables, but often there is a lot of overplotting, making it difficult to interpret.
This can often be improved by moving the dots left and right a bit (with no loss of information, since we can still tell what level each dot belongs to).
library(palmerpenguins)
gf_jitter(bill_length_mm ~ species, data = penguins, height = 0, width = 0.3)
gf_sina() moves things randomly left and right, but moves them farther when there is more data nearby. This helps show the shape of the distribution.
A smooth version of the outline is called a violin plot.
library(palmerpenguins)
gf_violin (bill_length_mm ~ species, data = penguins) %>%
gf_sina(bill_length_mm ~ species, data = penguins)
We can increase or decrease adjust from 1 to get a smoother or wigglier version of these plots. This is similar to adjusting the binwidth for a histogram.
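A sketch (larger adjust gives a smoother plot, smaller gives a wigglier one):

gf_violin(bill_length_mm ~ species, data = penguins, adjust = 2)
gf_violin(bill_length_mm ~ species, data = penguins, adjust = 0.5)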
You typically need to experiment a bit with the number of bins (or the binwidth) to get a good view of the data.
You typically need to experiment a bit with binwidth to get a good view of the data.
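A sketch of two ways to control the bins:

gf_histogram( ~ bill_length_mm, data = penguins, bins = 15)
gf_histogram( ~ bill_length_mm, data = penguins, binwidth = 2)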
A density histogram is a histogram that has been rescaled so that the total area is 1.
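A sketch using gf_dhistogram():

gf_dhistogram( ~ bill_length_mm, data = penguins, binwidth = 2)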
Two ways:
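A hedged guess at the two ways (assuming they refer to the filled and open density plots from the list above):

gf_density( ~ bill_length_mm, data = penguins)   # filled
gf_dens( ~ bill_length_mm, data = penguins)      # open (line only)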
Dot plots are like histograms that stack up circles instead of rectangles. This makes it easier to see a representation of each observation, but it can take some fiddling to get the dot sizing just right for your plot. Also, the y-axis scale is not labeled very well for these plots.
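A sketch (the binwidth value is an assumption you will likely need to tune):

gf_dotplot( ~ bill_length_mm, data = penguins, binwidth = 0.5)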
Boxplots require two variables in ggformula.
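A sketch:

gf_boxplot(bill_length_mm ~ species, data = penguins)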
Facets can be added to any of the plots. These examples use histograms.
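A sketch of two equivalent ways to facet:

gf_histogram( ~ bill_length_mm | species, data = penguins)
gf_histogram( ~ bill_length_mm, data = penguins) %>% gf_facet_wrap( ~ species)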
Layers can be stacked on top of each other to build up interesting plots.
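A sketch layering a density curve over a density histogram:

gf_dhistogram( ~ bill_length_mm, data = penguins, binwidth = 2) %>%
  gf_dens( ~ bill_length_mm, data = penguins, color = "red")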
Map the fill property to create stacked bars. Specify fill = ~ followed by the name of your categorical variable:
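A sketch using the HELPrct data:

gf_bar( ~ substance, fill = ~ homeless, data = HELPrct)
gf_bar( ~ substance, fill = ~ homeless, data = HELPrct, position = "dodge")  # dodged instead of stacked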
What if you have pre-summarized data (instead of one row per case, you have a variable that gives the categories and a variable that tells how many observations are in each category)? For example, like the dataset below (which is called NHANESsummary):
PregnantNow | Education | Number_of_Cases |
---|---|---|
Yes | 9 - 11th Grade | 1 |
Yes | High School | 1 |
Yes | Some College | 10 |
Yes | College Grad | 7 |
No | 8th Grade | 16 |
No | 9 - 11th Grade | 49 |
No | High School | 101 |
No | Some College | 240 |
No | College Grad | 214 |
Unknown | High School | 1 |
Unknown | Some College | 8 |
Unknown | College Grad | 5 |
gf_col() can make a bar graph from this summary data.
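A sketch (NHANESsummary is the summary data frame shown above):

gf_col(Number_of_Cases ~ Education, fill = ~ PregnantNow, data = NHANESsummary)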
Two ways:
In ggformula, a pie chart is just a bar chart that uses “polar coordinates”. They look much better if we turn off all the axis labeling with theme_void().
Remember, only use a pie chart if it visually delivers the message you want to send. A pie chart usually works best when there are only a few pie pieces (or one or two big ones and the rest very small). If the pie pieces are similar in size, it will be easier to tell which is bigger with a bar chart rather than a pie. One thing a pie chart can be good for is comparing a proportion to 0.5, since a half-circle has a straight line, which our eyes can detect well.
Regression is an old term that comes from an observation of an effect known as “regression toward the mean”. A more modern term for regression is a linear model, which is much more descriptive of what it is.
In a Simple Linear Model (also called Simple Linear Regression), there is one explanatory variable (x) and one response variable (y). The simple linear model has the form
\[ y_i = b_0 + b_1 x_i + \mbox{residual}_i \] The intercept (\(b_0\)) and slope (\(b_1\)) are chosen so that the sum of the squares of the residuals (abbreviated SSR or SSE) is as small as possible. So another name for this line is the least squares regression line.
A linear model is not always appropriate. Various plots of the data and the model can help us determine whether a linear model is a good idea.
The linear model is fit by changing gf_point() to lm().
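A sketch using the KidsFeet data (the model object name is an assumption):

feet_model <- lm(width ~ length, data = KidsFeet)
feet_model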
Call:
lm(formula = width ~ length, data = KidsFeet)
Coefficients:
(Intercept) length
2.8623 0.2479
We can get all the fitted values using fitted():
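Continuing with the feet_model object created above (an assumed name):

fitted(feet_model)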
1 2 3 4 5 6 7 8
8.912201 9.160149 8.936996 9.110560 9.085765 9.234534 9.333713 8.565075
9 10 11 12 13 14 15 16
8.713843 8.540280 9.680840 9.011381 9.333713 9.556866 9.308918 8.738638
17 18 19 20 21 22 23 24
8.813022 8.986586 9.482481 9.184944 8.813022 8.912201 8.813022 8.936996
25 26 27 28 29 30 31 32
8.862612 9.581660 9.333713 9.184944 8.862612 8.788228 8.813022 8.441101
33 34 35 36 37 38 39
8.936996 8.713843 8.986586 8.540280 9.308918 8.217948 8.961791
To make predictions for explanatory variables of our own choosing, we can use makeFun().
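A sketch, continuing with feet_model; the lengths 22, 23, and 25.5 cm are chosen for illustration:

model_fun <- makeFun(feet_model)
model_fun(length = 22)
model_fun(length = 23)
model_fun(length = 25.5)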
1
8.317127
1
8.565075
1
9.184944
We can get the residuals for each observation using resid():
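Continuing the sketch:

resid(feet_model)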
1 2 3 4 5 6
-0.51220149 -0.36014925 0.76300373 0.68944030 -0.18576493 0.46546642
7 8 9 10 11 12
0.26628731 0.23492537 0.58615672 0.25972015 0.11916045 -0.11138060
13 14 15 16 17 18
-0.23371269 0.24313433 -0.00891791 -0.83863806 -0.11302239 -0.18658582
19 20 21 22 23 24
-0.48248134 0.31505597 0.38697761 -0.31220149 -0.51302239 0.06300373
25 26 27 28 29 30
-0.76261194 -0.18166045 0.16628731 0.31505597 0.03738806 0.51177239
31 32 33 34 35 36
0.48697761 0.15889925 -0.33699627 0.28615672 -0.38658582 -0.04027985
37 38 39
-0.30891791 -0.31794776 -0.16179104
There are several ways to produce different types of summary tables for a linear model.
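A hedged sketch of three common options (matching the three displays below):

summary(feet_model)        # full summary
msummary(feet_model)       # mosaic's more compact summary
broom::tidy(feet_model)    # coefficient table as a tibble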
Call:
lm(formula = width ~ length, data = KidsFeet)
Residuals:
Min 1Q Median 3Q Max
-0.83864 -0.31056 -0.00892 0.27622 0.76300
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8623 1.2081 2.369 0.0232 *
length 0.2480 0.0488 5.081 1.1e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3963 on 37 degrees of freedom
Multiple R-squared: 0.411, Adjusted R-squared: 0.3951
F-statistic: 25.82 on 1 and 37 DF, p-value: 1.097e-05
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.8623 1.2081 2.369 0.0232 *
length 0.2480 0.0488 5.081 1.1e-05 ***
Residual standard error: 0.3963 on 37 degrees of freedom
Multiple R-squared: 0.411, Adjusted R-squared: 0.3951
F-statistic: 25.82 on 1 and 37 DF, p-value: 1.097e-05
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.86 1.21 2.37 0.0232
2 length 0.248 0.0488 5.08 0.0000110
The Lady Tasting Tea experiment, in which a lady claims she can tell whether the milk or the tea was poured into the cup first, is a classic example.
Let’s suppose she gets 9 out of 10 correct (\(\hat{p} = 0.90\)). We can use rflip() to generate random guesses and see how often a random guesser gets at least 9 out of 10 correct.
set.seed(12345)
Lady_null <- do(1000) * rflip(10)
gf_histogram(~heads, data = Lady_null, binwidth = 1)
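A sketch of tallying how often a random guesser does at least as well:

tally( ~ (heads >= 9), data = Lady_null)
tally( ~ (prop >= 0.9), data = Lady_null)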
(heads >= 9)
TRUE FALSE
17 983
(prop >= 0.9)
TRUE FALSE
17 983
So our p-value is 17/1000 = 0.017. Since this is quite small, we reject the null hypothesis. We have evidence that she is doing better than just guessing.
Here’s an example based on the game Mitternachtsparty which uses a die with a ghost (Hugo) on it. Let’s test whether Hugo comes up more often than the 1-in-6 chance we would expect for a single face of a fair die.
Data: 16 Hugos in 50 tosses of the die.
To test other proportions, we can use an additional argument to rflip(), namely prob.
Flipping 50 coins [ Prob(Heads) = 0.166666666666667 ] ...
T T T T T T T T T T T H T T T T T T T H T T T T T H T T T T T T T T T T
H T T H T T H T H T T T T T
Number of Heads: 7 [Proportion Heads: 0.14]
set.seed(12345)
Hugo_null <- do(1000) * rflip(50, prob = 1/6)
gf_histogram(~heads, data = Hugo_null, binwidth = 1, fill = ~(heads >= 16))
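A sketch of computing the p-value from the simulation:

tally( ~ (heads >= 16), data = Hugo_null)
prop( ~ (heads >= 16), data = Hugo_null)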
(heads >= 16)
TRUE FALSE
2 998
(prop >= 16/50)
TRUE FALSE
2 998
prop_TRUE
0.002
Malaria <- read.csv('https://rpruim.github.io/s145/data/malaria.csv')
head(Malaria, 3) # see what the first few rows look like
group malaria
1 placebo yes
2 placebo yes
3 treatment yes
group
malaria placebo treatment
no 0 9
yes 6 5
group
malaria placebo treatment
no 0.0000000 0.6428571
yes 1.0000000 0.3571429
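The observed difference in proportions can be computed with diffprop(); a sketch:

diffprop(malaria ~ group, data = Malaria)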
diffprop
0.6428571
response Group min Q1 median Q3 max mean sd n missing
1 Taps Caffeine 245 246.5 248.0 250.00 252 248.3 2.213594 10 0
2 Taps No Caffeine 242 242.5 244.5 246.75 248 244.8 2.394438 10 0
For a larger data set, gf_violin(), gf_density(), or gf_histogram() might be useful.
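A sketch of the observed difference in means, saved as the test statistic (the name test_stat matches the code further below):

test_stat <- diffmean(Taps ~ Group, data = CaffeineTaps); test_stat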
diffmean
-3.5
Taps_null <- do(2000) * diffmean(Taps ~ shuffle(Group), data = CaffeineTaps)
gf_histogram( ~ diffmean, data = Taps_null) %>%
gf_vline(xintercept = ~test_stat)
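A hedged sketch of the p-value computation (the observed difference is negative, so we count randomizations at least that extreme in the same direction):

prop( ~ (diffmean <= test_stat), data = Taps_null)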
prop_TRUE
0.004
To create a confidence interval, we would like to know the shape, center, and standard deviation (the standard error) of the sampling distribution.
The bootstrap allows us to approximate the sampling distribution by resampling (with replacement) from our sample many times and computing our estimate for each resample.
The basic steps are the same each time.
* Create some graphical or numerical summaries.
* Look for anything unusual or unexpected (and figure out what's up before proceeding).
`estimate()` is a function like `mean()`, `median()`, `prop()`, `diffmean()`, `df_stats()`, ...,
that computes your sample statistic of interest -- your estimate.
glimpse(Bootstrap) # see what it looks like and what the variables are called.
gf_histogram( ~ statistic_of_interest, data = Bootstrap) # histogram of bootstrap distribution
The bootstrap distribution in `Bootstrap` can now be used to estimate standard error or to compute
confidence intervals. See the other tabs for specific examples.
A bootstrap distribution is the distribution of some estimate over many resamples from our data. Most often we will compute the bootstrap distribution using the following fill-in-the-blanks template:
# Compute estimate from the data
estimate(____, data = _____)
# Create a bootstrap distribution
Bootstrap <- do(____) * estimate(____, data = resample(_____))
estimate() should be replaced with a function that computes our estimate. Examples: mean(), sd(), prop(), df_stats(), etc.

To get a 95% percentile confidence interval, we take the central 95% of the bootstrap distribution using cdata():
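Continuing the fill-in-the-blanks template (statistic_of_interest is whatever column do() created):

cdata( ~ statistic_of_interest, data = Bootstrap, p = 0.95)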
We can estimate the standard error by taking the standard deviation of our bootstrap distribution. If the bootstrap distribution is approximately normal, then our interval has the form
\[ \mathrm{estimate} \pm 2 SE \]
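The “2” is really the 97.5th percentile of a standard normal distribution:

qnorm(0.975)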
[1] 1.959964
The most common thing to estimate for a quantitative variable is the mean, but we could estimate other things like the standard deviation, the median, or some other quantile.
library(mosaic)
library(NHANES)
Men50s <- NHANES %>% filter(Age >= 50, Age <= 59, Gender == "male")
estimate <- mean( ~Pulse, data = Men50s, na.rm = TRUE); estimate
[1] 69.85106
response mean_pulse
1 Pulse 69.85106
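A hedged sketch of the bootstrap and the percentile interval (Men_boot is an assumed name):

Men_boot <- do(1000) * mean( ~ Pulse, data = resample(Men50s), na.rm = TRUE)
gf_histogram( ~ mean, data = Men_boot)
cdata( ~ mean, data = Men_boot, p = 0.95)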
lower upper central.p
2.5% 68.97708 70.75522 0.95
Smoke100
No Yes
299 382
prop_No
0.4390602
prop_Yes
0.5609398
Smoke_boot <- do(1000) * prop( ~Smoke100, data = resample(Men50s), success = "Yes")
gf_histogram( ~prop_Yes, data = Smoke_boot)
lower upper central.p
2.5% 0.5256975 0.5991189 0.95
The SE interval is similar:
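A sketch, using the Smoke_boot object created above:

SE <- sd( ~ prop_Yes, data = Smoke_boot); SE
prop( ~ Smoke100, data = Men50s, success = "Yes") - 2 * SE
prop( ~ Smoke100, data = Men50s, success = "Yes") + 2 * SE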
[1] 0.01855384
prop_Yes
0.5238321
prop_Yes
0.5980475
Categorical data is easy to summarise. We can use rflip() to work with summarized data instead of raw data.
Smoke100
No Yes
299 382
[1] 681
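A hedged sketch: with only the summary counts (382 Yes out of 681), rflip() can generate the resamples (the object name is an assumption):

Smoke_boot2 <- do(1000) * rflip(681, prob = 382/681)
cdata( ~ prop, data = Smoke_boot2, p = 0.95)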
lower upper central.p
2.5% 0.5212922 0.5962188 0.95
[1] 0.01932551
[1] 0.4004092
[1] 0.4777112
How different are the mean pulse for men and for women in their 20s?
In our sample, the mean pulse for women is about 4.5 beats per minute faster.
NHANES20s <- NHANES %>% filter(Age >= 20, Age <= 29)
df_stats(Pulse ~ Gender, data = NHANES20s, mean)
response Gender mean
1 Pulse female 77.00307
2 Pulse male 72.56012
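A sketch of the observed difference in means:

diffmean(Pulse ~ Gender, data = NHANES20s, na.rm = TRUE)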
diffmean
-4.442946
Now let’s create a bootstrap distribution to see how precise that estimate is.
Pulse_boot <- do(1000) * diffmean(Pulse ~ Gender, data = resample(NHANES20s), na.rm = TRUE)
gf_histogram( ~ diffmean, data = Pulse_boot)
lower upper central.p
2.5% -5.659382 -3.246555 0.95
The SE interval is similar:
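A sketch (estimate and SE are assumed names):

SE <- sd( ~ diffmean, data = Pulse_boot); SE
estimate <- diffmean(Pulse ~ Gender, data = NHANES20s, na.rm = TRUE); estimate
estimate + c(-2, 2) * SE   # estimate plus or minus 2 SE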
[1] 0.6119071
diffmean
-4.442946
Since this interval is entirely below 0, we have evidence that the men’s mean pulse is slower than the women’s (but not by a lot since the interval doesn’t stretch very far from 0).
How different are the smoking rates for men and for women in their 20s?
In our sample, men are a bit more than 11% more likely to smoke than women are.
NHANES20s <- NHANES %>% filter(Age >= 20, Age <= 29)
tally(Smoke100 ~ Gender, data = NHANES20s, margins = TRUE, format = 'prop')
Gender
Smoke100 female male
No 0.6725404 0.5600000
Yes 0.3274596 0.4400000
Total 1.0000000 1.0000000
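A sketch of the observed difference in proportions:

diffprop(Smoke100 ~ Gender, data = NHANES20s)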
diffprop
-0.1125404
Now let’s create a bootstrap distribution to see how precise that estimate is.
Smoke_boot <- do(1000) * diffprop(Smoke100 ~ Gender, data = resample(NHANES20s))
gf_histogram( ~ diffprop, data = Smoke_boot)
lower upper central.p
2.5% -0.164021 -0.05981467 0.95
A standard error interval is similar:
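A sketch:

SE <- sd( ~ diffprop, data = Smoke_boot); SE
diffprop(Smoke100 ~ Gender, data = NHANES20s) - 2 * SE
diffprop(Smoke100 ~ Gender, data = NHANES20s) + 2 * SE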
[1] 0.02662732
diffprop
-0.165795
diffprop
-0.05928574
If we do our resampling as in the previous example, we are likely to get a different number of cases in each group in our resamples compared to the original data (and to each other).
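A sketch illustrating this (each resample gives different group sizes):

tally( ~ Gender, data = NHANES20s)
tally( ~ Gender, data = resample(NHANES20s))
tally( ~ Gender, data = resample(NHANES20s))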
Gender
female male
681 675
Gender
female male
652 704
Gender
female male
687 669
We might prefer to do our resampling separately for each of the two groups. There are at least two reasons why we might prefer to resample within groups:
* The study design may have fixed the number of cases in each group in advance, and our resamples should reflect that design.
* If one group is much smaller than the other, an ordinary resample could contain very few cases from that group.
Some people resample within groups as their primary resampling method. (Our text describes resampling for differences in means and proportions that way, for example.) So you could do it that way all the time. But it is important to resample within groups if either of the two conditions above applies to our situation.
We will illustrate resampling within groups using our comparison of smoking rates for men and women, even though in this study no attempt was made to predetermine how many men and women were in the study. All we need to do is add a groups argument to our resample() command to tell R what variable determines the groups.
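A sketch (now every resample has the same number of men and women as the original data):

tally( ~ Gender, data = resample(NHANES20s, groups = Gender))
tally( ~ Gender, data = resample(NHANES20s, groups = Gender))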
Gender
female male
681 675
Gender
female male
681 675
Gender
female male
681 675
Smoke_boot <- do(1000) * diffprop(Smoke100 ~ Gender, data = resample(NHANES20s, groups = Gender))
gf_histogram( ~ diffprop, data = Smoke_boot)
lower upper central.p
2.5% -0.1612916 -0.06231702 0.95
We can also resample within groups when looking at a difference in means.
Sometimes we have two measurements for each case in our data and we want to compare them. Pre- and post- measurements (before and after some intervention or some amount of time) are a common situation where this occurs. If we are interested in the amount of change, we can compute the difference or ratio between these two measurements to get a single variable measuring the change. At that point, we are back to having one quantitative variable and everything proceeds as before.
A 2004 paper by Gettelfinger and Cussler investigated whether people can swim faster or slower in syrup than in water. They timed 18 swimmers of various abilities in regular water and in “syrup” (water thickened with guar gum). They reported their data as a ratio of times in syrup over times in water (so a number greater than 1 means slower in syrup and a number less than 1 means faster in syrup).
Pairing each swimmer with himself/herself was designed to compensate for different swimming abilities. On average, swimmers were 1.2% slower in syrup.
relative.speed
1 1.08
2 0.94
3 1.07
4 1.04
5 1.02
6 1.01
response mean_rel.speed
1 relative.speed 1.011667
Let’s get a 95% confidence interval for that estimate.
Syrup_boot <-
do(1000) * df_stats(~relative.speed, data = resample(SyrupSwimming), mean_rel.speed = mean)
gf_histogram(~mean_rel.speed, data = Syrup_boot)
lower upper central.p
2.5% 0.9922222 1.031111 0.95
The SE interval is similar:
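A sketch:

SE <- sd( ~ mean_rel.speed, data = Syrup_boot); SE
estimate <- mean( ~ relative.speed, data = SyrupSwimming)
estimate - 2 * SE
estimate + 2 * SE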
[1] 0.009617349
[1] 0.992432
[1] 1.030901
Our interval suggests that it is plausible that the syrup times are between 1% faster and 3% slower than the water times. If there is a difference, it doesn’t seem to be very large, and it isn’t certain which is faster, swimming in syrup or in water.
This example goes back to an old conversation between physicists Isaac Newton and Christiaan Huygens about how the viscosity of a liquid would affect an object’s speed through the liquid. The paper by Gettelfinger and Cussler sparked renewed interest in this topic.
If your raw data has the two measurements not yet combined, the first step is to create a new variable. We can do that with the mutate() function.
The ucla_textbooks_f18 data set has prices for textbooks at the campus bookstore and at Amazon. Let’s see how different the prices are on average. Since different books cost different amounts, we want to compare each book’s price in the bookstore to that book’s price on Amazon. That’s a paired design.
We could use subtraction for absolute differences or division for percentage differences. Which makes more sense depends a bit on what you believe about pricing: Do you expect roughly the same difference for books in all price ranges, or do you expect more difference when books are more expensive? We can also use the data to investigate which of these seems to be the case.
Rows: 201
Columns: 20
$ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018…
$ term <fct> Fall, Fall, Fall, Fall, Fall, Fall, Fall, Fall, Fall…
$ subject <fct> American Indian Studies, Anthropology, Art, Arts and…
$ subject_abbr <fct> AM IND, ANTHRO, NA, ART&ARC, NA, ASTR, BMD RES, CHEM…
$ course <fct> "Introduction to American Indian Studies", "Archaeol…
$ course_num <fct> M10, 2, 11D, 10, M60W, 4, 5HA, 14A, 14B, M5A, 10, 1E…
$ course_numeric <int> 10, 2, 11, 10, 60, 4, 5, 14, 14, 5, 10, 1, 2, 9, 10,…
$ seminar <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ ind_study <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ apprenticeship <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ internship <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ honors_contracts <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ laboratory <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ special_topic <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ textbook_isbn <chr> "9781138477858", "9780307741806", NA, "9780979757549…
$ bookstore_new <dbl> 47.97, 14.26, NA, 13.50, 49.26, 119.97, NA, 188.10, …
$ bookstore_used <dbl> 44.97, 10.96, NA, 11.00, 43.26, 67.00, NA, 149.00, 1…
$ amazon_new <dbl> 47.45, 13.55, NA, 12.53, 54.95, 124.80, NA, NA, NA, …
$ amazon_used <dbl> 51.20, 7.10, NA, NA, 24.83, 47.75, NA, NA, 90.00, NA…
$ notes <fct> , , NA, , NA, NA, NA, , New version only from resell…
It does look like the price swings are larger for more expensive books, so let’s use a ratio. (Note also that 133 books are missing at least one price. We will remove those below.)
Books <-
ucla_textbooks_f18 %>%
mutate(price_ratio = bookstore_new / amazon_new) %>%
filter(!is.na(price_ratio))
df_stats(~price_ratio, data = Books, mean_ratio = mean)
response mean_ratio
1 price_ratio 1.065878
Looks like bookstore prices are 6% higher on average. Let’s create a confidence interval for that ratio.
Books_boot <-
do(1000) * df_stats(~price_ratio, data = resample(Books), mean_ratio = mean)
gf_histogram(~ mean_ratio, data = Books_boot)
lower upper central.p
2.5% 1.02792 1.111962 0.95
Our 95% percentile confidence interval suggests that the average bookstore price is between 2.8% and 11.2% higher than the Amazon price for the same book.
Because the bootstrap distribution is a bit skewed, we won’t compute the SE confidence interval for this example.
Using df_stats():
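A sketch, assuming the BodyTemp50 data (from the Lock5Data package):

library(Lock5Data)
df_stats( ~ BodyTemp, data = BodyTemp50, mean)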
response mean
1 BodyTemp 98.26
# mean body temp in 1 bootstrap sample
# * use resample(BodyTemp50) instead of BodyTemp50
df_stats( ~ BodyTemp, data = resample(BodyTemp50), mean)
response mean
1 BodyTemp 98.222
# do that 3 times (so we can see what the result looks like)
do(3) * df_stats( ~ BodyTemp, data = resample(BodyTemp50), mean)
response mean .row .index
1 BodyTemp 98.170 1 1
2 BodyTemp 98.344 1 2
3 BodyTemp 98.306 1 3
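A sketch of the full bootstrap distribution (the names BodyTemp.boot and mean_BodyTemp match the code further below):

BodyTemp.boot <- do(1000) * df_stats( ~ BodyTemp, data = resample(BodyTemp50), mean_BodyTemp = mean)
head(BodyTemp.boot)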
response mean_BodyTemp .row .index
1 BodyTemp 98.196 1 1
2 BodyTemp 98.158 1 2
3 BodyTemp 98.114 1 3
4 BodyTemp 98.394 1 4
5 BodyTemp 98.266 1 5
6 BodyTemp 98.304 1 6
This method is only accurate if the bootstrap distribution is approximately normal and our confidence level is 95%.
# could use df_stats() here, but is easier to do arithmetic this way
SE <- sd( ~ mean_BodyTemp, data = BodyTemp.boot); SE
[1] 0.1065861
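A sketch of the interval (98.26 is the sample mean computed above):

98.26 + c(-2, 2) * SE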
[1] 98.04683 98.47317
This method will work better if the bootstrap distribution is skewed or has heavier or lighter tails than a normal distribution. It is also easier to adjust this method for use with other confidence levels.
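A sketch at the 95% and 99% levels:

cdata( ~ mean_BodyTemp, data = BodyTemp.boot, p = 0.95)
cdata( ~ mean_BodyTemp, data = BodyTemp.boot, p = 0.99)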
lower upper central.p
2.5% 98.05195 98.47 0.95
lower upper central.p
0.5% 98.00595 98.52005 0.99
Using mean():
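A sketch:

mean( ~ BodyTemp, data = BodyTemp50)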
[1] 98.26
# mean body temp in 1 bootstrap sample
# * use resample(BodyTemp50) instead of BodyTemp50
mean( ~ BodyTemp, data = resample(BodyTemp50))
[1] 98.432
# do that 3 times (so we can see what the result looks like)
do(3) * mean( ~ BodyTemp, data = resample(BodyTemp50))
mean
1 98.342
2 98.116
3 98.296
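A sketch of the full bootstrap distribution (the object name is an assumption):

BodyTemp.boot <- do(1000) * mean( ~ BodyTemp, data = resample(BodyTemp50))
head(BodyTemp.boot)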
mean
1 98.044
2 98.204
3 98.376
4 98.290
5 98.184
6 98.200
This method will work better if the bootstrap distribution is skewed or has heavier or lighter tails than a normal distribution. It is also easier to adjust this method for use with other confidence levels.
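A sketch at the 95% and 99% levels:

cdata( ~ mean, data = BodyTemp.boot, p = 0.95)
cdata( ~ mean, data = BodyTemp.boot, p = 0.99)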
lower upper central.p
2.5% 98.05595 98.48405 0.95
lower upper central.p
0.5% 98.006 98.57203 0.99
What percentage of Americans have BMI at least 30?
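A hedged sketch using the NHANES data (which has missing BMI values):

prop( ~ (BMI >= 30), data = NHANES, na.rm = TRUE)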
prop_TRUE
0.2871082
prop_TRUE
1 0.2909921
2 0.2877496
3 0.2884436
prop_TRUE
1 0.2909921
2 0.2877496
3 0.2884436
4 0.2898641
5 0.2822505
6 0.2888173
We can use this method if the bootstrap distribution is pretty unimodal, symmetric, and bell-shaped.
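A hedged sketch (BMI_boot is an assumed name for the bootstrap distribution):

BMI_boot <- do(1000) * prop( ~ (BMI >= 30), data = resample(NHANES), na.rm = TRUE)
estimate <- prop( ~ (BMI >= 30), data = NHANES, na.rm = TRUE); estimate
SE <- sd( ~ prop_TRUE, data = BMI_boot); SE
estimate + c(-2, 2) * SE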
prop_TRUE
0.2871082
[1] 0.00464423
[1] 0.2778197 0.2963966
To compute a probability from an x-value, we can use pnorm() (or, for output including a figure as well as the numeric value, xpnorm()). These functions report the area to the left of the specified x-value under a Normal curve with a certain mean and sd:
For example, find the probability of getting a value less than 0 from a normal distribution with mean 3 and standard deviation 5:
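In R:

pnorm(0, mean = 3, sd = 5)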
[1] 0.2742531
If we add the input lower.tail = FALSE, then the area above the stated x-value will be computed instead:
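For example:

pnorm(0, mean = 3, sd = 5, lower.tail = FALSE)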
[1] 0.7257469
We can use the function qnorm() (or xqnorm()) to find x-values corresponding to probabilities.
For example, if we want to find the lower bound of a 95% CI, knowing that the relevant sampling distribution is well modelled by a normal distribution centered at 14.5 with a standard error of 2.7:
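In R:

qnorm(0.025, mean = 14.5, sd = 2.7)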
[1] 9.208097
The upper bound of the same CI, two ways:
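A hedged sketch of two ways that give the same value:

qnorm(0.975, mean = 14.5, sd = 2.7)
14.5 + qnorm(0.975) * 2.7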
[1] 19.7919
[1] 19.7919
In a scenario where we are considering a random variable \(X\) that has a binomial distribution with number of trials \(n\) and probability of success \(p\), we can compute the probability density at \(k\) (the chance of getting exactly \(k\) successes) as:
\[ P(X=k) = {n \choose k} p^k (1-p)^{(n-k)}\] Or, in R, with dbinom(k, size = n, prob = p).
For example, imagine an airline flies planes with 120 seats, but they sell 125 tickets for each flight, overbooking them all! Each passenger has a 0.88 chance of showing up and flying (and a 0.12 chance of missing the flight or being a no-show). The situation can be modelled with a binomial distribution (the passengers can be considered independent trials). We want to find the probability that 124 people show up for a flight. Here, \(n=125\) and \(p=0.88\), so we want to find
\[ P(X=124) = {125 \choose 124} 0.88^{124} (1-0.88)^{(125-124)}\]
In R,
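A sketch of the call:

dbinom(124, size = 125, prob = 0.88)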
[1] 1.958586e-06
If we wanted to figure out the probability that more than 120 people would show up, we could find the (mutually exclusive) probabilities that \(k\) is 121, 122, 123, 124, and 125 and add them up. Or… (see the “Probabilities” tab for a shortcut…)
In a scenario where we are considering a random variable \(X\) that has a binomial distribution, with number of trials \(n\) and probability of success \(p\), we can compute the probability density at \(k\) (the chance of getting exactly \(k\) successes) as:
\[ P(X=k) = {n \choose k} p^k (1-p)^{(n-k)}\]
It follows that the probability of X being \(k\) or less is:
\[ P(X \leq K) = \sum_{k=0}^K {n \choose k} p^k (1-p)^{(n-k)}\]
This is the discrete version of finding “the area under the distribution curve” for \(X \leq k\). We can use the R function pbinom() to compute it:
For example, imagine an airline flies planes with 120 seats, but they sell 125 tickets for each flight, over-booking them all! Each passenger has a 0.88 chance of showing up and flying (and a 0.12 chance of missing the flight or being a no-show). The situation can be modelled with a binomial distribution (the passengers can be considered independent trials). We want to find the probability that more than 120 show up for a flight. Here, \(n=125\) and \(p=0.88\), and we need to compute
\[ 1 - P(X \leq 120) = 1 - \sum_{k=0}^{120} {n \choose k} p^k (1-p)^{(n-k)}\]
(The “1-” is because we are computing the probability of getting no more than 120 passengers on the 120-seat flight, and we were asked to compute the probability of getting more than 120 passengers.)
In R, we would compute:
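A sketch of the call:

1 - pbinom(120, size = 125, prob = 0.88)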
[1] 0.000496259
Or alternatively, we could find \(P(X > K)\) instead of \(P(X \leq K)\) by adding the input lower.tail = FALSE, and find the same answer via:
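A sketch of the call:

pbinom(120, size = 125, prob = 0.88, lower.tail = FALSE)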
[1] 0.000496259