There are several excellent graphics packages provided for R. The ggformula package currently builds on one of them, ggplot2, but provides a very different user interface for creating plots. The interface is based on formulas (much like the lattice interface) and the use of the chaining operator (%>%) to build more complex graphics from simpler components.
Experts, of course, will want to use the underlying ggplot2 package directly in order to maximize expressiveness and minimize the “bureaucracy” induced by intervening layers of software. The ggformula graphics are suited for
beginners, who don’t require the full range of ggplot2 capabilities, but want to get started quickly,
those familiar with lattice graphics, but wanting to be able to easily create multilayered plots,
those who prefer a formula interface, perhaps because it is familiar from use with functions like lm() or from use of the mosaic package for numerical summaries.
The basic template for creating a plot with ggformula is
gf_plottype(formula, data = mydata)
where
plottype describes the type of plot (layer) desired (points, lines, a histogram, etc.),
mydata is a data frame containing the variables used in the plot, and
formula describes how/where those variables are used.
For example, in a bivariate plot, formula will take the form y ~ x, where y is the name of a variable to be plotted on the y-axis and x is the name of a variable to be plotted on the x-axis. (It is also possible to use expressions that can be evaluated using variables in the data frame.)
Here is a simple example:
library(ggformula)
gf_point(mpg ~ hp, data = mtcars)
The “kind of graphic” is specified by the name of the graphics function. All of the ggformula data graphics functions have names starting with gf_, which is intended to remind the user that they are formula-based interfaces to ggplot2: g for ggplot2 and f for “formula.” Commonly used functions include
gf_point() for scatter plots
gf_line() for line plots (connecting dots in a scatter plot)
gf_density() or gf_dens() or gf_histogram() or gf_freqpoly() to display distributions of a quantitative variable
gf_boxplot() or gf_violin() for comparing distributions side-by-side
gf_counts() for bar-graph style depictions of counts
gf_bar() for more general bar-graph style graphics
The function names generally match the corresponding function name from ggplot2, although gf_counts() is a simplified special case, and gf_dens() is an alternative to gf_density() that displays the density plot slightly differently than the default in ggplot2.
Each of the gf_ functions can create the coordinate axes and fill them in, all in one operation. (In ggplot2 nomenclature, gf_ functions create a frame and add a geom layer, all in one operation.) This is what happens for the first gf_ function in a chain. For subsequent gf_ functions, new layers are added, each one “on top of” the previous layers.
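To see the two steps separately, here is a minimal sketch in plain ggplot2. It assumes ggplot2 is installed, and the object names frame and p are only illustrative. It builds the empty frame first and then adds one point layer, which is roughly what the first gf_ function in a chain does in a single call.

```r
library(ggplot2)

# Step 1: the frame, i.e. data plus position mappings, with no marks drawn yet
frame <- ggplot(mtcars, aes(x = hp, y = mpg))

# Step 2: add a geom layer that draws one point per row of the data
p <- frame + geom_point()

length(frame$layers)  # number of layers in the bare frame
length(p$layers)      # number of layers after adding the points
```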
Each of the marks in the plot is a glyph. Every glyph has graphical attributes (called aesthetics in ggplot2) that tell where and how to draw the glyph. In the above plot, the obvious attributes are x- and y-position:
We’ve told R to put mpg along the y-axis and hp along the x-axis, as is clear from the plot. But each point also has other attributes, including color, shape, size, stroke, fill, and alpha (transparency). We didn’t specify those in our example, so gf_point() uses default values for them: in this case, smallish black filled-in circles.
In the gf_ functions, you specify the non-position graphical attributes using an extension of the basic formula. Attributes can be set to a constant value (e.g., set the color to “blue”; set the size to 2) or they can be mapped to a variable in the data or some expression involving the variables (e.g., map the color to sex, so sex determines the color groupings).
Attributes can be specified in 3 ways:
attribute:value added to the formula [auto chooses set or map]
attribute::value added to the formula [map]
attribute = value as a function argument [set]
where attribute is one of color, shape, etc. and value is either a constant (e.g. "red" or 0.5, as appropriate), a variable (e.g. cyl), or some more general expression that can be computed using the variables in data (although often it is better to create a new variable in the data and to use that variable instead of an on-the-fly calculation within the plot). When using the : separator, mapping will be used if value is the name of a variable in data, otherwise setting will be used. The other two forms can be used to make things explicit when this auto-detection method doesn’t yield the desired results.
In the following plot, for instance, we use cyl to determine the color and carb to determine the size of each dot. Color and size are mapped to cyl and carb. A legend is provided to show us how the mapping is being done. (Later, we can use scales to control precisely how the mapping is done: which colors and sizes are used to represent which values of cyl and carb.)
We also set the transparency to 50%. This gives the same value of alpha to all glyphs in this layer.
gf_point(mpg ~ hp + color:cyl + size:carb + alpha:0.50, data = mtcars)
Here is the same plot where mapping and setting are indicated explicitly.
# set alpha using a function argument instead of in the formula
gf_point(mpg ~ hp + color::cyl + size::carb, alpha = 0.50, data = mtcars)
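For comparison, the distinction between mapping and setting can be sketched in plain ggplot2, where mapped attributes go inside aes() and set attributes go outside it. This is an illustrative equivalent that assumes ggplot2 is installed; the names p and layer are not part of ggformula.

```r
library(ggplot2)

p <- ggplot(mtcars) +
  geom_point(
    aes(x = hp, y = mpg, color = cyl, size = carb),  # mapped: vary with the data
    alpha = 0.5                                      # set: one value for every glyph
  )

# Inspect the single layer to see which attributes ended up where
layer <- p$layers[[1]]
names(layer$mapping)  # aesthetics mapped to variables
layer$aes_params      # aesthetics set to constants
```

Only the mapped attributes (color and size) produce legends, because only they carry information about the data.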
ggformula allows for on-the-fly calculations of attributes, although the default labeling of the plot is often better if we create a new variable in our data frame. In the examples below, since there are only three values for cyl, it is easier to read the graph if we tell R to treat cyl as a categorical variable by converting it to a factor (or to a string). Except for the labeling of the legend, these two plots are the same.
gf_point(mpg ~ hp + color::factor(cyl) + size:carb + alpha:0.75, data = mtcars)
library(dplyr)  # provides mutate()
gf_point(mpg ~ hp + color:cylinders + size:carb + alpha:0.75,
         data = mtcars %>% mutate(cylinders = factor(cyl)))
For some plots, we only have to specify the x-position because the y-position is calculated from the x-values. Histograms, density plots, and frequency polygons are examples. To illustrate, we’ll use density plots, but the same ideas apply to gf_histogram() and gf_freqpoly() as well. Note that in the one-variable density graphics, the variable whose density is to be calculated goes to the right of the tilde, in the position reserved for the x-axis variable.
data(Runners, package = "statisticalModeling")
Runners <- Runners %>% filter( ! is.na(net))
gf_density( ~ net, data = Runners)
gf_density( ~ net + fill:sex + alpha:0.5, data = Runners)
# gf_dens() is similar, but there is no line at bottom/sides, and it is not "fillable"
gf_dens( ~ net + color:sex + alpha:0.7, data = Runners)
Several of the plotting functions include additional arguments that do not modify attributes of individual glyphs but control some other aspect of the plot. For density plots, adjust can be used to increase or decrease the amount of smoothing.
# less smoothing
gf_dens( ~ net + color:sex + alpha:0.7, data = Runners, adjust = 0.25)
# more smoothing
gf_dens( ~ net + color:sex + alpha:0.7, data = Runners, adjust = 4)
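The effect of adjust can also be seen numerically with base R’s density() function, where adjust multiplies the bandwidth that would otherwise be chosen. This sketch uses mtcars rather than Runners so that it runs without extra packages; it illustrates the idea and is not a description of ggformula internals.

```r
# Default kernel density estimate of mpg
d_default <- density(mtcars$mpg)

# adjust rescales the default bandwidth
d_less <- density(mtcars$mpg, adjust = 0.25)  # smaller bandwidth, wigglier curve
d_more <- density(mtcars$mpg, adjust = 4)     # larger bandwidth, smoother curve

# Compare the bandwidths actually used
c(default = d_default$bw, less = d_less$bw, more = d_more$bw)
```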
When the fill or color or group aesthetics are mapped to a variable, the default behavior is to overlay the group-wise densities, each drawn from the same baseline. Other behavior is also available by using position in the formula. Using the value "stack" causes the densities to be stacked one on top of another, so that the overall height of the stack is the density across all groups. The value "fill" produces a conditional probability graphic.
gf_density( ~ net + fill:sex + color:NA + position:"stack", data = Runners)
gf_density( ~ net + fill:sex + color:NA + position:"fill", data = Runners, adjust = 2)
Similar commands can be constructed with gf_histogram() and gf_freqpoly(), but note that color, not fill, is the active aesthetic for frequency polygons and position:"fill" doesn’t work. It’s also rarely good to overlay histograms on top of one another; it is better to use a density plot or a frequency polygon for that application.
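The three position choices for densities can be compared side by side in plain ggplot2. This is a sketch under assumptions: mtcars stands in for Runners, the helper data frame cars and the plot names are made up for illustration, and ggplot2 must be installed.

```r
library(ggplot2)

# mtcars stands in for Runners; cyl becomes a factor so it can define groups
cars <- transform(mtcars, cylinders = factor(cyl))

base <- ggplot(cars, aes(x = mpg, fill = cylinders))

p_overlay <- base + geom_density(alpha = 0.5)         # curves overlaid on a common baseline
p_stack   <- base + geom_density(position = "stack")  # curves stacked on one another
p_fill    <- base + geom_density(position = "fill")   # stack rescaled to a constant height
```

Printing each of the three objects draws the corresponding plot.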
Sometimes you have so many points in a scatter plot that they obscure one another. The ggplot2 system provides two easy ways to deal with this: translucency and jittering.
Use alpha:0.5 to make the points semi-translucent. If there are many points overlapping at one point, use a much smaller value of alpha, say alpha:0.01. We’ve already seen this above.
Using gf_jitter() in place of gf_point() will move the plotted points to reduce overlap. Jitter and transparency can be used together as well.
gf_point(age ~ sex + alpha:0.05, data = Runners)
gf_jitter(age ~ sex + alpha:0.05, data = Runners)
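The idea behind jittering can be illustrated with base R’s jitter() function, which adds a small random offset to each value. This is only a sketch of the concept; the amount of 0.2 is an arbitrary choice for illustration and is not the default used by gf_jitter().

```r
set.seed(1)  # make the random offsets reproducible

x  <- rep(c(1, 2), each = 5)    # ten points sharing only two distinct positions
jx <- jitter(x, amount = 0.2)   # add uniform noise between -0.2 and 0.2

length(unique(x))   # distinct positions before jittering
length(unique(jx))  # distinct positions after jittering
max(abs(jx - x))    # largest displacement, bounded by amount
```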
Box and whisker plots show the distribution of a quantitative variable as a function of a categorical variable. The formula used in gf_boxplot() should have the quantitative variable to the left of the tilde. (To make horizontal boxplots using ggplot2 you have to make vertical boxplots and then flip the coordinates with coord_flip().)
gf_boxplot(net ~ sex + color:"red", data = Runners)
gf_boxplot(net ~ sex + color:start_position, data = Runners)
This plot may surprise you.
gf_boxplot(net ~ year, data = Runners)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
This plot is placing a single box and whisker plot at the mean value of year. The warning message suggests that we need to tell R how to form the groups when using a quantitative variable for x. It suggests using the group aesthetic, and sometimes this is just what we want.
gf_boxplot(net ~ year + group:year, data = Runners)
But often, it is better to convert a discrete quantitative variable used for grouping into a categorical variable (a factor or character vector). This can be done in several ways:
# add a new variable to the data
Runners$the_year <- as.character(Runners$year) # in base R
Runners <- Runners %>% mutate(the_year = as.character(year)) # in dplyr
gf_boxplot(net ~ the_year + color:sex, data = Runners)
# or do it on the fly
gf_boxplot(net ~ factor(year) + color:sex, data = Runners)
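The same two options, an explicit group aesthetic versus conversion to a factor, can be sketched in plain ggplot2. Here mtcars stands in for Runners, with cyl playing the role of year; the plot names are illustrative and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Option 1: keep cyl numeric and state the grouping explicitly
p_group <- ggplot(mtcars, aes(x = cyl, y = mpg, group = cyl)) +
  geom_boxplot()

# Option 2: convert to a factor so the x-axis itself becomes discrete
p_factor <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Each plot computes one box per distinct value of cyl
nrow(layer_data(p_group))
nrow(layer_data(p_factor))
```

With the group aesthetic the x-axis stays continuous, so boxes sit at their numeric positions; with a factor the boxes are evenly spaced and labeled by level.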
Two-dimensional plots of density also have both a left and right component to the formula.
gf_density_2d(net ~ age, data = Runners)
gf_hex(net ~ age, data = Runners)
The ggplot2 system offers two ways to connect points. gf_line() ignores the order of the points in the data, and draws the line going from left to right. gf_path() goes from point to point according to the order in the data. Both forms can use a color or group aesthetic as a flag to draw groupwise lines.
# create a categorical variable
mtcars <- mtcars %>% mutate(n_cylinders = as.character(cyl))
gf_line(mpg ~ hp, data = mtcars)
gf_path(mpg ~ hp, data = mtcars)
gf_line(mpg ~ hp + color:n_cylinders, data = mtcars)
The above are examples of bad plots. The viewer is unnecessarily distracted by the zigs and zags in the connecting lines. It would be better to use gf_point() here, but then you wouldn’t see how gf_line() and gf_path() work!
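The difference between the two ways of connecting points is easiest to see on a tiny data set stored out of x order. The sketch below uses the underlying ggplot2 geoms directly; the data frame zigzag is made up for illustration, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Four points deliberately stored out of x order
zigzag <- data.frame(x = c(3, 1, 4, 2), y = c(1, 2, 3, 4))

p_line <- ggplot(zigzag, aes(x = x, y = y)) + geom_line()  # connects by increasing x
p_path <- ggplot(zigzag, aes(x = x, y = y)) + geom_path()  # connects in row order

layer_data(p_line)$x  # x values in the order the line visits them
layer_data(p_path)$x  # x values in the order the path visits them
```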
Here’s a more useful example. We begin with a scatter plot showing the number of live births in the US for each day of 1978.
library(mosaicData)
gf_point(births ~ date, data = Births78)
Can this interesting pattern be explained by a weekday/weekend effect? We could use color to show the days of the week.
library(mosaicData)
gf_point(births ~ date + color:wday, data = Births78)
Converting to a line plot highlights the pattern and makes it easier to spot the unusual days.
gf_line(births ~ date + color:wday, data = Births78)
Some layers require more than two attributes. Typically this happens when the glyphs of a layer are complex objects that could have been made using multiple layers, but belong together conceptually. Examples include
gf_pointrange(): plots a dot flanked by a line segment.
gf_linerange(): like gf_pointrange() but without the point.
gf_errorbar(): vertical error bars.
gf_errorbarh(): horizontal error bars.
gf_ribbon(): a band between a line above and line below.
Often these are used to depict some sort of estimate of uncertainty in a measurement or a prediction, but they can be used to represent any data of the correct form. Here we will use gf_linerange() and gf_ribbon() to indicate the high and low temperatures in New York for the first few months of 2013.
require(weatherData)
## Loading required package: weatherData
Temps <- NewYork2013 %>%
mutate(date = lubridate::date(Time),
month = lubridate::month(Time)) %>%
filter(month <= 4) %>%
group_by(date) %>%
summarise(
hi = max(Temperature, na.rm = TRUE),
lo = min(Temperature, na.rm = TRUE)
)
gf_linerange(lo + hi + color:hi ~ date, data = Temps)
gf_ribbon(lo + hi ~ date, data = Temps, color = "navy", alpha = 0.3)
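If the weatherData package is not available, the same kinds of layers can be tried on a small hand-made data frame. In this sketch the temperature values are invented purely for illustration, the names temps, p_range, and p_band are hypothetical, and the plots are written in plain ggplot2, where the two extra attributes are called ymin and ymax.

```r
library(ggplot2)

# Hypothetical daily lows and highs; these numbers are made up for illustration
temps <- data.frame(
  date = as.Date("2013-01-01") + 0:4,
  lo   = c(28, 30, 25, 31, 33),
  hi   = c(39, 41, 36, 44, 45)
)

# One vertical segment per day, running from lo to hi, colored by the high
p_range <- ggplot(temps) +
  geom_linerange(aes(x = date, ymin = lo, ymax = hi, color = hi))

# A band between the lo and hi curves
p_band <- ggplot(temps) +
  geom_ribbon(aes(x = date, ymin = lo, ymax = hi), color = "navy", alpha = 0.3)
```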
Often it is useful to overlay multiple layers onto a single plot. This can be done by chaining them with %>%, the “then” operator from magrittr. The data argument can be omitted if the new layer uses the same data as the first layer in the chain.
The following plot illustrates how histograms and frequency polygons are related.
gf_histogram( ~ age, data = Runners, alpha = 0.3, fill = "navy") %>%
gf_freqpoly( ~ age)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A 2-d density plot can be augmented with a scatterplot.
gf_density_2d(net ~ age, data = Runners) %>%
gf_point(net ~ age + alpha:0.01)
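In plain ggplot2 the same layering is written with + rather than %>%, and later layers inherit the data and mappings of the frame. The sketch below mirrors the histogram and frequency polygon example on mtcars; the choice of 10 bins is arbitrary, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, alpha = 0.3, fill = "navy") +  # first layer
  geom_freqpoly(bins = 10)                                 # second layer, drawn on top

length(p$layers)  # both layers share one frame
```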
The ggplot2 system allows you to make subplots, called “facets”, based on the values of one or two categorical variables. This is done by chaining with gf_facet_grid() or gf_facet_wrap(). These functions use formulas to specify which variable(s) are to be used for faceting.
gf_density_2d(net ~ age, data = Runners) %>% gf_facet_grid( ~ sex)
# the dot here is a bit strange, but required to make a valid formula
gf_density_2d(net ~ age, data = Runners) %>% gf_facet_grid( sex ~ .)
gf_density_2d(net ~ age, data = Runners) %>% gf_facet_wrap( ~ the_year)
gf_density_2d(net ~ age, data = Runners) %>% gf_facet_grid(start_position ~ sex)
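The faceting formulas read the same way in the underlying ggplot2 functions: the left side of the tilde defines rows and the right side defines columns. This sketch uses mtcars variables (am, gear, cyl) in place of the Runners variables and assumes ggplot2 is installed; the plot names are illustrative.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

p_cols <- base + facet_grid(. ~ am)     # one column per value of am
p_rows <- base + facet_grid(gear ~ .)   # one row per value of gear
p_wrap <- base + facet_wrap(~ cyl)      # one panel per value of cyl, wrapped into a grid
p_both <- base + facet_grid(gear ~ am)  # rows by gear, columns by am

# Number of panels that contain data in the wrapped version
length(unique(layer_data(p_wrap)$PANEL))
```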
One use of facets is to create side-by-side plots of different data sources.
First we merge the data into a single data frame with a variable to indicate the original data source. Then we use facets to separate the data into the various sources for the display. The redundant use of the y and color attributes for temperature makes it easier to compare across facets.
require(weatherData)
Temps <- NewYork2013 %>% mutate(city = "NYC") %>%
bind_rows(Mumbai2013 %>% mutate(city = "Mumbai")) %>%
bind_rows(London2013 %>% mutate(city = "London")) %>%
mutate(date = lubridate::date(Time),
month = lubridate::month(Time)) %>%
group_by(city, date) %>%
summarise(
hi = max(Temperature, na.rm = TRUE),
lo = min(Temperature, na.rm = TRUE),
mid = (hi + lo)/2
)
gf_ribbon(lo + hi ~ date, data = Temps, alpha = 0.3) %>%
gf_facet_grid(city ~ .)
gf_linerange(lo + hi + color:mid ~ date, data = Temps) %>%
gf_facet_grid(city ~ .) %>%
gf_refine(scale_colour_gradientn(colors = rev(rainbow(5))))
There are a number of things we may want to do to the entire plot: adjusting labels, colors, fonts, etc. ggformula provides wrappers to the ggplot2 functions for this so that the chaining syntax can be used.
gf_histogram( ~ age, data = Runners, alpha = 0.2, fill = "navy",
binwidth = 5) %>%
gf_freqpoly( ~ age, binwidth = 5) %>%
gf_labs(x = "age (years)", title = "Age of runners") %>%
gf_lims(x = c(20, 80)) %>%
gf_theme(theme = theme_minimal)
## Warning: Removed 84 rows containing non-finite values (stat_bin).
## Warning: Removed 84 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_path).
gf_histogram( ~ age, data = Runners, alpha = 0.5, fill = "white",
binwidth = 5) %>%
gf_freqpoly( ~ age, color = "skyblue", binwidth = 5, size = 1.5) %>%
gf_labs(x = "age (years)", title = "Age of runners") %>%
gf_lims(x = c(20, 80)) %>%
gf_theme(theme = theme_dark)
## Warning: Removed 84 rows containing non-finite values (stat_bin).
## Warning: Removed 84 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_path).
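For reference, the ggplot2 functions being wrapped can be used directly with +. The sketch below applies labels, axis limits, and a theme to a histogram of mtcars; the label text and limits are chosen only for illustration, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, alpha = 0.2, fill = "navy") +
  labs(x = "miles per gallon", title = "Fuel economy") +  # counterpart of gf_labs()
  xlim(5, 40) +                                           # counterpart of gf_lims(x = ...)
  theme_minimal()                                         # counterpart of gf_theme()

p$labels$title  # the title stored on the plot object
```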
The gf_ functions generate a ggplot object as well as a character string containing the ggplot() command to generate the graphic. This can be useful when you want to use the gf_ functions to remind you about how ggplot() works, but you want to have the ggplot() commands directly in your document for future modification.
Use verbose = TRUE to see the string being generated.
gf_jitter(age ~ sex + alpha:0.05, data = Runners, verbose = TRUE)
## ggplot(data = Runners) +
## geom_jitter(aes(y = age, x = sex), alpha = 0.05)