We will use the babynames
data to illustrate
dplyr
to do some data tranformationsHere’s what the first few rows of babynames
look like.
head(babynames, 5)
## # A tibble: 5 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
gf_line( prop ~ year, data = babynames %>% filter(name == "John"), color = ~ sex)
We might want to try some different names. Rather than copy and paste our code, let’s write a function that take the name we are interested in as input. We’ll add a title so we can tell which name we used.
baby_plot <- function(name_to_display = "John") {
gf_line(prop ~ year,
data = babynames %>% filter(name == name_to_display),
color = ~ sex) %>%
gf_labs(title = name_to_display)
}
baby_plot("Steve")
It would be nice to be able to do multiple names at once. Let’s adjust our function a bit to handle this. The main chages are
%in%
to test if a name is among a list of namesbaby_plot2 <- function(name_to_display = "John") {
gf_line(prop ~ year,
data = babynames %>% filter(name %in% name_to_display),
color = ~ sex) %>%
gf_facet_grid(name ~ ., scales = "free")
}
baby_plot2(c("Adrian", "Hillary", "Leslie"))
Here’s a variation on the theme.
baby_plot3 <- function(name_to_display = "John") {
gf_line(prop ~ year,
data = babynames %>% filter(name %in% name_to_display),
color = ~ name) %>%
gf_facet_grid(sex ~ ., scales = "free")
}
baby_plot3(c("Adrian", "Hillary", "Leslie"))
You can continue to modify plots after they are produced by a function
baby_plot3(c("Adrian", "Hillary", "Leslie")) %>%
gf_labs(title = "A plot with a title", caption = "Source: US Census Bureau")
Now let’s combine several related names and make a plot showing them as one group.
babynames %>%
filter(name %in% c("Robert", "Bob", "Rob", "Bobby", "Robbie", "Bert", "Bobbie", "Robby")) %>%
group_by(year) %>%
summarise(total_prop = sum(prop)) %>%
gf_line(total_prop ~ year)
Notice:
filter()
selects the names we are interested ingroup_by()
causes subsequent computation to be done within yearssummarise()
collapses each group (year) to a single rowgf_line()
if we prefer.babynames %>%
filter(name %in% c("Robert", "Bob", "Rob", "Bobby", "Robbie", "Bert", "Bobbie", "Robby")) %>%
gf_area(prop ~ year, fill = ~name, group = ~name,
alpha = 0.5, position = "stack") %>%
gf_facet_grid( sex ~ ., scales = "free")
Suppose we want to explore a likelihood function assuming sampling from a beta distribution.
Let’s start be generating some random data from a beta distribution. (You might do this as part of a simulation; otherwise you might have some actual data to use.)
set.seed(123) # so we get the same random data each time we build the file.
mydata <- rbeta(40, shape1 = 1, shape2 = 2)
gf_dhistogram( ~ mydata, binwidth = 0.1) %>%
gf_dist("beta", shape1 = 1, shape2 = 2, color = "red") %>%
gf_lims(x = c(0,1))
The likelihood function is the product of pdfs. Here’s a way to create it as a function.
mylik1 <- function(a, b, x) {
prod(dbeta(x, shape1 = a, shape2 = b))
}
This works as long as we only want to use one value of a
and of b
:
mylik1(a = 2, b = 3, x = mydata)
## [1] 117.5144
mylik1(a = 1, b = 2, x = mydata)
## [1] 4620.433
But we want to apply this to many values of a
and b
, and our function needs a little bit of help to do that properly.
# this is NOT doing what we want
mylik1(a = c(2,1), b = c(3,2), x = mydata)
## [1] 274.152
The problem is that our function is not vectorized properly. We can use Vectorize()
to tell it how we want vectorization done. Now we get the equivalent of running mylik1()
twice, once with each combination of a
and b
.
mylik <- Vectorize(mylik1, c("a", "b"))
mylik(a = c(2,1), b = c(3,2), x = mydata)
## [1] 117.5144 4620.4329
Grid <-
expand.grid(a = seq(0.7, 1.8, by = 0.01),
b = seq(1.7, 2.8, by = 0.01)) %>%
mutate(likelihood = mylik(a, b, mydata))
Grid %>%
arrange(-likelihood) %>%
top_n(3)
## Selecting by likelihood
## a b likelihood
## 1 1.26 2.50 8714.891
## 2 1.26 2.51 8712.490
## 3 1.26 2.49 8710.811
One way to visualize values on a grid is is with gf_tile()
and/or gf_contour()
.
gf_tile(likelihood ~ a + b, data = Grid) %>%
gf_contour(likelihood ~ a + b, color = "white", data = Grid) %>%
gf_point(b ~ a, color = "yellow",
data = Grid %>% arrange(-likelihood) %>% top_n(1))
## Selecting by likelihood
Sometimes it helps to see things in slow motion. Let’s define a new function where it is easier for us to the calculations by hand.
fun1 <- function(a, b, x) {
sum(a * b * x)
}
fun1(1, 2, x = 1:4)
## [1] 20
1 * 2 * 1 + 1 * 2 * 2 + 1 * 2 * 3 + 1 * 2 * 4
## [1] 20
fun1(2, 3, x = 1:4)
## [1] 60
2 * 3 * 1 + 2 * 3 * 2 + 2 * 3 * 3 + 2 * 3 * 4
## [1] 60
# a, b, and x are all used in sequence, recycling the shorter vectors
fun1(1:2, 2:3, x = 1:4)
## [1] 44
1 * 2 * 1 + 2 * 3 * 2 + 1 * 2 * 3 + 2 * 3 * 4
## [1] 44
Now let’s vectorize.
fun <- Vectorize(fun1, c("a", "b"))
# now each combo of a and b is used with all of x and we get multiple results
fun(1:2, 2:3, x = 1:4)
## [1] 20 60
1 * 2 * 1 + 1 * 2 * 2 + 1 * 2 * 3 + 1 * 2 * 4
## [1] 20
2 * 3 * 1 + 2 * 3 * 2 + 2 * 3 * 3 + 2 * 3 * 4
## [1] 60
Herea are three ways to use log scales. For a simple plot, they are usually roughly equivalent, but if the plot does some computation on the data, they can differ because the computation is done a different point in the process.
gf_point(log10(dist) ~ log10(speed), data = cars) %>%
gf_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
# here the smoothing happens first, then the log transformation
gf_point(dist ~ speed, data = cars) %>%
gf_smooth(se = FALSE) %>%
gf_refine(coord_trans(x = "log2", y = "log2"))
## `geom_smooth()` using method = 'loess'
# here the log transformation happens first, then the smoothing
gf_point(dist ~ speed, data = cars) %>%
gf_smooth(se = FALSE) %>%
gf_refine(
scale_x_continuous(trans = "log2"),
scale_y_continuous(trans = "log2")
)
## `geom_smooth()` using method = 'loess'