Baby names

We will use the babynames data to illustrate

babynames data

Here’s what the first few rows of babynames look like.

head(babynames, 5)
## # A tibble: 5 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179

A plot

gf_line( prop ~ year, data = babynames %>% filter(name == "John"), color = ~ sex)

A function that creates a plot

We might want to try some different names. Rather than copy and paste our code, let’s write a function that take the name we are interested in as input. We’ll add a title so we can tell which name we used.

baby_plot <- function(name_to_display = "John") {
  gf_line(prop ~ year, 
          data = babynames %>% filter(name == name_to_display), 
          color = ~ sex) %>%
    gf_labs(title = name_to_display)
}
baby_plot("Steve")

Multiple names at one go

It would be nice to be able to do multiple names at once. Let’s adjust our function a bit to handle this. The main chages are

  • the use of %in% to test if a name is among a list of names
  • adding faceting to label the plot.
  • removing the title
baby_plot2 <- function(name_to_display = "John") {
  gf_line(prop ~ year, 
          data = babynames %>% filter(name %in% name_to_display), 
          color = ~ sex) %>%
    gf_facet_grid(name ~ ., scales = "free")
}
baby_plot2(c("Adrian", "Hillary", "Leslie"))

Here’s a variation on the theme.

baby_plot3 <- function(name_to_display = "John") {
  gf_line(prop ~ year, 
          data = babynames %>% filter(name %in% name_to_display), 
          color = ~ name) %>%
    gf_facet_grid(sex ~ ., scales = "free")
}
baby_plot3(c("Adrian", "Hillary", "Leslie"))

Modifying plots that come from functions

You can continue to modify plots after they are produced by a function

baby_plot3(c("Adrian", "Hillary", "Leslie")) %>%
  gf_labs(title = "A plot with a title", caption = "Source: US Census Bureau") 

Combining names

Now let’s combine several related names and make a plot showing them as one group.

babynames %>% 
  filter(name %in% c("Robert", "Bob", "Rob", "Bobby", "Robbie", "Bert", "Bobbie", "Robby")) %>%
  group_by(year) %>%
  summarise(total_prop = sum(prop)) %>%
  gf_line(total_prop ~ year) 

Notice:

  • filter() selects the names we are interested in
  • group_by() causes subsequent computation to be done within years
  • summarise() collapses each group (year) to a single row
  • instead of saving the modified data, we can just pipe it directly into gf_line() if we prefer.

Letting the plot do the work

babynames %>% 
  filter(name %in% c("Robert", "Bob", "Rob", "Bobby", "Robbie", "Bert", "Bobbie", "Robby")) %>%
  gf_area(prop ~ year, fill = ~name, group = ~name, 
          alpha = 0.5, position = "stack") %>%
  gf_facet_grid( sex ~ ., scales = "free")

Numerical functions

Suppose we want to explore a likelihood function assuming sampling from a beta distribution.

Some random data

Let’s start be generating some random data from a beta distribution. (You might do this as part of a simulation; otherwise you might have some actual data to use.)

set.seed(123)  # so we get the same random data each time we build the file.
mydata <- rbeta(40, shape1 = 1, shape2 = 2)
gf_dhistogram( ~ mydata, binwidth = 0.1) %>%
  gf_dist("beta", shape1 = 1, shape2 = 2, color = "red") %>%
  gf_lims(x = c(0,1)) 

The likelihood function

The likelihood function is the product of pdfs. Here’s a way to create it as a function.

mylik1 <- function(a, b, x) {
  prod(dbeta(x, shape1 = a, shape2 = b))
}

This works as long as we only want to use one value of a and of b:

mylik1(a = 2, b = 3, x = mydata)
## [1] 117.5144
mylik1(a = 1, b = 2, x = mydata)
## [1] 4620.433

But we want to apply this to many values of a and b, and our function needs a little bit of help to do that properly.

# this is NOT doing what we want
mylik1(a = c(2,1), b = c(3,2), x = mydata)
## [1] 274.152

Vectorize() to the rescue

The problem is that our function is not vectorized properly. We can use Vectorize() to tell it how we want vectorization done. Now we get the equivalent of running mylik1() twice, once with each combination of a and b.

mylik <- Vectorize(mylik1, c("a", "b"))
mylik(a = c(2,1), b = c(3,2), x = mydata)
## [1]  117.5144 4620.4329

Using a grid to search for the maximum likelihood

Grid <-
  expand.grid(a = seq(0.7, 1.8, by = 0.01), 
              b = seq(1.7, 2.8, by = 0.01)) %>%
  mutate(likelihood = mylik(a, b, mydata)) 

Grid %>%
  arrange(-likelihood) %>%
  top_n(3)
## Selecting by likelihood
##      a    b likelihood
## 1 1.26 2.50   8714.891
## 2 1.26 2.51   8712.490
## 3 1.26 2.49   8710.811

Visualizing with gf_tile() and gf_contour()

One way to visualize values on a grid is is with gf_tile() and/or gf_contour().

gf_tile(likelihood ~ a + b, data = Grid) %>%
  gf_contour(likelihood ~ a + b, color = "white", data = Grid) %>%
  gf_point(b ~ a, color = "yellow",
           data = Grid %>% arrange(-likelihood) %>% top_n(1))
## Selecting by likelihood

Vectorize() in slow motion

Sometimes it helps to see things in slow motion. Let’s define a new function where it is easier for us to the calculations by hand.

fun1 <- function(a, b, x) {
  sum(a * b * x)
}


fun1(1, 2, x = 1:4)
## [1] 20
1 * 2 * 1 + 1 * 2 * 2 + 1 * 2 * 3 + 1 * 2 * 4
## [1] 20
fun1(2, 3, x = 1:4)
## [1] 60
2 * 3 * 1 + 2 * 3 * 2 + 2 * 3 * 3 + 2 * 3 * 4
## [1] 60
# a, b, and x are all used in sequence, recycling the shorter vectors
fun1(1:2, 2:3, x = 1:4)
## [1] 44
1 * 2 * 1 + 2 * 3 * 2 + 1 * 2 * 3 + 2 * 3 * 4
## [1] 44

Now let’s vectorize.

fun <- Vectorize(fun1, c("a", "b"))

# now each combo of a and b is used with all of x and we get multiple results
fun(1:2, 2:3, x = 1:4)
## [1] 20 60
1 * 2 * 1 + 1 * 2 * 2 + 1 * 2 * 3 + 1 * 2 * 4
## [1] 20
2 * 3 * 1 + 2 * 3 * 2 + 2 * 3 * 3 + 2 * 3 * 4
## [1] 60

Log Scales

Herea are three ways to use log scales. For a simple plot, they are usually roughly equivalent, but if the plot does some computation on the data, they can differ because the computation is done a different point in the process.

gf_point(log10(dist) ~ log10(speed), data = cars) %>%
  gf_smooth(se = FALSE) 
## `geom_smooth()` using method = 'loess'

# here the smoothing happens first, then the log transformation
gf_point(dist ~ speed, data = cars) %>%
  gf_smooth(se = FALSE) %>%
  gf_refine(coord_trans(x = "log2", y = "log2"))
## `geom_smooth()` using method = 'loess'

# here the log transformation happens first, then the smoothing
gf_point(dist ~ speed, data = cars) %>%
  gf_smooth(se = FALSE) %>%
  gf_refine(
    scale_x_continuous(trans = "log2"),
    scale_y_continuous(trans = "log2")
  ) 
## `geom_smooth()` using method = 'loess'