Old Faithful

data(geyser, package = "MASS")
Geyser2 <- geyser %>% filter( ! duration %in% c(2.0, 3.0, 4.0))

Mixture of Normals with modified data

# load dmix() and loglik.faithful() from text
snippet("faithful-mle01")
## 
## ## snippet: faithful-mle01
## 
## > # density function for mixture of normals
## > dmix <- function(x, alpha, mu1, mu2, sigma1, sigma2) {
## +   if (alpha < 0 || alpha > 1) return (NA)
## +   if (sigma1 < 0 || sigma2 < 0) return (NA)
## +   alpha * dnorm(x, mu1, sigma1) + (1-alpha) * dnorm(x, mu2, sigma2)
## + }
snippet("faithful-mle02")
## 
## ## snippet: faithful-mle02
## 
## > # log-likelihood
## > loglik.faithful <- function(theta, x) {
## +   alpha <- theta[1]  
## +   mu1 <- theta[2]; mu2 <- theta[3]
## +   sigma1 <- theta[4]; sigma2 <- theta[5]
## +   
## +   sum(log(dmix(x, alpha, mu1, mu2, sigma1, sigma2)))
## + }
# with original data
ml0 <- 
  maxLik(loglik.faithful, 
       x = geyser$duration, 
       start = c(alpha = 0.3, mu1 = 2, mu2 = 5, sd1 = 0.4,  sd2 = 0.2 ) 
)
mle0 <- coef(ml0)
ml0
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 10 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -298.1438 (5 free parameter(s))
## Estimate(s): 0.3395497 1.950493 4.237297 0.228234 0.431287
# with reduced data
ml1 <- 
  maxLik(loglik.faithful, 
       x = Geyser2$duration, 
       start = c(alpha = 0.3, mu1 = 2, mu2 = 5, sd1 = 0.4,  sd2 = 0.2 ) 
)
mle1 <- coef(ml1)
ml1
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 11 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -232.953 (5 free parameter(s))
## Estimate(s): 0.3699113 1.97203 4.377528 0.3093057 0.3949012

Comparing the two mixture of normal models

The new model fits the reduced data well, but how does it compare to the original model on the full data?

gf_dhistogram( ~ duration, data = geyser, binwidth = 0.20, alpha = 0.5) %>% 
  gf_function(fun = dmix, color = "blue",
              args = list(
                alpha =  mle1["alpha"],
                mu1 = mle1["mu1"], mu2 = mle1["mu2"], 
                sigma1 = mle1["sd1"], sigma2 = mle1["sd2"])
  ) %>%
  gf_function(fun = dmix, color = "red",
              args = list(
                alpha =  mle0["alpha"],
                mu1 = mle0["mu1"], mu2 = mle0["mu2"], 
                sigma1 = mle0["sd1"], sigma2 = mle0["sd2"])
  ) 

The relative proportions of long and short eruptions shift a bit and the standard deviation of the shorter eruption times increases when we remove the integer eruption times. Some of this is a distortion from the lost information about eruptions that were near – but not exactly – 2, 3, or 4 minutes long. (A better model would continue to use the erruptions of length 2, 3 or 4 in some way.)

Mixture of Gammas

dmix2 <- function(x, alpha, shape1, lambda1, shape2, lambda2) {
  if (alpha < 0 || alpha > 1) return (NA)
  if (lambda1 < 0 || lambda2 < 0) return (NA)
  if (shape1 < 0 || shape2 < 0) return (NA)
  alpha * dgamma(x, shape = shape1, rate = lambda1) + 
    (1-alpha) * dgamma(x, shape = shape2, rate = lambda2)
}

loglik.faithful2 <- function(theta, x) {
  sum(log(dmix2(x, theta[1], theta[2], theta[3], theta[4], theta[5])))
}

ml2 <-
  maxLik(loglik.faithful2, x = Geyser2$duration, 
         start = c(alpha = 0.3, shape1 = 100, lambda1 = 30, 
                   shape2 = 400, lambda2 = 90)
  )
mle2 <- coef(ml2)
ml2
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 18 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: -234.9234 (5 free parameter(s))
## Estimate(s): 0.3724042 37.86704 19.12808 123.5793 28.198

Comparison of the two models

The fits are similar, but not identical. Gamma distributions with large shape parameters are approximately normal, so this isn’t so surprising. The center, spread, and amount of skew for the Gamma distributions all depend on both parameters, however, whereas in the normal distributions the mean and stadard deviation can be selected independently and the distributions must be symmetric.

gf_dhistogram( ~ duration, data = Geyser2, binwidth = 0.20, alpha = 0.5) %>% 
  gf_function(fun = dmix, color = "blue",
              args = list(
                alpha =  mle1["alpha"],
                mu1 = mle1["mu1"], mu2 = mle1["mu2"], 
                sigma1 = mle1["sd1"], sigma2 = mle1["sd2"])
  ) %>%
  gf_function(fun = dmix2, color = "red",
              args = list(
                alpha =  mle2["alpha"],
                shape1 = mle2["shape1"], lambda1 = mle2["lambda1"], 
                shape2 = mle2["shape2"], lambda2 = mle2["lambda2"])
  )