Fitting Distributions to Data

Problem: Suppose we have data for a continuous random variable X. Can we find (estimate) the distribution for X?

Part 1: What type of distribution? Exponential, normal, uniform, gamma, beta,
Part 2: What are the values of the parameters for the distribution?

Three Methods

Method of moments
- Only requires summaries of data, not complete data.
- Can be done by hand in many important cases
- But there are better ways to get estimates (better statistical properties)
Maximum likelihood
- Harder to do by hand, but better estimates
- Requires the full data set, lots of computation, and optimization
- This is what fitdistr() and gf_fitdistr() do)
Kernel density estimate
- Let’s us skip step 1
- This is what gf_dens() and gf_density()
- Big idea: replace each data point with a little curve (a kernel) and add them together.
- Get different results depending on the choice of curve.

All of these methods work better with large data sets than with small data sets – it is hard to determine the overall shape of a distribution without a large data set.

Method of Moments

Big idea:

Express the mean, variance, etc. of the distribution in terms of the parameters of the distribution. (You can look this information up in our book or on Wikipedia.)
Set them equal to the mean, variance, etc. of your data set.
Solve.

Example: A sample has mean 15 and standard deviation 3. Use the method of moments to fit a uniform distribution.

moment	data	distribution
mean	15	\(\frac{a+b}{2}\)
variance	3^2	\(\frac{(b-a)^2}{12}\)

So we need to solve this system of equations:

\[\begin{align*} 15 &= \frac{a+b}{2} \\ 3^2 &= \frac{(b-a)^2}{12} \\ \end{align*}\]

If we let \(m = \frac{a + b}{2}\) and let \(d = b-a\), our equations become

\[\begin{align*} 15 &= m \\ 3^2 &= \frac{d^2}{12} \\ \end{align*}\]

This is much easier. We see that \(d = \sqrt{9 \cdot 12}\). Let let R do the arithmetic.

m <- 15; m

## [1] 15

d <- sqrt(9*12); d

## [1] 10.4

a <- m - d/2; a

## [1] 9.8

b <- m + d/2; b

## [1] 20.2

So \(\hat a = 9.8\), \(\hat b = 20.2\).

Example: A sample has mean 15 and standard deviation 3. Use the method of moments to fit a normal distribution.

This is a very easy case.

moment	data	distribution
mean	15	\(\mu\)
standard deviation	3	\(\sigma\)

So we get \(\hat \mu = 15\) and \(\hat \sigma = 3\).

Example: A sample has mean 15 and standard deviation 3. Use the method of moments to fit an exponential distribution.

In this case, we only need the mean.

moment	data	distribution
mean	15	\(1/ \lambda\)

So \(\hat \lambda = 1/15\).

Example: A sample has mean 15 and standard deviation 3. Use the method of moments to fit an gamma distribution.

moment	data	distribution
mean	15	\(\alpha \beta\)
variance	9	\(\alpha \beta^2\)

So we need to solve \[\begin{align*} 15 &= \alpha \beta \\ 9 &= \alpha \beta^2 \\ \end{align*}\]

Since \(9 =\alpha \beta \beta = 15 \beta\), we have \(\hat\beta = 9/15\), and \(\hat \alpha = \frac{15}{9/15} = \frac{15^2}{9} = 25\).

General Example: A sample has mean = ________ and standard deviation = _________, use the method of moments to fit a ________ distribution.

Might be easy or hard depending on the algebra involved.
If there is only one parameter, we only need to use the mean.
There are ways to use distributions with more than 2 parameters, but we won’t cover that in this class.
No guarantee that the fit is a good fit. You can fit any distribution using any method to any data. Other methods (plots, for example) need to be used to see whether the fit is any good.

Maximum Likelihood (`fitdistr()`)

Example: Wind speed

We can use the following code to load a data set that contains three year’s worth of mean hourly wind speeds (mph) in Twin Falls, ID. This kind of data is often used to estimate how much power could be generated from a windmill placed in a given location.

Wind <- 
  read.csv("https://rpruim.github.io/Engineering-Statistics/data/stob/TwinfallsWind.csv")
head(Wind, 2)

##       date time speed
## 1 1/1/2010 0:00  2.24
## 2 1/1/2010 1:00  2.42

tail(Wind, 2)

##             date  time speed
## 26272 12/31/2012 22:00  3.88
## 26273 12/31/2012 23:00  5.04

gf_histogram( ~ speed, data = Wind, binwidth = 1 )

As we can see, the distribution is skewed, but it doesn’t look like an exponential distribution would be a good fit. Of the distributions we have seen, it seems like a Weibull or Gamma distribution would be a potentially good choice. A Weibull model has often been used as a model for mean hourly wind speed, and the shape of our histogram indicates that this is a reasonable family of distributions.

Which Weibull distribution is the best model for our data?

fitdistr( Wind$speed, "weibull" )

## Error in fitdistr(Wind$speed, "weibull"): Weibull values must be > 0

For fitdistr()() to fit a Weibull distribution, all of the data must be positive, but our data includes some 0’s.

tally( ~ (speed == 0), data = Wind)

## (speed == 0)
##  TRUE FALSE 
##    48 26225

Let’s see how small the smallest non-zero measurements are.

min( ~ speed, data = Wind %>% filter(speed > 0))

## [1] 0.01

This may well be a simple rounding issue, since the wind speeds are recorded to the nearest 0.01 and 0.01 is the smallest positive value. Let’s create a new variable that moves each value of 0 to 0.0025 and try again. Why 0.0025? If we think that 0.01 represents anything in the range 0.005 to 0.015, which would round to 0.01, then 0 represents anything in the range 0 to 0.005.
0.0025 is the middle of that range.

Wind <- Wind %>% mutate(speed2 = ifelse( speed > 0, speed, 0.0025))
fitdistr( Wind$speed2, "weibull" )

##     shape     scale 
##   1.69442   6.65059 
##  (0.00796) (0.02555)

This says that the best fitting (in the sense of maximum likelihood) Weibull distribution is the \({\sf Weibull}(1.69, 6.65)\)-distribution.

Let’s see how well it fits:

gf_dhistogram( ~ speed2, data = Wind) %>%
  gf_fitdistr( ~ speed2, data = Wind, dist = "weibull")

This can be abbreviated a bit:

gf_dhistogram( ~ speed2, data = Wind) %>%
  gf_fitdistr(dist = "weibull")

Some Practice

1. Suppose a data set has a sample mean of 6 and a standard deviation of 2. Use the method of moments to compute estimates of the shape and rate paramters for a gamma distribution.

2. Fit a gamma distribution to the wind speed data. Use both the method of moments and fitdistr(). How do the gamma and Weibull fits compare?

3. The weight variable in the ChickWeight data frame contains the weights (g) of baby chicks at different times under different feeds.

Plot a histogram of the weights.
Of the distributions: uniform, exponential, normal, and gamma, which look like they might fit the data well?

Use fitdistr() to fit the type of distribution in (b) to the data. What is the estimate of the parameter(s) for the distribution?

Plot the histogram with the fitted density function.
Use fitdistr() to fit one of the other types of distributions from (b) and plot the histogram with the density curve for that type of distribution. How does the fit look?

4. This problem use the Womenlf data set from the car package. The hincome variable contains household income for 263 Canadian women.

Plot a histogram of these incomes.
Use the method of moments to fit a normal distribution to these data.
Use fitdistr() to fit a normal distribution to these data. How does this compare to (b)?
Plot the histogram with the fitted normal density function.
Now fit a gamma distribution to this data. What are the estimates for the shape and rate of the fitted gamma distribution?
Plot the histogram with the fitted gamma density function.
Compare the quality of the fits with normal vs gamma. Does one clearly fit better than the other?

5. The dataframe morley contains 100 measurements that Morley made of the speed of light. These values are contained in the column Speed.

Plot the histogram of this data.
Which kind of distribution best fits the data?
Find the estimated parameters for this distribution.
Plot the histogram with the fitted density curve.

Fitting Distributions to Data

Stat 241

Method of Moments

In this case, we only need the mean.

Maximum Likelihood (fitdistr())

Some Practice

Maximum Likelihood (`fitdistr()`)