Transformations and combinations allow us to create new random variables from old.
Toss a pair of fair dice. Let \(X\) be the result on the first die and \(Y\) the result on the second die.
Then \(S = X + Y\) is the sum of the two dice and \(P = XY\) is the product.
Let \(F\) be the temperature of a randomly selected object in degrees Fahrenheit.
Then \(C = \frac{5}{9}(F-32)\) is the object’s temperature in degrees Celsius.
A random sample of \(n\) adult females is chosen. Let \(X_i\) be the height (in inches) of the \(i\)th person in the sample. Then \(\overline{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}\) is the sample mean.
General Question: If \(X\) is the result of combining two or more known random variables or of transforming a single random variable, what can we know about the distribution of \(X\)?
0. \(\operatorname{Var}(X) = \operatorname{E}(X^2) - \operatorname{E}(X)^2\)
1a. \(\operatorname{E}(X+b) = \operatorname{E}(X) + b\) and \(\operatorname{Var}(X+b) = \operatorname{Var}(X)\).
1b. \(\operatorname{E}(aX) = a\operatorname{E}(X)\) and \(\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)\).
1. \(\operatorname{E}(aX + b) = a \operatorname{E}(X) + b\)
2. \(\operatorname{E}(X+Y) = \operatorname{E}(X) + \operatorname{E}(Y)\).
3. If \(X\) and \(Y\) are independent, then \(\operatorname{E}(XY) = \operatorname{E}(X)\operatorname{E}(Y)\).
4. If \(X\) and \(Y\) are independent, then \(\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)\).
These rules are not too difficult to prove, but we will focus mainly on how to use them.
Here are two example proofs (for the continuous case). We will omit the limits of integration to focus attention on the use of simple rules of integration.
\[ \operatorname{E}(X + b) = \int (x + b) f(x) \; dx = \int x f(x) \;dx + b \int f(x) \; dx = \operatorname{E}(X) + b \]
\[ \operatorname{E}(aX) = \int ax f(x) \; dx = a \int x f(x) \;dx = a \operatorname{E}(X) \] The rules involving more than one random variable require double integrals, but they are also straightforward. The rules for discrete random variables involve sums (and double sums) instead of integrals.
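As a quick numerical sanity check of rules 0 and 4, here is a small sketch; the exponential and uniform distributions below are arbitrary choices.
x <- rexp(100000, rate = 2)       # an arbitrary distribution for X
y <- runif(100000, 0, 1)          # an independent Y with a different distribution
var(x)                            # sample variance of x ...
mean(x^2) - mean(x)^2             # ... is close to E(X^2) - E(X)^2 (rule 0)
var(x + y)                        # rule 4: for independent X and Y ...
var(x) + var(y)                   # ... these two should be nearly equal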
\(X\) and \(Y\) are independent random variables if the distribution of \(X\) is the same for each value of \(Y\) and vice versa. This is equivalent to saying that
\[ \operatorname{P}(X \le x \text{ and } Y \le y) = \operatorname{P}(X \le x) \cdot \operatorname{P}(Y \le y) \] for all \(x\) and \(y\).
If a pair of fair dice are tossed, \(X\) is the value of the first die, and \(Y\) is the value of the second, then \(X\) and \(Y\) are independent.
If \(X\) and \(Y\) are the height and weight of a randomly selected person, then \(X\) and \(Y\) are not independent. (The distribution of weights is different for taller people compared to the distribution for shorter people.)
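Here is a small simulation sketch illustrating the definition for the dice example (the cutoffs 2 and 3 are arbitrary choices):
die1 <- resample(1:6, 10000)
die2 <- resample(1:6, 10000)
mean(die1 <= 2 & die2 <= 3)          # estimates P(X <= 2 and Y <= 3)
mean(die1 <= 2) * mean(die2 <= 3)    # P(X <= 2) * P(Y <= 3) -- should be close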
In the examples below we will compare simulations to the rules above.
Two Dice
Let \(X\) and \(Y\) be the values on two fair dice. Look at the sum and product: \(X+Y\) and \(XY\).
TwoDice <-
  tibble(
    die1 = resample(1:6, 10000),
    die2 = resample(1:6, 10000),
    S = die1 + die2,
    P = die1 * die2
  )
mean(~ die1, data = TwoDice)
## [1] 3.4908
mean(~ die2, data = TwoDice)
## [1] 3.5356
mean(~ S, data = TwoDice)
## [1] 7.0264
mean(~ P, data = TwoDice)
## [1] 12.3483
gf_histogram(~ S, data = TwoDice, binwidth = 1)
gf_histogram(~ P, data = TwoDice, binwidth = 1)
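As a quick check against rules 2 and 3: each die has expected value 3.5, so
\[ \operatorname{E}(X + Y) = 3.5 + 3.5 = 7 \quad \text{and} \quad \operatorname{E}(XY) = \operatorname{E}(X) \operatorname{E}(Y) = 3.5 \cdot 3.5 = 12.25, \]
both close to the simulated means above.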
Uniform
Let \(X\) and \(Y\) be independent random variables that have the uniform distribution on [0,1]. Again, let’s look at the sum and product.
TwoUnif <-
  tibble(
    X = runif(10000, 0, 1),
    Y = runif(10000, 0, 1),
    S = X + Y,
    P = X * Y
  )
mean(~ S, data = TwoUnif)
## [1] 1.006544
mean(~ P, data = TwoUnif)
## [1] 0.2525068
gf_histogram(~ S, data = TwoUnif)
gf_histogram(~ P, data = TwoUnif)
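Again the simulated means line up with rules 2 and 3: for the uniform distribution on [0,1], \(\operatorname{E}(X) = \operatorname{E}(Y) = 1/2\), so \(\operatorname{E}(S) = 1\) and, since \(X\) and \(Y\) are independent, \(\operatorname{E}(P) = \frac12 \cdot \frac12 = \frac14\).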
Build your own examples
You can do a similar thing with any distributions you like. It even works if \(X\) and \(Y\) have different distributions!
We can combine the rules above to build two more rules.
5. \(\operatorname{E}(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 \operatorname{E}(X_1) + a_2 \operatorname{E}(X_2) + \cdots + a_n \operatorname{E}(X_n)\)
6. If \(X_1, X_2, \dots, X_n\) are independent, then \[ \operatorname{Var}(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1^2 \operatorname{Var}(X_1) + a_2^2 \operatorname{Var}(X_2) + \cdots + a_n^2 \operatorname{Var}(X_n) \]
\[ \operatorname{SD}(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = \sqrt{ a_1^2 \operatorname{SD}(X_1)^2 + a_2^2 \operatorname{SD}(X_2)^2 + \cdots + a_n^2 \operatorname{SD}(X_n)^2 } \]
Fact 1 Any linear transformation of a normal random variable is normal.
Fact 2 Any linear combination of independent normal random variables is normal.
Example
If \(X \sim {\sf Norm}(1,1)\), \(Y \sim {\sf Norm}(-1,2)\), \(W \sim {\sf Norm}(5,4)\), and \(X,Y,W\) are independent, find the distribution of \(C = 2X + 3Y + W\).
By Fact 2, \(C\) will be normal. We just need to work out the mean and variance.
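By rules 5 and 6 (recall that the second parameter of \({\sf Norm}\) here is the standard deviation, so the variances are 1, 4, and 16):
\[ \operatorname{E}(C) = 2 \cdot 1 + 3 \cdot (-1) + 1 \cdot 5 = 4 \]
\[ \operatorname{Var}(C) = 2^2 \cdot 1 + 3^2 \cdot 4 + 1^2 \cdot 16 = 56, \qquad \operatorname{SD}(C) = \sqrt{56} \approx 7.48 \]
So \(C \sim {\sf Norm}(4, \sqrt{56})\).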
Let’s compare this to some simulations.
NormalExample <-
  tibble(
    X = rnorm(10000, 1, 1),
    Y = rnorm(10000, -1, 2),
    W = rnorm(10000, 5, 4),
    C = 2 * X + 3 * Y + W
  )
df_stats(~ C, data = NormalExample, mean, sd)
## response mean sd
## 1 C 4.045686 7.519876
gf_dhistogram( ~ C, data = NormalExample) %>%
gf_fitdistr(~ C, data = NormalExample, "norm")
We used gf_fitdistr() to overlay a normal density curve on our density histogram above. What is that doing?
It is based on a function called fitdistr() which (as you can probably guess) fits distributions to data. Basically, fitdistr() is looking for the best fitting distribution in a given family for a given data set. fitdistr() is a little bit awkward to use because it does not accept the formula and data = interface used by the other functions here, so below we extract each variable with $. But otherwise it is simple to use. Here it tells us what it thinks are the best fitting normal distributions in our previous example.
fitdistr(NormalExample$X, "normal")
## mean sd
## 0.991422114 1.007772233
## (0.010077722) (0.007126026)
fitdistr(NormalExample$Y, "normal")
## mean sd
## -0.97969889 1.99990948
## ( 0.01999909) ( 0.01414150)
fitdistr(NormalExample$W, "normal")
## mean sd
## 5.00193842 3.99621302
## (0.03996213) (0.02825749)
fitdistr(NormalExample$C, "normal") # this is the one that was added to our plot
## mean sd
## 4.04568598 7.51949978
## (0.07519500) (0.05317089)
How does fitdistr() work? There are two important ways to fit distributions to data:
* Method of moments: choose the parameter values that make the mean (or the mean and standard deviation) of the distribution equal to the mean (and standard deviation) of the data.
* Maximum likelihood: choose the parameter values that maximize the likelihood of the observed data.
fitdistr() uses the maximum likelihood method (via numerical optimization). But in the case of a normal distribution, both methods give the same result (up to some round-off in the numerical optimization).
fitdistr(NormalExample$C, "normal") # this is the one that was added to our plot
## mean sd
## 4.04568598 7.51949978
## (0.07519500) (0.05317089)
# method of moments
df_stats(~ C, data = NormalExample, mean, sd)
## response mean sd
## 1 C 4.045686 7.519876
fitdistr() knows about several other families of distributions as well. See ?fitdistr for the list and the names fitdistr() uses. Note that for some distributions, you will need to provide a reasonable starting guess for the parameter values.
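For example, here is a minimal sketch of supplying a starting guess; the data and starting values below are made up for illustration. Passing the density function dgamma directly, rather than the name "gamma", is one situation where start is required:
x <- rgamma(10000, shape = 3, rate = 4)   # made-up data for illustration
# passing a density function requires starting values; the lower bound keeps
# the optimizer away from negative parameter values
fitdistr(x, dgamma, start = list(shape = 1, rate = 1), lower = 0.001)
fitdistr(x, "gamma")                      # the named version can compute its own starting values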
Fact 3 If \(X_1, X_2, \dots, X_n\) are independent random variables that have the same distribution, then the sum will be approximately normal, no matter what distribution \(X_i\) has, provided \(n\) is “large enough”.
The approximation gets better and better as \(n\) increases.
Simulation
sim1 <- runif(10000, 0, 1)
sim2 <- runif(10000, 0, 1)
sim3 <- runif(10000, 0, 1)
sim4 <- runif(10000, 0, 1)
sim5 <- runif(10000, 0, 1)
sim6 <- runif(10000, 0, 1)
sum2 <- sim1 + sim2
sum4 <- sim1 + sim2 + sim3 + sim4
sum6 <- sim1 + sim2 + sim3 + sim4 + sim5 + sim6
gf_dhistogram( ~ sum2, title = "Sum of 2 uniform random variables")
gf_dhistogram( ~ sum4, title = "Sum of 4 uniform random variables")
gf_dhistogram( ~ sum6, title = "Sum of 6 uniform random variables")
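Rules 2 and 4 also tell us what to expect in these histograms: a \({\sf Unif}(0,1)\) random variable has mean \(1/2\) and variance \(1/12\), so the sum of \(n\) of them has mean \(n/2\) and variance \(n/12\). For the sum of 6, that is a mean of 3 and a standard deviation of \(\sqrt{6/12} \approx 0.71\).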
It takes longer for an exponential distribution, but it still converges to normal pretty quickly.
sim1 <- rexp(10000, 1)
sim2 <- rexp(10000, 1)
sim3 <- rexp(10000, 1)
sim4 <- rexp(10000, 1)
sim5 <- rexp(10000, 1)
sim6 <- rexp(10000, 1)
sim7 <- rexp(10000, 1)
sim8 <- rexp(10000, 1)
sum2 <- sim1 + sim2
sum4 <- sim1 + sim2 + sim3 + sim4
sum6 <- sim1 + sim2 + sim3 + sim4 + sim5 + sim6
sum8 <- sim1 + sim2 + sim3 + sim4 + sim5 + sim6 + sim7 + sim8
gf_dhistogram( ~ sum2, title = "Sum of 2 exponential random variables")
gf_dhistogram( ~ sum4, title = "Sum of 4 exponential random variables")
gf_dhistogram( ~ sum6, title = "Sum of 6 exponential random variables")
gf_dhistogram( ~ sum8, title = "Sum of 8 exponential random variables")
Often we are interested in the mean of something. The mean can be written as
\[ \overline X = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} \]
If each \(X_i\) is randomly selected from the same population, then the numerator is a sum of independent and identically distributed (iid) random variables, so…
Our rules tell us the mean and variance (and standard deviation) of \(\overline {X}\).
\(\overline{X}\) will be approximately normal, provided our sample is large enough.
This means we can approximate probabilities about \(\overline{X}\), no matter what distribution the \(X_i\)’s come from!
1. If each \(X_i\) has a mean of \(\mu\) and a standard deviation of \(\sigma\), and the \(X_i\) are independent, fill in the question marks below:
\[ \overline{X} \approx {\sf Norm}(?, ?) \]
This is arguably the most important result in all of statistics and is referred to as the Central Limit Theorem.
2. If \(X\) and \(Y\) are independent random variables, \(\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)\). What about \(\operatorname{Var}(X-Y)\)? Your first guess might be that \(\operatorname{Var}(X-Y) = \operatorname{Var}(X) - \operatorname{Var}(Y)\). But this cannot be true, since if \(\operatorname{Var}(Y) > \operatorname{Var}(X)\) we would have \(\operatorname{Var}(X-Y) < 0\), but variance is always non-negative.
What is the correct rule? [Hint: use the rules we have.]
\[ \operatorname{Var}(X - Y) = ?? \]
3. Suppose \(X \sim {\sf Norm}(10,3)\), \(Y \sim {\sf Norm}(6,2)\), and \(X\) and \(Y\) are independent.
4. Suppose \(X\) is a random variable with mean = 6 and sd = 2, \(Y\) is a random variable with mean = -1 and sd = 2, and \(X\) and \(Y\) are independent.
What are the mean and standard deviation of \(2X-Y\)?
What are the mean and standard deviation of \(3X+4Y\)?
5. Let \(X\) and \(Y\) be independent \({\sf Gamma}(shape = 3, rate = 4)\) random variables.
Let \(S = X+Y\) and \(D = X-Y\). Is a gamma distribution a good fit for S? Use a simulation with 10000 repetitions to test this. For each (sum and difference):
* Make a histogram of the simulated values.
* Use gf_fitdistr() to compare your histogram to the best fitting gamma distribution.
* Use fitdistr() to get the shape and rate parameters for the best fit.
6. So, it looks like the sum of two independent gamma random variables with the same shape and scale also has a gamma distribution.
We would not expect the difference to have a gamma distribution, since values of \(D\) can be negative and a gamma random variable cannot be negative. Use simulations to see whether it looks like \(D\) is normal.
7. Suppose \(X \sim {\sf Gamma}(3, 4)\) and \(W \sim {\sf Gamma}(1,5)\) and \(W\) is independent of \(X\). Use simulations to check whether it is reasonable to conclude that \(X+W\) has a gamma distribution.
8. Another important distribution in applications is the Chi-squared distribution. The Chi-squared distribution is a special case of the Gamma distribution. If \(n\) is a positive integer, then \(X\) has a Chi-squared distribution with \(n\) degrees of freedom if \(X\) has a Gamma distribution with shape = \(n/2\) and rate = \(1/2\), i.e., \({\sf Chisq}(n) = {\sf Gamma}(n/2, 1/2)\). The abbreviation for Chi-squared in R is chisq.
Let \(Z\) be a standard normal random variable; i.e., \(Z \sim {\sf Norm}(0,1)\). What kind of distribution does \(Z^2\) have? One of the following is true about \(Z^2\):
* It has a normal distribution, or
* It has a Chi-squared distribution with 1 degree of freedom.
Use simulations to determine which of these two alternatives is correct.