3 Probability

1. Suppose a random variable has the pdf \(p(x) = 6x (1-x)\) on the interval \([0,1]\). (That means it is 0 outside of that interval.)

Use function() to create a function p in R that is equivalent to \(p(x)\).
Use gf_function() or gf_fun() to plot the function on an interval a little wider than the interval \([0, 1]\) (so you can be sure it is doing the right thing outside that interval).
Integrate by hand to show that the total area under the pdf is 1 (as it should be for any pdf).
Now have R compute that same integral (using integrate()).
What is the largest value of \(p(x)\)? At what value of \(x\) does it occur? Is it a problem that this value is larger than 1?

Hint: differentiation might be useful.
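
The R tools named in this exercise can be tried out first on a different density. The sketch below is only an illustration: it assumes a hypothetical pdf \(q(x) = 20x^3(1-x)\) on \([0,1]\) (not the one in the exercise), and the names `q`, `area`, and `peak` are made up for this example. The plot is drawn only if the ggformula package is installed.

```r
# Hypothetical density for illustration only:
# q(x) = 20 x^3 (1 - x) on [0, 1], and 0 elsewhere.
# ifelse() keeps the function vectorized, which integrate() requires.
q <- function(x) {
  ifelse(x >= 0 & x <= 1, 20 * x^3 * (1 - x), 0)
}

# Plot on an interval slightly wider than [0, 1] (only if ggformula is available).
if (requireNamespace("ggformula", quietly = TRUE)) {
  print(ggformula::gf_function(fun = q, xlim = c(-0.5, 1.5)))
}

# Numerical check that the total area under the density is 1.
area <- integrate(q, lower = 0, upper = 1)
area$value

# Numerical search for the peak; differentiation gives the exact location.
peak <- optimize(q, interval = c(0, 1), maximum = TRUE)
peak$maximum    # location of the largest density value
peak$objective  # the largest density value itself
```

Comparing `peak$objective` with 1 is a quick way to see how large a density value can get while the total area stays at 1.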

2. Recall that \(\operatorname{E}(X) = \int x f(x) \;dx\) for a continuous random variable with pdf \(f\) and \(\operatorname{E}(X) = \sum x f(x)\) for a discrete random variable with pmf \(f\). (The integral or sum is over the support of the random variable.) Compute the expected value for the following random variables.

\(A\) is discrete with pmf \(f(x) = x/10\) for \(x \in \{1, 2, 3, 4\}\).
\(B\) is continuous with kernel \(f(x) = x^2(1-x)\) on \([0, 1]\).

Hint: first figure out what the pdf is.
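
As a rough sketch of the two formulas above in R, the following uses two hypothetical distributions that are not the ones in the exercise: a pmf \(g(x) = x/6\) on \(\{1, 2, 3\}\) and a kernel \(k(x) = x^3\) on \([0, 1]\). All object names are invented for the illustration. A kernel becomes a pdf once it is divided by its total area.

```r
# Hypothetical discrete pmf: g(x) = x / 6 for x in {1, 2, 3}.
x_vals <- 1:3
g <- x_vals / 6
sum(g)                                   # a pmf must sum to 1
mean_discrete <- sum(x_vals * g)         # E(X) as a weighted sum

# Hypothetical continuous kernel: k(x) = x^3 on [0, 1], 0 elsewhere.
k <- function(x) ifelse(x >= 0 & x <= 1, x^3, 0)
norm_const <- integrate(k, 0, 1)$value   # area under the kernel
pdf_k <- function(x) k(x) / norm_const   # pdf = kernel / area

# E(X) as an integral of x times the pdf over the support.
mean_continuous <- integrate(function(x) x * pdf_k(x), 0, 1)$value

mean_discrete
mean_continuous
```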

3. Compute the variance and standard deviation of each of the distributions in the previous problem.
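
One possible way to organize a variance calculation in R is sketched here, using the definition \(\operatorname{Var}(X) = \operatorname{E}\big((X - \operatorname{E}(X))^2\big)\). The two distributions are hypothetical stand-ins (a pmf \(g(x) = x/6\) on \(\{1, 2, 3\}\) and a pdf \(h(x) = 4x^3\) on \([0, 1]\)), not the ones from the previous problem.

```r
# Hypothetical discrete pmf: g(x) = x / 6 for x in {1, 2, 3}.
x_vals <- 1:3
g <- x_vals / 6
m_d <- sum(x_vals * g)                       # mean
var_discrete <- sum((x_vals - m_d)^2 * g)    # variance
sd_discrete <- sqrt(var_discrete)            # standard deviation

# Hypothetical continuous pdf: h(x) = 4 x^3 on [0, 1], 0 elsewhere.
h <- function(x) ifelse(x >= 0 & x <= 1, 4 * x^3, 0)
m_c <- integrate(function(x) x * h(x), 0, 1)$value
var_continuous <- integrate(function(x) (x - m_c)^2 * h(x), 0, 1)$value
sd_continuous <- sqrt(var_continuous)

c(var_discrete = var_discrete, sd_discrete = sd_discrete)
c(var_continuous = var_continuous, sd_continuous = sd_continuous)
```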

4. In Bayesian inference, we will often need to come up with a distribution that matches certain features that correspond to our knowledge or intuition about a situation. Find a normal distribution with a mean of 10 such that half of the distribution is within 3 of 10 (i.e., between 7 and 13).

Hint: use qnorm() to determine how many standard deviations are between 10 and 13.
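
The idea in the hint can be sketched with different numbers. The example below assumes a hypothetical target (mean 50, with 90% of the distribution within 8 of the mean); the variable names are invented for the illustration. The same pattern should carry over to other means, widths, and central probabilities.

```r
# Hypothetical target: mean 50, with 90% of the distribution within 8 of the mean.
target_mean <- 50
half_width  <- 8
central     <- 0.90

# The upper end of the central region sits at the (1 + central) / 2 quantile.
z <- qnorm((1 + central) / 2)   # number of standard deviations above the mean
sigma <- half_width / z         # standard deviation that matches the target

# Check: the probability between 42 and 58 should come out as 0.90.
check <- pnorm(target_mean + half_width, mean = target_mean, sd = sigma) -
  pnorm(target_mean - half_width, mean = target_mean, sd = sigma)

sigma
check
```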

5. School children were surveyed regarding their favorite foods. Of the total sample, 20% were 1st graders, 20% were 6th graders, and 60% were 11th graders. For each grade, the following table shows the proportion of respondents that chose each of three foods as their favorite.

a. From that information, construct a table of joint probabilities of grade and favorite food.

b. Are grade and favorite food independent? Explain how you got your answer.

    grade | Ice cream  | Fruit       | French fries
    :----:|:----------:|:-----------:|:------------:
    1st   | 0.3        | 0.6         | 0.1
    6th   | 0.6        | 0.3         | 0.1
    11th  | 0.3        | 0.1         | 0.6
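
The step from marginal and conditional probabilities to joint probabilities uses \(p(\text{group}, \text{choice}) = p(\text{choice} \mid \text{group}) \, p(\text{group})\). A minimal sketch in R follows, with a small hypothetical two-by-two example rather than the survey data; the names `p_group`, `p_choice_given_group`, and `joint` are invented here.

```r
# Hypothetical marginal distribution of a grouping variable.
p_group <- c(g1 = 0.4, g2 = 0.6)

# Hypothetical conditional probabilities p(choice | group); each row sums to 1.
p_choice_given_group <- rbind(
  g1 = c(yes = 0.5, no = 0.5),
  g2 = c(yes = 0.2, no = 0.8)
)

# Joint probabilities: sweep() multiplies row i by p_group[i].
joint <- sweep(p_choice_given_group, 1, p_group, "*")
joint
sum(joint)  # all joint probabilities together must sum to 1

# Under independence, the joint table equals the product of its marginals.
product_of_marginals <- outer(rowSums(joint), colSums(joint))
independent <- isTRUE(all.equal(joint, product_of_marginals,
                                check.attributes = FALSE))
independent
```

An equivalent check is whether every row of the conditional table is the same; if the rows differ, the two variables cannot be independent.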

6. Alice has 3 hats labeled with the letters H, A, and T. In each hat are marbles of various colors.

    Hat | White marbles | Red marbles | Yellow marbles
    :--:|:-------------:|:-----------:|:--------------:
    H   | 4             | 10          | 6
    A   | 6             | 12          | 2
    T   | 5             | 3           | 2

Alice randomly selects a hat by flipping two coins. If both are heads, she chooses hat H. If both are tails, she chooses hat T. If there is one head and one tail, she chooses hat A. Once that hat is selected, she draws out two marbles.

If the two marbles are both white, what is the probability that the hat was hat A?
If there is one red marble and one yellow marble, what is the probability that the hat was hat A?
If the two marbles are the same color, what is the probability that the hat was hat A?
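
Problems of this kind combine a prior over containers with the probability of a draw from each container. The sketch below uses a hypothetical pair of urns (not Alice's hats) and assumes the two marbles are drawn without replacement; the names and counts are invented for the illustration.

```r
# Hypothetical setup: two urns, each picked with probability 1/2.
prior <- c(U1 = 0.5, U2 = 0.5)
white <- c(U1 = 3, U2 = 2)
red   <- c(U1 = 2, U2 = 3)
total <- white + red

# Assumption: two marbles are drawn without replacement.
# p(both white | urn) = choose(white, 2) / choose(total, 2)
lik_two_white <- choose(white, 2) / choose(total, 2)

# p(one white and one red | urn), in either order.
lik_one_each <- white * red / choose(total, 2)

# Bayes' rule: the posterior is proportional to prior * likelihood.
posterior <- prior * lik_two_white / sum(prior * lik_two_white)
posterior
```

For an event such as "both marbles are the same color", the likelihood for each urn would be the sum of the likelihoods of the individual same-color outcomes.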

7. More testing.

a. Suppose that the population consists of 100,000 people. Compute how many people would be expected to fall into each cell of Table 5.4 on page 104 of DBDA2e. (To compute the expected number of people in a cell, just multiply the cell probability by the size of the population.)

    You should find that out of 100,000 people, only 100 have the disease, while 99,900 do not have the disease. These marginal frequencies instantiate the prior probability that \(p(\theta = \frown) = 0.001\). Notice also the cell frequencies in the column \(\theta = \frown\), which indicate that of 100 people with the disease, 99 have a positive test result and 1 has a negative test result. These cell frequencies instantiate the hit rate of 0.99. Your job for this part of the exercise is to fill in the frequencies of the remaining cells of the table.

b. Take a good look at the frequencies in the table you just computed for the previous part. These are the so-called “natural frequencies” of the events, as opposed to the somewhat unintuitive expression in terms of conditional probabilities (Gigerenzer & Hoffrage, 1995). From the cell frequencies alone, determine the proportion of people who have the disease, given that their test result is positive.

    Your answer should match the result from applying Bayes’ rule to the probabilities.

c. Now we’ll consider a related representation of the probabilities in terms of natural frequencies, which is especially useful when we accumulate more data. This type of representation is called a “Markov” representation by Krauss, Martignon, and Hoffrage (1999). Suppose now we start with a population of \(N = 10{,}000{,}000\) people. We expect 99.9% of them (i.e., 9,990,000) not to have the disease, and just 0.1% (i.e., 10,000) to have the disease. Now consider how many people we expect to test positive. Of the 10,000 people who have the disease, 99% (i.e., 9,900) will be expected to test positive. Of the 9,990,000 people who do not have the disease, 5% (i.e., 499,500) will be expected to test positive. Now consider re-testing everyone who has tested positive on the first test. How many of them are expected to show a negative result on the re-test?

d. What proportion of people who test positive at first and then negative on retest actually have the disease? In other words, of the total number of people at the bottom of the diagram in the previous part (those are the people who tested positive then negative), what proportion of them are in the left branch of the tree? How does the result compare with your answer to Exercise 5.1?
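
The natural-frequency idea can be sketched in R by multiplying cell probabilities by a population size. The example below deliberately uses hypothetical values (a population of 1,000, prevalence 0.02, hit rate 0.90, false alarm rate 0.10) rather than those of the exercise, and the object names are invented for the illustration.

```r
# Hypothetical values for illustration (not those of the exercise).
N           <- 1000
prevalence  <- 0.02   # p(disease)
hit_rate    <- 0.90   # p(test positive | disease)
false_alarm <- 0.10   # p(test positive | no disease)

# Expected number of people in each cell: cell probability times N.
counts <- rbind(
  positive = c(disease = N * prevalence * hit_rate,
               healthy = N * (1 - prevalence) * false_alarm),
  negative = c(disease = N * prevalence * (1 - hit_rate),
               healthy = N * (1 - prevalence) * (1 - false_alarm))
)
counts

# Proportion with the disease among those who test positive, from counts alone.
from_counts <- counts["positive", "disease"] / sum(counts["positive", ])

# The same quantity from Bayes' rule applied to the probabilities.
from_bayes <- prevalence * hit_rate /
  (prevalence * hit_rate + (1 - prevalence) * false_alarm)

c(from_counts = from_counts, from_bayes = from_bayes)
```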

8. Suppose we have a test with a 97% specificity and a 99% sensitivity. Now suppose that a random person is selected, has a first test that is positive, then is retested and has a second test that is negative.
Taking into account both tests, and assuming the results of the two tests are independent, what is the probability that the person has the disease?

Hint: We can use the posterior after the first test as a prior for the
second test. Be sure to keep as many decimal digits as possible (use R and
don't round intermediate results).

Note: In this problem we are assuming that the results of the two tests
are independent, which might not be the case for some medical tests.
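
The approach in the hint, using one posterior as the next prior, might be organized as a small helper function. Everything in this sketch is hypothetical: the sensitivity, specificity, and especially the prevalence are made-up values chosen for illustration, and `update_belief` is a name invented here, not a function from any package.

```r
# Hypothetical test characteristics and prevalence (all assumed for illustration).
sensitivity <- 0.90   # p(test positive | disease)
specificity <- 0.80   # p(test negative | no disease)
prevalence  <- 0.02   # prior p(disease)

# One step of Bayes' rule for a single test result.
update_belief <- function(prior, positive) {
  lik_disease <- if (positive) sensitivity else 1 - sensitivity
  lik_healthy <- if (positive) 1 - specificity else specificity
  prior * lik_disease / (prior * lik_disease + (1 - prior) * lik_healthy)
}

# First test positive, then use that posterior as the prior for a negative retest.
after_first  <- update_belief(prevalence, positive = TRUE)
after_second <- update_belief(after_first, positive = FALSE)

after_first
after_second
```

Keeping `after_first` as an unrounded R object, rather than retyping a rounded number, avoids rounding error in the second step.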

9. Consider again the disease and diagnostic test of the previous exercise.

Suppose that a person selected at random from the population gets the test and it comes back negative. Compute the probability that the person has the disease.
The person then gets re-tested, and on the second test the result is positive. Compute the probability that the person has the disease.
How does the result compare with your answer in the previous exercise?