Resampling with mosaic

Randall Pruim
eCOTS 2014

You can do() it.

Lady Tasting Tea

  • Often used on first day of class
  • Story
    • woman claims she can tell whether milk has been poured into tea or vice versa.
    • Question: How do we test this claim?

Lady Tasting Tea

  • Often used on first day of class
  • Story
    • woman claims she can tell whether milk has been poured into tea or vice versa.
    • Question: How do we test this claim?
      • Have her taste some tea prepared each way
        (Flip a coin to decide which way.)
      • How many cups? (Let's try 10.)
      • How do we evaluate her score?

Physical Simulation

  • Have students guess a sequence of 10 heads and tails.

  • Flip a coin 10 times and compare.

  • Before the reveal, ask students

    • What do you expect your score to be?
    • What do you think the best score in the class will be?
    • Is a perfect score possible just by guessing? Likely?
    • What about 9 out of 10?

Computer Simulation

Use rflip() to simulate flipping coins

rflip()

Flipping 1 coin [ Prob(Heads) = 0.5 ] ...

H

Number of Heads: 1 [Proportion Heads: 1]

Computer Simulation

Faster if we flip multiple coins at once:

rflip(10)

Flipping 10 coins [ Prob(Heads) = 0.5 ] ...

H H H T T T H H H T

Number of Heads: 6 [Proportion Heads: 0.6]
  • easier to consider heads = correct; tails = incorrect than to compare with a given pattern
    • this switch bothers me more than it bothers my students

Let's do that a lot of times

rflip(10) simulates 1 lady tasting 10 cups 1 time.

We can do that many times to see how guessing ladies do:

do(2) * rflip(10)
   n heads tails prop
1 10     3     7  0.3
2 10     5     5  0.5
  • do() is clever about what it remembers
  • 2 isn't many – we'll do many next.

The distribution of guessing ladies

Ladies <- do(5000) * rflip(10)
head(Ladies, 1)
   n heads tails prop
1 10     4     6  0.4
histogram( ~ heads, data=Ladies, width=1 )

plot of chunk unnamed-chunk-6

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data=Ladies)

 TRUE FALSE 
   63  4937 

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data=Ladies, 
                       format="prop")

  TRUE  FALSE 
0.0126 0.9874 

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data=Ladies, 
                       format="prop")

  TRUE  FALSE 
0.0126 0.9874 
 prop( ~(heads >= 9), data=Ladies)
  TRUE 
0.0126 

A general approach to randomization

  1. Do it for your data
  2. Do it for “random” data
  3. Do it lots of times for “random” data
  • definition of “random” is important, but can often be handled by shuffle() or resample()

Example: Do mean ages differ by sex?

diffmean(age ~ sex, data=HELPrct)
diffmean 
 -0.7841 
do(1) * 
  diffmean(age ~ shuffle(sex), data=HELPrct)
  diffmean
1    1.798
Null <- do(5000) * 
  diffmean(age ~ shuffle(sex), data=HELPrct)

Example: Do mean ages differ by sex?

prop( ~(abs(diffmean) > 0.7841), data=Null )
  TRUE 
0.3586 
histogram(~ diffmean, data=Null, v=-.7841) 

plot of chunk unnamed-chunk-11

Example: Bootstrap for difference in means

Bootstrap <- do(5000) * 
  diffmean(age~sex, data= resample(HELPrct))

histogram( ~diffmean, data=Bootstrap, 
                      v=-.7841, glwd=4 )

plot of chunk unnamed-chunk-12

Example: CI for difference in mean ages

cdata(~diffmean, data=Bootstrap, p=.95)
      low        hi central.p 
  -2.4617    0.8276    0.9500 
confint(Bootstrap, method="quantile")
      name  lower  upper level   method
1 diffmean -2.462 0.8276  0.95 quantile

Example: CI for difference in mean ages

confint(Bootstrap)  # default uses st. err
      name  lower  upper level method estimate margin.of.error
1 diffmean -2.432 0.8761  0.95 stderr   -0.778           1.654

Randomization and linear models

do(1) * lm(width ~ length, data=KidsFeet)
  Intercept length  sigma r.squared
1     2.862 0.2479 0.3963     0.411
do(3) * lm( width ~ shuffle(length), data=KidsFeet)
  Intercept    length  sigma r.squared
1     8.975 0.0007113 0.5164 3.382e-06
2     8.416 0.0232976 0.5155 3.629e-03
3     8.345 0.0261777 0.5152 4.581e-03

Randomization and linear models

do(1) * 
  lm(width ~ length + sex, data=KidsFeet)
  Intercept length    sexG  sigma r.squared
1     3.641  0.221 -0.2325 0.3849    0.4595
do(3) * 
  lm( width ~ length + shuffle(sex), 
                       data=KidsFeet)
  Intercept length      sexG  sigma r.squared
1     2.907 0.2486 -0.127647 0.3963    0.4271
2     2.883 0.2464  0.034996 0.4014    0.4122
3     2.849 0.2487 -0.008942 0.4018    0.4111

Randomization and linear models

Null <- do(5000) * 
  lm( width ~ length + shuffle(sex), 
                       data=KidsFeet)
histogram( ~ sexG, data=Null, 
           v=-0.2325, glwd=4)

plot of chunk unnamed-chunk-17

Randomization and linear models

histogram( ~ sexG, data=Null, 
           v=-0.2325, glwd=4)

plot of chunk unnamed-chunk-18

prop( ~ (sexG <= -0.2325), data=Null )
  TRUE 
0.0382