Resampling with mosaic

Randall Pruim
eCOTS 2014

You can do() it.

Lady Tasting Tea

Often used on first day of class
Story
- woman claims she can tell whether milk has been poured into tea or vice versa.
- Question: How do we test this claim?

Lady Tasting Tea

Often used on first day of class
Story
- woman claims she can tell whether milk has been poured into tea or vice versa.
- Question: How do we test this claim?
  - Have her taste some tea prepared each way
    (Flip a coin to decide which way.)
  - How many cups? (Let's try 10.)
  - How do we evaluate her score?

Physical Simulation

Have students guess a sequence of 10 heads and tails.
Flip a coin 10 times and compare.
Before the reveal, ask students
- What do you expect your score to be?
- What do you think the best score in the class will be?
- Is a perfect score possible just by guessing? Likely?
- What about 9 out of 10?

Computer Simulation

Use rflip() to simulate flipping coins

rflip()


Flipping 1 coin [ Prob(Heads) = 0.5 ] ...

H

Number of Heads: 1 [Proportion Heads: 1]

Computer Simulation

Faster if we flip multiple coins at once:

rflip(10)


Flipping 10 coins [ Prob(Heads) = 0.5 ] ...

H H H T T T H H H T

Number of Heads: 6 [Proportion Heads: 0.6]

easier to consider heads = correct; tails = incorrect than to compare with a given pattern
- this switch bothers me more than it bothers my students

Let's do that a lot of times

rflip(10) simulates 1 lady tasting 10 cups 1 time.

We can do that many times to see how guessing ladies do:

do(2) * rflip(10)

   n heads tails prop
1 10     3     7  0.3
2 10     5     5  0.5

do() is clever about what it remembers
2 isn't many – we'll do many next.

The distribution of guessing ladies

Ladies <- do(5000) * rflip(10)
head(Ladies, 1)

   n heads tails prop
1 10     4     6  0.4

histogram( ~ heads, data=Ladies, width=1 )

plot of chunk unnamed-chunk-6

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data=Ladies)


 TRUE FALSE 
   63  4937

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data=Ladies, 
                       format="prop")


  TRUE  FALSE 
0.0126 0.9874

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data=Ladies, 
                       format="prop")


  TRUE  FALSE 
0.0126 0.9874

 prop( ~(heads >= 9), data=Ladies)

  TRUE 
0.0126

A general approach to randomization

Do it for your data
Do it for “random” data
Do it lots of times for “random” data

definition of “random” is important, but can often be handled by shuffle() or resample()

Example: Do mean ages differ by sex?

diffmean(age ~ sex, data=HELPrct)

diffmean 
 -0.7841

do(1) * 
  diffmean(age ~ shuffle(sex), data=HELPrct)

  diffmean
1    1.798

Null <- do(5000) * 
  diffmean(age ~ shuffle(sex), data=HELPrct)

Example: Do mean ages differ by sex?

prop( ~(abs(diffmean) > 0.7841), data=Null )

  TRUE 
0.3586

histogram(~ diffmean, data=Null, v=-.7841)

plot of chunk unnamed-chunk-11

Example: Bootstrap for difference in means

Bootstrap <- do(5000) * 
  diffmean(age~sex, data= resample(HELPrct))

histogram( ~diffmean, data=Bootstrap, 
                      v=-.7841, glwd=4 )

plot of chunk unnamed-chunk-12

Example: CI for difference in mean ages

cdata(~diffmean, data=Bootstrap, p=.95)

      low        hi central.p 
  -2.4617    0.8276    0.9500

confint(Bootstrap, method="quantile")

      name  lower  upper level   method
1 diffmean -2.462 0.8276  0.95 quantile

Example: CI for difference in mean ages

confint(Bootstrap)  # default uses st. err

      name  lower  upper level method estimate margin.of.error
1 diffmean -2.432 0.8761  0.95 stderr   -0.778           1.654

Randomization and linear models

do(1) * lm(width ~ length, data=KidsFeet)

  Intercept length  sigma r.squared
1     2.862 0.2479 0.3963     0.411

do(3) * lm( width ~ shuffle(length), data=KidsFeet)

  Intercept    length  sigma r.squared
1     8.975 0.0007113 0.5164 3.382e-06
2     8.416 0.0232976 0.5155 3.629e-03
3     8.345 0.0261777 0.5152 4.581e-03

Randomization and linear models

do(1) * 
  lm(width ~ length + sex, data=KidsFeet)

  Intercept length    sexG  sigma r.squared
1     3.641  0.221 -0.2325 0.3849    0.4595

do(3) * 
  lm( width ~ length + shuffle(sex), 
                       data=KidsFeet)

  Intercept length      sexG  sigma r.squared
1     2.907 0.2486 -0.127647 0.3963    0.4271
2     2.883 0.2464  0.034996 0.4014    0.4122
3     2.849 0.2487 -0.008942 0.4018    0.4111

Randomization and linear models

Null <- do(5000) * 
  lm( width ~ length + shuffle(sex), 
                       data=KidsFeet)
histogram( ~ sexG, data=Null, 
           v=-0.2325, glwd=4)

plot of chunk unnamed-chunk-17

Randomization and linear models

histogram( ~ sexG, data=Null, 
           v=-0.2325, glwd=4)

plot of chunk unnamed-chunk-18

prop( ~ (sexG <= -0.2325), data=Null )

  TRUE 
0.0382