The mosaic package provides facilities for putting model fitting into a “mathematical function” framework. lm() and glm() do the fitting, and mosaic::makeFun() extracts the fitted model as a function. mosaic::fitModel() provides a named-parameter interface to fitting, both linear and nonlinear. xyplot() and mosaic::plotFun() allow students to display the model functions over the data.
hmod <- lm( height ~ father, data=subset(Galton,sex=="M"))
Extract the function … just like resid() or fitted():
hfun <- makeFun(hmod)
Plot it … first data, then the function.
xyplot(height ~ father | sex, data=Galton)
plotFun( hfun(x) ~ x, add=TRUE)
mod2 <- lm( height ~ father + sex, data=Galton)
hfun2 <- makeFun(mod2)
xyplot(height ~ father | sex, data=Galton)
plotFun( hfun2(x, sex="F") ~ x, add=TRUE)
plotFun( hfun2(x, sex="M") ~ x, add=TRUE, col='red')
Or …
mod3 <- lm( height ~ father*mother + sex, data=Galton)
hfun3 <- makeFun(mod3)
plotFun( hfun3(father=x, mother=y, sex="F") ~ x&y,
x.lim=c(60,80),y.lim=c(57,73))
plotPoints( mother ~ father, data=Galton, add=TRUE,
pch=20,col='red')
hfun2( father=66, sex=c('M','F'), interval='confidence')
## fit lwr upr
## 1 67.87 67.59 68.16
## 2 62.70 62.40 62.99
hfun2( father=66, sex=c('M','F'), interval='prediction')
## fit lwr upr
## 1 67.87 63.40 72.35
## 2 62.70 58.22 67.18
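Behind the scenes, the interval argument is (to the best of my understanding) handed off to predict() for lm models; the same numbers are available directly from base R, which can reassure students that makeFun() isn't doing anything mysterious:

```r
# The same intervals via base R's predict() (a sketch; makeFun() is
# believed to delegate to predict() for lm models):
newdat <- data.frame(father = 66, sex = c('M', 'F'))
predict(mod2, newdata = newdat, interval = 'confidence')
predict(mod2, newdata = newdat, interval = 'prediction')
```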
ggplot(data=Galton, aes(x=father, y=height)) + geom_point() +
aes(colour=sex) + stat_smooth(method=lm) +
theme(legend.position="none") + labs(title="")
Convenient interface via mosaic::mPlot().
One of the objectives of Project MOSAIC is to make connections between statistics and mathematics and computation. The mosaic package offers some features specifically oriented to doing math.
These are suggestions or demonstrations appropriate in teaching statistical modeling to students not afraid of math, e.g. engineers, statistics minors, …
f <- makeFun( 3*x + 2 ~ x)
Read this as, “make a function f that takes \(x\) as an input and produces \(3x + 2\) as the output.”
Evaluating the function is a matter of providing an input:
f(4)
## [1] 14
It’s easy to plot functions:
plotFun( 3*x + 2 ~ x, x.lim=range(-5,5) )
Or, if we already have the function f, give \(x\) as an input:
plotFun( f(x) ~ x, x.lim=range(-5,5) )
Functions of multiple variables follow the same scheme:
g <- makeFun( 3*x*y + 2*x -4*y +2 ~ x&y)
g(x=0,y=2)
## [1] -6
plotFun( g(x=x,y=y)~x&y,
x.lim=range(-2,2),y.lim=range(-2,2) )
Or even cross-sections or parametrically …
plotFun( g(x=x, y=0.5) ~ x, x.lim=range(-2,2) )
plotFun( g(x=3*cos(t),y=sin(t))~ t, t.lim=range(-5,5))
Why is this relevant to statistical modeling?
The workhorses of model building in R are lm() and glm(). They are great, but unfortunately they return values that are not what students are used to.
In particular, students are used to lines expressed as \(y = m x + b\), etc.
lm() gives back something different, e.g.
lm( height ~ father, data=Galton )
## (Intercept) father
## 39.1104 0.3994
How different this is from \(y = mx + b\)!
And then, what happens when you plot a model …
m <- lm( height ~ father, data=Galton )
plot(m)   # diagnostic plots, not the fitted line students expect!
The lm() function was written by statisticians for statisticians. Confidence intervals and p-values are easy:
coef(summary(m))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.1104 3.22706 12.120 2.056e-31
## father 0.3994 0.04658 8.574 4.355e-17
lm() is particularly valuable when there are multiple explanatory variables.
m2 <- lm( height ~ sex + father + mother, data=Galton)
coef(summary(m2))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.3448 2.74696 5.586 3.082e-08
## sexM 5.2260 0.14401 36.289 5.787e-178
## father 0.4060 0.02921 13.900 6.526e-40
## mother 0.3215 0.03128 10.277 1.702e-23
One strategy is to start with what students already know, and build on that. Students are used to writing a line as a function:
height(father) = m * father + b
The variable names are set by the data, Galton in this case:
names(Galton)
## [1] "family" "father" "mother" "sex" "height" "nkids"
In fitModel() notation, that function becomes the formula:
height ~ m * father + b
And do the fitting:
f <- fitModel( height ~ m*father + b, data=Galton)
You plot this the same way as “mathematical” functions:
plotFun( f(father=x)~x, x.lim=range(60,80) )
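For the curious, fitModel() fits by nonlinear least squares; this sketch assumes it behaves like base R's nls(). The starting values here are hand-picked for illustration, whereas fitModel() chooses them automatically:

```r
# Equivalent fit with base R's nls(); the start values are an assumption
# for this sketch, not something fitModel() requires:
nmod <- nls(height ~ m * father + b, data = Galton,
            start = list(m = 0.5, b = 30))
coef(nmod)   # should agree closely with coef(lm(height ~ father, data=Galton))
```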
If you want to talk about residuals, least squares, etc, you can do so. For instance, have students guess the function before fitting it …
xyplot( height ~ father, data=Galton)
myGuess <- makeFun(40 + .35*x ~ x )
plotFun( myGuess(x) ~ x, add=TRUE, col='red')
plotFun( f(father=x) ~ x, add=TRUE, col='green')
Which is better?
mean( (myGuess(father)-height)^2, data=Galton )
## [1] 18.26
mean( (f(father) - height)^2, data=Galton)
## [1] 11.85
Students can explore these questions:
f0 <- fitModel(height ~ a, data=Galton)
f1 <- fitModel(height ~ a + b*father, data=Galton)
f2 <- fitModel(height ~ a + b*father + c*father^2, data=Galton)
xyplot( height ~ father, data=Galton)
plotFun( f0(father=x)~x, add=TRUE)
plotFun( f1(father=x)~x, add=TRUE, col='red')
plotFun( f2(father=x)~x, add=TRUE, col='green')
Which function is better? (Hint: by in-sample mean square error, the more flexible model always wins, so the ranking doesn’t depend on the data!)
mean( (f0(father)-height)^2, data=Galton )
## [1] 12.82
mean( (f1(father)-height)^2, data=Galton )
## [1] 11.85
mean( (f2(father)-height)^2, data=Galton )
## [1] 11.83
The fitted model from lm() is itself a function of the explanatory variables; it just isn’t packaged as such. We can package it with mosaic::makeFun():
m1 <- lm( height ~ father, data=Galton)
g1 <- makeFun(m1)
g1( father=65 )
## 1
## 65.07
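The value returned by g1() is just the coefficient arithmetic students expect. A quick check, assembled by hand from the coefficients printed earlier:

```r
# makeFun() output matches b + m * father computed from the coefficients:
cf <- coef(m1)
cf["(Intercept)"] + cf["father"] * 65   # 39.1104 + 0.3994 * 65 = 65.07
```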
While we’re at it …
Let’s do what Galton never could: Include women!
And we can take a liberal view about how the child came to be: the father and the mother multiplying:
mod <- lm( height ~ sex + father*mother, data=Galton )
f <- makeFun( mod )
f( sex='F', father=65, mother=65 )
## 1
## 62.59
plotFun( f(sex='M', father=x, mother=y) ~ x & y,
x.lim=range(60,80), y.lim=range(60,80))
Smoothers, too, produce functions from data:
g <- smoother( height ~ father, data=Galton, degree=1 )
xyplot(height ~ father, data=Galton)
plotFun( g(father=x)~x, add=TRUE, lwd=3 )
plotFun( f1(father=x)~x, add=TRUE, col='red',
lwd=3, alpha=.5)
Where the data are, the two functions are basically the same.
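That visual impression can be checked numerically; a sketch (not in the original) comparing the two functions at a few father heights inside the range of the data:

```r
# Evaluate both functions where the data live; the differences are small:
fathers <- c(65, 68, 71)
g(father = fathers) - f1(father = fathers)
```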
Example: Mortality at 20 years of follow-up for smokers and non-smokers.
sample( Whickham, size=5 )
## outcome smoker age orig.ids
## 665 Alive No 22 665
## 781 Alive No 32 781
## 1062 Dead No 67 1062
## 956 Dead No 72 956
## 200 Alive No 51 200
counts <- tally( ~ outcome & smoker, data=Whickham )
chisq.test( counts )
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: counts
## X-squared = 8.752, df = 1, p-value = 0.003093
And what’s the effect size? In what direction?
tally( outcome ~ smoker, data=Whickham )
## smoker
## outcome No Yes
## Alive 0.6858 0.7612
## Dead 0.3142 0.2388
Really? Smokers had a larger chance of surviving?
Just as there are smoothers, there are function types suited to Dead/Alive outcomes … e.g. logistic regression.
smod <- glm( outcome=='Alive' ~ age*smoker, family='binomial',
data=Whickham)
f <- makeFun(smod)
For a 60-year old …
f( age=60, smoker=c('Yes','No'))
## 1 2
## 0.5129 0.5437
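These probabilities come from the usual logistic form. As a check, the non-smoker value can be assembled by hand from the model coefficients (a sketch, using plogis() for the inverse logit; the coefficient names assume R's default contrasts):

```r
# For a 60-year-old non-smoker, p = plogis(linear predictor):
cf <- coef(smod)
eta <- cf["(Intercept)"] + cf["age"] * 60   # smoker = 'No' is the baseline level
plogis(eta)                                  # should match f(age=60, smoker='No')
```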
A function is a function, whatever the form!
xyplot( outcome=='Alive' ~ age | smoker, data=Whickham,
pch=20, alpha=.2 )
plotFun( f(age=age, smoker="No") ~ age, add=TRUE)
plotFun( f(age=age, smoker="Yes") ~ age, add=TRUE,
col='red', lwd=2)
Functions also make resampling inference visual. Fit the model to the actual data, then overlay fits to resampled versions:
actual <- lm(height ~ father+sex, data=Galton)
f <- makeFun( actual )
plotFun(f(x, sex='F')~x, x.lim=range(65,80))
# Confidence interval?
for (k in 1:50) {
  trial <- lm(height ~ father+sex, data=resample(Galton) )
  f <- makeFun( trial )
  print(plotFun(f(x, sex='F')~x, add=TRUE, col='blue', alpha=.5))
}
# Null Hypothesis?
for (k in 1:10) {
  trial <- lm(height ~ father+shuffle(sex), data=Galton )
  f <- makeFun( trial )
  print(plotFun(f(x, sex='F')~x, add=TRUE, col='red', alpha=.5))
}
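The overlaid curves can also be summarized numerically. A sketch (not from the original) that collects the sexM coefficient across resamples into a bootstrap interval:

```r
# Bootstrap distribution of the sex coefficient; 500 resamples is an
# arbitrary choice for illustration:
slopes <- replicate(500, coef(lm(height ~ father + sex,
                                 data = resample(Galton)))["sexM"])
quantile(slopes, c(0.025, 0.975))
```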