Nonlinear Models

Simple Linear Model

\[ Y = \beta_0 + \beta_1 x + \varepsilon \qquad \mbox{where $\varepsilon \sim {\sf Norm}(0,\sigma)$.} \]

In other words

\[ Y \sim {\sf Norm}(\beta_0 + \beta_1 x, \sigma) \]

Nonlinear Model

\[ Y = f(x) + \varepsilon \qquad \mbox{where $\varepsilon \sim {\sf Norm}(0,\sigma)$.} \]

In other words

\[ Y \sim {\sf Norm}(f(x), \sigma) \]

Basically, we replace the linear function with some other function.

Advantages:

  • flexible – we can construct \(f()\) any way we like

  • natural – we don’t have to come up with transformations to make things linear

Disadvantages:

  • The mathematics isn’t as nice – no nice formulas for the parameter estimates

  • Fitting (minimizing SSE) has to be done numerically (and the algorithm can sometimes fail)

Inconvenience

  • Since non-linear models are more flexible, we will need to work harder to describe them to R.
    • The lm() notation only requires you to specify the two variables – it knows the model will be linear.
    • The nls() function will require us to
      • describe the function we are using,
      • specify which parts of the formula are parameters to be fit, and
      • provide initial values for each parameter as a starting point for the numerical optimization.

Examples using nls()

Example: Dropping balls

Let’s start with a redo of a model we fit using lm():

\[ \mbox{time} = \beta_0 + \beta_1 \sqrt{\mbox{height}} + \varepsilon \]

Here’s our fit using lm().

library(fastR2)
ball.lm <- lm(time ~ sqrt(height), data = BallDrop)

And with nls():

ball.nls <- nls(time ~ beta1 * sqrt(height) + beta0, data = BallDrop, 
    start = list(beta0 = 0, beta1 = 1))
  • time ~ beta1 * sqrt(height) + beta0

    • explicit description of the model function (including parameters)
  • start = list(beta0 = 0, beta1 = 1)

    • tells R what the parameters are, and
    • where to start the numerical search for the parameter values that produce the smallest sum of squared residuals.

Since we are fitting the same model two different ways, it isn’t surprising that they give the same parameter estimates.

coef(ball.lm)
##  (Intercept) sqrt(height) 
##       0.0161       0.4308
coef(ball.nls)
##  beta0  beta1 
## 0.0161 0.4308

Summary information for the two models is similar, but not identical.

msummary(ball.lm)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.01608    0.00408    3.94    5e-04 ***
## sqrt(height)  0.43080    0.00486   88.58   <2e-16 ***
## 
## Residual standard error: 0.00523 on 28 degrees of freedom
## Multiple R-squared:  0.996,  Adjusted R-squared:  0.996 
## F-statistic: 7.85e+03 on 1 and 28 DF,  p-value: <2e-16
msummary(ball.nls)
## 
## Formula: time ~ beta1 * sqrt(height) + beta0
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## beta0  0.01608    0.00408    3.94    5e-04 ***
## beta1  0.43080    0.00486   88.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00523 on 28 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 2.11e-07

In particular, we are getting the same standard error estimates as well.

Example: Dropping balls – new model

Now let’s fit a different model, namely

\[ \mbox{time} = \alpha \cdot \mbox{height}^p \]

In this model

  • there is no intercept
  • the power is not specified – we estimate it from the data.
power.model1 <- 
  nls(time ~ alpha * height^power, data = BallDrop, 
      start = c(alpha = 1, power = .5))
coef(summary(power.model1))
##       Estimate Std. Error t value Pr(>|t|)
## alpha    0.447    0.00134   333.1 6.33e-52
## power    0.480    0.00581    82.6 5.39e-35

A power law can also be fit using lm() by using a log-log transformation.

\[\begin{align*} \log(\mbox{time}) &= \log(\alpha \cdot \mbox{height}^p) \\ &= \log(\alpha) + p \log(\mbox{height}) \end{align*}\]

In this model

  • The slope (\(\beta_1\)) is \(p\).
  • The intercept (\(\beta_0\)) is \(\log(\alpha)\), so \(\alpha = e^{\beta_0}\).
power.model2 <- lm(log(time) ~ log(height), data = BallDrop)
coef(summary(power.model2))
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.808    0.00433  -186.5 7.10e-45
## log(height)    0.472    0.00642    73.5 1.43e-33

Since \(\alpha = e^{\beta_0}\), we can use the delta method to get the estimate and uncertainty for \(\alpha\).

exp(coef(power.model2)[1])
## (Intercept) 
##       0.446

Since \(\frac{d}{dx} e^x = e^x\), the uncertainty is approximately \[ e^{-0.808} \cdot 0.0043 = 0.446 \cdot 0.0043 \approx 0.002 \]
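The same computation in R, using the unrounded values from the summary table (a sketch of the delta method calculation; b0 and se.b0 are just temporary names):

b0    <- coef(power.model2)[1]                          # estimate of beta0
se.b0 <- coef(summary(power.model2))[1, "Std. Error"]   # standard error of beta0
exp(b0)          # estimate of alpha = e^beta0
exp(b0) * se.b0  # delta method: |d/dx e^x| evaluated at b0, times SE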

In this case, our results are similar, but not identical. In fact, the two models are not equivalent.

  • When we use lm(), the residuals are computed after we transform \(y\) to the log scale – so the residuals are on the log scale

  • When we use nls(), the residuals are computed on the natural scale.

Both models want their error term to be normally distributed with a mean of 0, but they think about the error differently. In particular, for lm(), if we include the error term in our algebra, we see that

\[ \log(y) = \beta_0 + \beta_1 \log(x) + \varepsilon \]

So

\[ y = e^{\beta_0} \cdot x^{\beta_1} \cdot e^{\varepsilon} \]

This means that for this model on the natural scale,

  • in nls() the errors are additive [absolute difference]
  • in lm(), the errors (residuals) are multiplicative [percentage difference]

When the range of \(x\) spans multiple orders of magnitude, this can make a big difference. One model says the points should be roughly the same distance from the curve; the other says they should be roughly the same percentage away from the curve.
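We can see the difference concretely. For the lm() fit, exponentiating a residual gives the ratio of observed to predicted response on the natural scale; a quick sketch:

head(exp(resid(power.model2)))   # multiplicative: 1.02 means about 2% above the curve
head(resid(power.model1))        # additive: observed minus predicted time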

We can use our diagnostic plots to see whether one of these models matches the data better. Unfortunately, mplot() doesn’t (yet) handle nls models, and plot() doesn’t do nearly as much. So we’ll just make the plots using ggformula.

gf_point(resid(power.model1) ~ height, data = BallDrop, title = "Model 1") %>%
  gf_smooth(color = "steelblue") /
gf_point(resid(power.model1) ~ fitted(power.model1), title = "Model 1") %>%
  gf_smooth(color = "red") /
gf_qq( ~ resid(power.model1), title = "Model 1") %>%
  gf_qqline() |
gf_point(resid(power.model2) ~ height, data = BallDrop, title = "Model 2") %>%
  gf_smooth(color = "steelblue") /
gf_point(resid(power.model2) ~ fitted(power.model2), title = "Model 2") %>% 
  gf_smooth(color = "red") /
gf_qq( ~ resid(power.model2), title = "Model 2") %>%
  gf_qqline() 
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'

In this case, a slight edge goes to model 1 (the nls model) because the standard deviation of the residuals looks more nearly constant. In model 2, the residuals seem to be more variable on the left side of the plot and less variable on the right.

Example: Ball Drop – A model lm() can’t handle

Here is a small modification of our previous model that adds in an intercept (\(k\)):

\[ \mbox{time} = \alpha \cdot \mbox{height}^p + k \]

power.model3 <- 
  nls(time ~ alpha * height^power + k, data = BallDrop, 
      start = c(alpha = 1, power = .5, k = 0))
coef(summary(power.model3))
##       Estimate Std. Error t value Pr(>|t|)
## alpha   0.3761     0.0208   18.11 1.24e-16
## power   0.5929     0.0408   14.54 2.74e-14
## k       0.0702     0.0204    3.45 1.86e-03

Given the estimate and uncertainty for \(k\), we see that 0 might not be the best choice for \(k\); a quick check of this follows the notes below. Note:

  • In terms of physics, \(k = 0\) is clearly right.

  • In terms of the experiment, there may be a reason why the times are off by a constant \(k\), perhaps due to the way height is measured or the specifics of the timing device used.
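One quick way to check whether 0 is a plausible value for \(k\) is a rough 95% interval, estimate \(\pm 2\) standard errors (a sketch using the summary table above; est and se are temporary names):

est <- coef(summary(power.model3))["k", "Estimate"]
se  <- coef(summary(power.model3))["k", "Std. Error"]
est + c(-2, 2) * se   # an interval that misses 0 suggests k = 0 fits the data poorly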

gf_point(resid(power.model3) ~ height, data = BallDrop, title = "Model 3") /
gf_qq(~ resid(power.model3), data = BallDrop, title = "Model 3") %>% 
  gf_qqline() |
gf_point(resid(power.model1) ~ height, data = BallDrop, title = "Model 1") /
gf_qq(~ resid(power.model1), data = BallDrop, title = "Model 1")  %>%
  gf_qqline()

Plotting the original data and the three model fits on the natural scale we see that they are similar, but not identical.

gf_point(time ~ height, data = BallDrop) %>%
  gf_lims(x = c(0, 1.7)) %>%
  gf_fun(makeFun(power.model1), color = ~ "Model 1") %>%
  gf_fun(makeFun(power.model2), color = ~ "Model 2") %>%
  gf_fun(makeFun(power.model3), color = ~ "Model 3")

Widening the window, we can see how models 1 and 2 force the predicted time to be 0 when height is 0 and model 3 does not.
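We can also check this numerically by evaluating each fitted model function at a height of 0 (a sketch, relying on makeFun(), as in the plot above, to back-transform the log-scale fit):

f1 <- makeFun(power.model1)
f2 <- makeFun(power.model2)
f3 <- makeFun(power.model3)
f1(height = 0)   # alpha * 0^power = 0
f2(height = 0)   # exp(beta0) * 0^beta1 = 0
f3(height = 0)   # alpha * 0^power + k = k, about 0.07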

Example: Cooling Water

A professor at Macalester College put hot water in a mug and recorded the temperature as it cooled. Let’s see if we can fit a reasonable model to this data.

gf_point(temp ~ time, data = CoolingWater, ylab = "temp (C)", xlab = "time (sec)")

Our first guess might be some sort of exponential decay:

cooling.model1 <- 
  nls(temp ~ A * exp( -k * time), data = CoolingWater, 
      start = list(A = 100, k = 0.1))
gf_point(temp ~ time, data = CoolingWater, 
         ylab = "temp (C)", xlab = "time (sec)") %>%
  gf_fun(makeFun(cooling.model1)) %>%
  gf_lims(x = c(0, 300), y = c(0, 110))

That doesn’t fit very well, and there is a good reason. The model says that eventually the water will freeze because \[ \lim_{t \to \infty} A e^{-k t} = 0 \] when \(k >0\). But clearly our water isn’t going to freeze sitting on a lab table. We can fix this by adding in an offset to account for the ambient temperature:

cooling.model2 <- nls(temp ~ ambient + A * exp(k * (1+time)), data = CoolingWater,
                      start = list(ambient = 20, A = 80, k = -.1) )
gf_point(temp ~ time, data = CoolingWater, 
         ylab = "temp (C)", xlab = "time (sec)") %>%
  gf_fun(makeFun(cooling.model2), linetype = 2, color = "red") %>%
  gf_fun(makeFun(cooling.model1), color = "steelblue") %>%
  gf_lims(x = c(0, 300), y = c(0, 110))

This fits much better. Furthermore, this model can be derived from a differential equation \[ \frac{dT}{dt} = -k (T - T_{\mathrm{ambient}}) \;, \] known as Newton’s Law of Cooling.
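Solving this differential equation (by separation of variables) gives

\[ T(t) = T_{\mathrm{ambient}} + (T_0 - T_{\mathrm{ambient}}) \, e^{-kt} \]

where \(T_0\) is the initial temperature. This is exactly the ambient-plus-exponential form we just fit, with \(A\) playing the role of \(T_0 - T_{\mathrm{ambient}}\) (up to the sign convention for \(k\) and the use of \(1 + \mbox{time}\) in place of \(t\) in the R formula).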

Let’s take a look at the residual plots.

gf_point(resid(cooling.model2) ~ time, data = CoolingWater) 

plot(cooling.model2, which = 1)

Hmm. These plots show a clear pattern and very little noise. The fit doesn’t look as good when viewed this way.
It suggests that Newton’s Law of Cooling does not take into account all that is going on here. In particular, there is a considerable amount of evaporation (at least at the beginning, when the water is warmer). More complicated models that take this into account can fit even better. For a discussion of a model that includes evaporation, see http://stanwagon.com/public/EvaporationPortmannWagonMiER.pdf [1].


Things to remember about nls()

  1. You need to specify the model explicitly, including the parameters.

  2. You need to provide starting values for the numerical optimization search.

  3. The numerical optimization can fail.

    • Choosing a better starting point can help. It is good to choose starting values that are at least the correct sign and order of magnitude.

    • There are also some controls that can be used to tweak the algorithm used for minimizing the sum of squares, so if the first attempt fails, there may be ways to rescue the situation. [Knowing how to do this is not part of this course, but see the help for nls() if you want to know more; a small sketch follows this list.]
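For instance, nls() accepts a control argument and a trace flag. The sketch below is a hypothetical re-fit of the cooling model with tweaked settings; the particular values are arbitrary illustrations, not recommendations:

refit <- nls(temp ~ ambient + A * exp(k * (1 + time)), data = CoolingWater,
             start = list(ambient = 20, A = 80, k = -0.1),
             control = nls.control(maxiter = 200, warnOnly = TRUE),  # allow more iterations; warn instead of stopping on failure
             trace = TRUE)  # print SSE and parameter values at each iteration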

lm() or nls()?

  1. Some models cannot be expressed as linear models, even after transformations.

    In this case we only have one option, the non-linear model. So nls() is more flexible.

  2. Linear models are easier to fit.

    Linear models can be fit quickly and accurately without numerical optimization algorithms because they satisfy nice linear algebra properties.

    The use of numerical optimizers in non-linear least squares models makes them subject to potential problems with the optimizers. They may not converge, may converge to the wrong thing, or convergence may depend on choosing an appropriate starting point for the search.

  3. Two models might make different assumptions about the error terms.

    Inspection of the residuals will often indicate if one model does better than the other in this regard.

  4. Linear models fit with lm() provide an easy way to produce confidence intervals for a mean response or an individual response. The models fit using nls() do not have this capability, as the sketch below illustrates.
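For example, predict() for lm objects has an interval argument; a sketch (height = 1 is an arbitrary choice):

predict(ball.lm, newdata = data.frame(height = 1), interval = "confidence")   # mean response
predict(ball.lm, newdata = data.frame(height = 1), interval = "prediction")   # individual response

(For nls objects, predict() ignores interval requests.)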

Some Practice

1. More ball dropping You might remember this equation from calculus or physics:

\[ s = \frac{g}{2} t^2 \]

where \(s\) is the distance an object has fallen, \(t\) is the time it has fallen, and \(g\) is the acceleration due to gravity.

  1. Fit this model using the BallDrop data and nls(). Report the estimate and uncertainty for \(g\).

  2. Solve the equation for \(t^2\). Use this to fit a model using nls() with \(t^2\) as the response. Report the estimate and uncertainty for \(g\).

  3. Solve the equation for \(t\). Use this to fit a model using nls() with \(t\) as the response. Report the estimate and uncertainty for \(g\).

  4. Why are these models not all the same?

  5. Are the estimates for \(g\) consistent among the models? (What does that mean? Hint: think about uncertainty.)

  6. Is there any reason to prefer one of these models over the others?

  7. How might you use lm() to estimate \(g\)? Do it.

2. Use the Spheres data in fastR2 to estimate the density of the material the spheres are made of. [Hint: how is density related to radius and mass?]

3. By attaching a heavy object to the end of a string, it is easy to construct pendulums of different lengths. Some physics students did this to see how the period (time in seconds until a pendulum returns to the same location moving in the same direction) depends on the length (in meters) of the pendulum.
The students constructed pendulums of lengths varying from \(10\) cm to \(16\) m and recorded the period length (averaged over several swings of the pendulum). The resulting data are in the Pendulum data set in the fastR2 package.

Fit a power law to this data using nls() and using lm(). How do the two compare?


  [1] The model with evaporation adds another complication in that the resulting differential equation cannot be solved algebraically, so there is no algebraic formula to fit with nls(). But the method of least squares can still be used by creating a parameterized numerical function that computes the sum of squares and using a numerical minimizer to find the optimal parameter values. Since the use of numerical differential equation solvers is a bit beyond the scope of this course, we’ll leave that discussion for another day.