This introduction is loosely based on the following resources:

What is XGBoost?

The name of the game

  • X = eXtreme (high performance code with clever optimizations/approximations)

    • improvement over previous boosting algorithms like AdaBoost (Adaptive Boosting)
  • G = Gradient (think derivatives)

  • We’ll see what gradient and boosting refer to as we go along.

Key features

  • fast
  • flexible (regression, classification, ranking, missing data, customization, …)
  • excellent predictive behavior in a wide range of modeling situations
  • available in Python, R, Julia, Java, …

XGBoost PR

A small data set

XGBoost has achieved its popularity because it works so well with large data sets, but to learn how the method works, we are going to explore a very small data set.

sex  age  daily_comp  score
M     13  Y               4
F     23  Y               3
M     55  N              -1
F     14  N               1
F     60  N              -2

Our goal is to predict the score (how much the person would like a given product on a scale from -5 to 5) from some predictors (features) such as age, sex, and whether they use a computer most days. Here are our five people, affectionately named 1, 2, 3, 4, and 5:

We can load the data in python as follows:

import numpy as np
import pandas as pd
Five_People = pd.read_csv('data/five-people.csv')

Five_People
  sex  age daily_comp  score
0   M   13          Y      4
1   F   23          Y      3
2   M   55          N     -1
3   F   14          N      1
4   F   60          N     -2

Notice that python indexes the people 0 through 4. Actually, in the underlying C++ code, the indexing is also 0-based, so sometimes you need to use 0-based numbering even when using the R interface.

Fitting models

General framework for supervised learning

  • Predictors/features: \(x_i\)

  • Response: \(y_i\)

  • Specify set of available model functions \(\mathcal F\).

  • Parameters \(\theta\) distinguish the members of \(\mathcal F\).

  • Pick the “best” \(f \in \mathcal F\) by optimizing some metric that measures the quality of the model

    • optimization can be hard, for some models we settle for an approximate solution to the optimization.
    • The function being optimized is called the objective function.
  • Loss measures quality of fit. We want loss to be small.

    • Simple approach: objective = loss; fit model by minimizing loss.

Regularization: Guarding against overfitting

  • Most models will overfit if we just minimize loss.

    • Model learns general patterns of the data set, but also learns particular patterns that may not generalize. Model can’t tell the difference.

    • So other things may be done to prevent this. We can think of this as adding a penalty to the loss.

\[\begin{align*} \mbox{objective} &= \mbox{loss} + \mbox{penalty} \\ obj(\theta) &= L(\theta) + \Omega(\theta) \end{align*}\]

  • Typically the penalty is some measure of model complexity. More complex models are penalized more heavily.

  • Penalty often done at the very end

    • Fit by minimizing loss, then add on a penalty when evaluating the fit.
    • Examples: adjusted-\(R^2\) for regression, AIC
  • XGBoost builds in the penalty as the model is fit.

    • Other models take this approach as well.
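To make the loss + penalty recipe concrete, here is a small schematic sketch (not XGBoost code): the model function, the quadratic loss, and the ridge-style penalty are all placeholder choices for illustration.

import numpy as np

def objective(theta, X, y, model, lam = 1.0):
    # objective = loss + penalty (schematic sketch)
    y_hat = model(X, theta)                        # predictions from the model with parameters theta
    loss = np.sum((y - y_hat) ** 2)                # quadratic loss (one possible choice)
    omega = lam * np.sum(np.asarray(theta) ** 2)   # placeholder complexity penalty on the parameters
    return loss + omega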

Familiar Example: Linear Regression

Model form

\[\begin{align*} \mbox{prediction} &= \fbox{intercept} + \fbox{$\mathrm{coef}_1$} \cdot \mbox{pred}_1 + \fbox{$\mathrm{coef}_2$} \cdot \mbox{pred}_2 + \cdots + \fbox{$\mathrm{coef}_k$} \cdot \mbox{pred}_k \\ \hat y_i &= \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_k x_{ik} \\ \hat{\mathbf{y}} &= \mathbf{X} \hat{\mathbf{\beta}} \end{align*}\]

Loss


Quadratic Loss (Least Squares) take the sum of the squared residuals as the loss function:

\[ L(\mathbf{y}, \hat{\mathbf{y}}) = \sum_{i = 1}^n (y_i - \hat y_i)^2 \]
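We can compute this loss directly with numpy. As a small sketch (using age as the lone predictor, purely for illustration), np.linalg.lstsq finds the coefficients that minimize it:

import numpy as np
import pandas as pd

Five_People = pd.read_csv('data/five-people.csv')
y = Five_People['score'].to_numpy(dtype = float)

# design matrix: an intercept column plus age as the lone predictor
X = np.column_stack([np.ones(len(y)), Five_People['age'].to_numpy(dtype = float)])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond = None)   # coefficients that minimize the loss
y_hat = X @ beta_hat
loss = np.sum((y - y_hat) ** 2)                      # sum of squared residuals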

Penalty


Typically the number of parameters = \(1 +\) number of predictors is used as part of the penalty. Models with more terms are more flexible, and therefore more prone to overfitting. This is how adjusted-\(R^2\) works, for example.

More nuanced methods try to quantify the actual flexibility of the model, rather than assuming that all parameters are created equal.

XGBoost (for regression for now)

Model form

  • An ensemble of trees

Loss (very flexible)

  • \(L()\) must depend on \(\theta\) only through the predictions: \(L(\theta) = L(y, \hat y)\).
  • \(L()\) must be additive: \(L(y, \hat y) = \sum_{i = 1}^n l(y_i, \hat y_i)\).
  • \(l()\) must be twice differentiable [ideally with “simple derivatives”]
  • starter loss function: quadratic loss, \(l(y_i, \hat y_i) = (y_i - \hat y_i)^2\)

Penalty

  • We will say more about this as we go along
  • Denoted as \(\Omega(\theta)\)

Objective

  • \(obj(\theta) = L(\theta) + \Omega(\theta)\).

  • Include penalty as the model is being fit.

\[\begin{align*} obj(\theta) &= L(\theta) + \Omega(\theta) \\ & = \sum_{i=1}^n l(y_i, \hat y_i) + \Omega(\theta) \end{align*}\]

Trees as models

How a tree computes a function

Example
Some Notation/Terminology (danger: mixed metaphors ahead!)
  • root = beginning point for the tree
  • node = a “location” in the tree
  • parent/child = adjacent nodes; parent is closer to root
  • leaf = node with no children
  • depth = longest path from root to a leaf
  • \(q(x_i) =\) the leaf associated with \(x_i\) in a tree
  • \(w_q =\) the weight assigned to leaf \(q\)
  • So \(f(x_i) = w_{q(x_i)}\) (see the sketch below)
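In code, a small tree is just nested if/else logic: \(q()\) routes a row to a leaf, and the leaf’s weight is the prediction. Here is a sketch using a made-up two-leaf tree that splits on age < 40 (the threshold and the weights are hypothetical, chosen only to illustrate the notation):

import pandas as pd

Five_People = pd.read_csv('data/five-people.csv')

w = {'left': 2.0, 'right': -1.0}        # leaf weights (made-up values)

def q(row):
    # q(x_i): which leaf does this row land in?
    return 'left' if row['age'] < 40 else 'right'

def f(row):
    # f(x_i) = w_{q(x_i)}
    return w[q(row)]

Five_People.apply(f, axis = 1)          # the tree's prediction for each person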

What kinds of trees are allowed?


XGBoost uses trees that have these restrictions:

  • Binary – each non-leaf has two children
  • Each non-leaf splits on one predictor (\(<\) some threshold)

Simplest tree ever?

Q. What is the simplest possible tree?

  • How should its leaves be labeled?
  • What is the associated loss for this (labelled) tree?

The simplest tree has only one node. So \(\hat y_i = w\) for all \(i\) where \(w\) is the weight of the only node.

We want to minimize loss, so let’s do a little calculus.

\(L(y, \hat y) = \sum_i l(y_i, w) = \sum_i (y_i - w)^2\).

\(L' = - \sum_i 2 (y_i - w)\)

\(L' = 0\) when \(\sum_i y_i = nw\), so \(w = \overline{y}\).

\(L(y, \overline{y}) = \sum_i (y_i - \overline{y})^2 = (n-1) \operatorname{Var}(y)\).

So our “starter model” for XGBoost regression just predicts the average response, no matter what the predictor values are.
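We can check this numerically for our five people (a quick verification, not part of XGBoost itself):

import numpy as np
import pandas as pd

Five_People = pd.read_csv('data/five-people.csv')
y = Five_People['score'].to_numpy(dtype = float)

w = y.mean()                   # optimal weight for a one-leaf tree: w = mean(y) = 1.0
loss = np.sum((y - w) ** 2)    # 26, which matches (n - 1) * np.var(y, ddof = 1)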

Building a perfect tree

Q. How do we build a tree that fits perfectly? (0 loss).

Answer. Create a tree with as many leaves as rows of the data. (This only works if no rows of predictors are repeated.) Weight each leaf with the observed response value.

Something in between

Build a tree with 3 leaves and compute the loss for your tree.

Example

Let’s weight the leaves of this tree 4, 1, 0 from left to right. Then \(y = \langle 4, 3, -1, 1, -2 \rangle\), \(\hat y = \langle 4, 0, 0, 1, 0 \rangle\), and \(y - \hat y = \langle 0, 3, -1, 0, -2 \rangle\), so \(L = 0^2 + 3^2 + (-1)^2 + 0^2 + (-2)^2 = 14\).
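A quick check of that arithmetic:

import numpy as np

y     = np.array([4, 3, -1, 1, -2])
y_hat = np.array([4, 0,  0, 1,  0])    # predictions from the three-leaf tree above
loss  = np.sum((y - y_hat) ** 2)       # 0 + 9 + 1 + 0 + 4 = 14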

Can you find a tree with three leaves and smaller loss?

How an (additive) ensemble of trees computes a function

An example
Notation

So our model function is going to look like

\[ F(x_i) = \sum_t f_t(x_i) \]

where each \(f\) is a function determined by a (simple) tree.

Two Big Ideas

  • Big idea #1: combine lots of weak learners to make one strong learner.

  • Big idea #2: Boosting keeps refining our estimate by adding on a little more – giving the previous estimate a “boost”.

    • After the first step, each subsequent addition to the model is trying to make up for deficiencies in what has come before.

    • This is different from an ensemble approach that averages the results of many learners, each of which is attempting to solve the original problem.

Note: This same idea could work with models formed in other ways; the use of trees is not essential to the method. But trees work well in a wide range of applications, and usually when someone says XGBoost, they mean the tree boosting algorithm.

Penalizing Trees

Q. How should we penalize trees? (Hint: What kinds of things might cause a tree to overfit?)

  1. Trees with more leaves should be penalized more.
  2. Trees with larger weights should be penalized more.

If \(f\) is represented by a tree with \(T\) leaves, then XGBoost uses the following complexity measure:

\[ \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j = 1}^T w_j^2 \]

  • \(\lambda\) and \(\gamma\) are tuning parameters. We can specify how much we want to penalize for leaves and for weight sizes.

  • \(\gamma\) = gamma [Default = 0]

  • \(\lambda\) = lambda [Default = 1]
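Here is that complexity measure in code (a small sketch; the leaf weights below are just the 4, 1, 0 from our earlier three-leaf example):

import numpy as np

def complexity(weights, gamma = 0.0, lam = 1.0):
    # Omega(f) = gamma * T + (1/2) * lambda * (sum of squared leaf weights)
    w = np.asarray(weights, dtype = float)
    T = len(w)                                   # T = number of leaves
    return gamma * T + 0.5 * lam * np.sum(w ** 2)

complexity([4, 1, 0])                            # 0 * 3 + 0.5 * 1 * 17 = 8.5 with the default gamma and lambda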

Approximating loss and the general algorithm

Taylor Approximation

\[\begin{align*} L(a + \Delta ) &\approx L(a) + L'(a) \Delta + \frac{1}{2} L''(a) \Delta^2 \\ &= L(a) + g \Delta + \frac{1}{2} h \Delta^2 \end{align*}\]

where \(g\) is the gradient (first derivative) and \(h\) is the hessian (2nd derivative).

And because our loss functions are additive, we can write this as

\[\begin{align*} L(a + \Delta) &\approx \sum_i \left[ l(a_i) + g_i \Delta_i + \frac{1}{2} h_i \Delta_i^2 \right] \end{align*}\]

where \(g_i = \frac{\partial l}{\partial \hat y_i}\) and \(h_i = \frac{\partial^2 l}{\partial \hat y_i^2}\), both evaluated at \(a_i\). (We’ll see soon how to choose \(a\) and \(\Delta\).)

Applying this to our quadratic loss function (where the “approximation” is exact):

  • \(g_i = - 2(y_i - \hat y_i) = -2 \cdot \mbox{residual}_i\)
  • \(h_i = 2\)

To avoid those 2’s, sometimes the loss function is specified as

\[ L(\mathbf{y}, \hat{\mathbf{y}}) = \sum_i \frac{1}{2} (y_i - \hat y_i)^2 \]

The results of XGBoost will be the same with any constant multiple of a given loss function (and a similar adjustment to the penalty term), since the minima occur at the same place. This is just about making the notation nicer (or choosing a form that has computational advantages).

In any case, the big story here is this:

  • If we can handle quadratic loss, we should be able to handle (at least approximately) any twice differentiable loss function.

  • If we express things in terms of \(g_i\) and \(h_i\), we won’t have to redo any of our work when we switch to an arbitrary loss function.
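For quadratic loss, \(g\) and \(h\) are one-liners. (Expressing a loss through its gradient and hessian like this is also roughly the form XGBoost’s custom-objective hooks expect, though we won’t need a custom objective here.)

import numpy as np

def grad_squared_error(y, y_hat):
    # g_i = d l / d y_hat_i = -2 * (y_i - y_hat_i) = -2 * residual
    return -2.0 * (y - y_hat)

def hess_squared_error(y, y_hat):
    # h_i = d^2 l / d y_hat_i^2 = 2 (a constant for squared-error loss)
    return 2.0 * np.ones_like(y, dtype = float)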

Minimizing quadratic functions

Quadratic functions (like our least squares loss or quadratic approximations to other loss functions) are easy to minimize.

Suppose \(F = G x + \frac{1}{2} H x^2 + k\). We can minimize by determining where the derivative is 0.

\(F' = G + Hx = 0\) when \(x = \frac{-G}{H}\). If \(H >0\), then this determines a minimum, so

\[\begin{align*} \operatorname{argmin}_x F(x) &= \frac{-G}{H} \\ \operatorname{min}_x F(x) &= F\left(\frac{-G}{H}\right) = - \frac{1}{2} \frac{G^2}{H} + k \end{align*}\]

Putting the pieces together

All that remains is to figure out how to choose the trees and the weights at each step of the process. Suppose our current prediction is

\[ \hat y_i^{(t)} = F_t(x_i) = f_1(x_i) + f_2(x_i) + \cdots + f_t(x_i) \]

Our goal is to come up with the next little \(f\), once we have the previous ones. We would like to do this to minimize our objective function:

\[\begin{align*} obj(\theta) &= L(\theta) + \Omega(\theta) \\ & = \sum_{i=1}^n l(y_i, \hat y_i^{(t)}) + \sum_{k = 1}^{t} \Omega(f_k) \end{align*}\]

Much of this is constant:

  • We can’t change the data, so \(y_i\) is constant.
  • We can’t change the earlier \(f_k\)’s, so they are constant.
  • Only things that depend on \(f_{t}\) (our new tree) are involved in the minimization.
  • The only part of \(\hat y_i\) that changes is the last term: \(f_t(x_i)\).

Using our quadratic approximation to the loss function we get

\[ l(y_i, \hat y_i^{(t)}) = l\!\left(y_i, \; \hat y_i^{(t-1)} + f_{t}(x_i)\right) \approx l(y_i, \hat y_i^{(t-1)}) + g_i^{(t-1)} f_t(x_i) + \frac12 h_i^{(t-1)} \left(f_t(x_i)\right)^2 \]

For the moment, let’s assume that the tree structure for \(f_t\) is known and has \(T\) leaves. We want to determine the leaf weights \(w\) and how well the proposed tree works (in terms of loss + penalty).

Dropping the superscripts on \(g\) and \(h\), we can write our objective function as

\[\begin{align*} obj & = \sum_{i=1}^n \left[ g_i f_{t}(x_i) + \frac12 h_i \left(f_{t}(x_i)\right)^2 \right] + \Omega(f_t) + \mbox{constant} \\ &= \sum_{i=1}^n \left[ g_i f_{t}(x_i) + \frac12 h_i \left(f_{t}(x_i)\right)^2 \right] + \frac 12 \lambda \sum_{j = 1}^T w_j^2 + \gamma T + \mbox{constant} \end{align*}\]

For each leaf \(j\), \(j = 1, 2, \dots T\), let \(I_j\) be the indices of the rows assigned to leaf \(j\) by our tree and define

  • \(G_j = \sum_{i \in I_j} g_i\)
  • \(H_j = \sum_{i \in I_j} h_i\)

That is, add up all the \(g\)’s and \(h\)’s associated with each leaf. Our objective then becomes

\[\begin{align*} obj &= \sum_{j=1}^T \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac12 \left( \sum_{i \in I_j} h_i \right) w_j^2 \right] + \frac 12 \lambda \sum_{j = 1}^T w_j^2 + \gamma T + \mbox{constant} \\ &= \sum_{j=1}^T \left[G_j w_j + \frac12 H_j w_j^2 + \frac 12 \lambda w_j^2 \right] + \gamma T + \mbox{constant} \\ &= \sum_{j=1}^T \left[G_j w_j + \frac12 (H_j + \lambda) w_j^2 \right] + \gamma T + \mbox{constant} \end{align*}\]

The quadratic pieces are independent, so we can minimize by minimizing each piece using our minimization result for quadratic functions:

  • \(\displaystyle w_j = \frac{-G_j}{H_j + \lambda}\)
  • \(\displaystyle obj = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T + \mbox{constant}\)

An example

Let’s use our five-person data set and suppose that in step 1 we used a tree with one leaf and minimized the loss for that tree shape. Then \(f_1(x_i) = \overline y = 1.0\), as we saw above.

Now let’s compute \(f_2\) using a tree with three leaves (the leaf column in the table below records which leaf each person is assigned to):

For least squares loss we have

  • \(g_i = -2 (y_i - \hat y_i) = -2 \cdot \mbox{residual}\)
  • \(h_i = 2\)

Let’s add these values to our data table:

sex  age  daily_comp  score  leaf  resid   g  h
M     13  Y               4     1      3  -6  2
F     23  Y               3     3      2  -4  2
M     55  N              -1     3     -2   4  2
F     14  N               1     2      0   0  2
F     60  N              -2     3     -3   6  2

Now suppose \(\lambda = 1\), then we can compute the weight for each leaf using

\[ w = \frac{-G}{H + \lambda} \]

leaf  λ   G  H  w = -G / (H + λ)
   1  1  -6  2            2
   2  1   0  2            0
   3  1   6  6      -0.8571
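Here is the same computation in pandas, grouping the \(g\)’s and \(h\)’s by leaf (the leaf assignments and \(\lambda = 1\) are taken from the table above):

import pandas as pd

tbl = pd.DataFrame({
    'leaf': [1, 3, 3, 2, 3],        # leaf assignment for each of the five people
    'g':    [-6, -4, 4, 0, 6],      # g_i = -2 * residual
    'h':    [2, 2, 2, 2, 2],        # h_i = 2 for squared-error loss
})
lam = 1

GH = tbl.groupby('leaf').sum()      # G_j and H_j: sums of g and h within each leaf
GH['w'] = -GH['g'] / (GH['h'] + lam)
GH                                  # leaf 1: w = 2, leaf 2: w = 0, leaf 3: w = -0.8571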

Exploring all the tree structures (efficiently)

We’ve demonstrated how to find the optimal weights for a particular tree structure. We need to do this for all possible tree structures and pick the best one… But that’s a LOT of tree structures, so we need a couple more ideas.

The main one is to build the trees one branching at a time. This greedy algorithm is not guaranteed to lead to the optimal tree, but in practice it works well in most cases.

The key insight is that we can easily compute whether a given split of a node improves things or not. We’ll start with a tree that has just one node and keep trying to split nodes. When we consider splitting a node into left and right children, we can easily compute how much the objective function will change because the only part of the objective function that changes is local to the node being split:

\[ \operatorname{Gain} = \underbrace{ \frac{1}{2} \left[ \underbrace{\frac{G^2_L}{H_L + \lambda} + \frac{G^2_R}{H_R + \lambda} }_{\mbox{two new nodes}} - \underbrace{\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} }_{\mbox{old node}} \right] }_{\mbox{change in loss}} \;\;\; \underbrace {\;\; - \gamma\quad }_{\mbox{change in complexity}} \] If \(\operatorname{Gain} < 0\), then the split is worse than the tree without the split, so we don’t make the split. If \(\operatorname{Gain} > 0\), then we do better with the split than without. We need the loss to decrease by more than the per-leaf penalty parameter \(\gamma\).

This approach does a sort of auto-pruning of the tree, never adding splits that don’t seem to be improving things. But there is another option: we can let the trees grow until they reach a maximum allowable depth and then remove splits that don’t have positive gain. This post-pruning is slower than pre-stopping, but it allows the algorithm to consider splits that only pay off after additional splits are made.

Note: Gain will be larger if the values in the split nodes are more similar to each other than in the combined node. This is because we add the residuals first and then square. So if there are some positive and some negative residuals, they cancel each other out. If we can split the node to separate the positive and negative residuals, then we will be able to improve our loss.

Because of this, the expression \(\frac{1}{2} \cdot \frac{G^2}{H + \lambda}\) is sometimes called the similarity score of a node.

For least squares loss and the standard complexity score,

\[ \mbox{similarity score} = \frac{ 2 (\mbox{sum of residuals})^2}{2 (\mbox{number of residuals}) + \lambda} = \frac{ (\mbox{sum of residuals})^2}{(\mbox{number of residuals}) + \frac{\lambda}{2}} \]
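Here is a small sketch of the gain computation, applied to a hypothetical candidate split of the root node when building the second tree in our example: splitting on daily_comp puts people 1 and 2 on one side and people 3, 4, and 5 on the other. (The \(g\) and \(h\) values come from the table above; the split itself is chosen just for illustration.)

def gain(G_L, H_L, G_R, H_R, lam = 1.0, gamma = 0.0):
    # improvement in the objective from splitting one leaf into left/right children
    old = (G_L + G_R) ** 2 / (H_L + H_R + lam)
    new = G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
    return 0.5 * (new - old) - gamma

# left  (daily_comp = Y): g = -6, -4    ->  G_L = -10, H_L = 4
# right (daily_comp = N): g =  4, 0, 6  ->  G_R =  10, H_R = 6
gain(-10, 4, 10, 6)                 # about 17.1 > 0, so this split would be worth making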

Slowing things down: the learning rate

Once the optimal tree is found, we don’t actually use it! Instead of adding \(f_t(x)\) to our model, we add only a fraction of that:

\[ F_t(x) = F_{t-1}(x) + \eta f_t(x) \] for some number \(\eta\) (eta) between 0 and 1.

Why slow things down?

  1. When the loss function is only approximately quadratic, this is like moving only a small distance along our approximation and then reapproximating, so the overall approximation will be better.

  2. It reduces the risk of overfitting by keeping the model from “getting too excited” about particular values or sets of values.
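The whole procedure, then, looks roughly like this (a schematic sketch of the algorithm, not XGBoost’s actual implementation; fit_tree is a hypothetical stand-in for the greedy tree-building step described above):

import numpy as np

def boost(X, y, n_rounds = 100, eta = 0.3, base_score = 0.5):
    # schematic gradient boosting loop for squared-error loss
    y_hat = np.full(len(y), base_score)           # starter prediction
    trees = []
    for t in range(n_rounds):
        g = -2.0 * (y - y_hat)                    # gradients at the current predictions
        h = 2.0 * np.ones(len(y))                 # hessians (constant for squared-error loss)
        tree = fit_tree(X, g, h)                  # hypothetical greedy tree builder (see above)
        y_hat = y_hat + eta * tree.predict(X)     # add only a fraction eta of the new tree
        trees.append(tree)
    return trees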

Main tuning parameters

  • \(\lambda\) (lambda) – how much to penalize large weights in trees
  • \(\gamma\) (gamma) – how much to penalize leaves in trees
  • \(\eta\) (eta) – learning rate, what fraction of the current tree to add to the model function we are building up.
  • n_estimators – how many trees to combine (this needs to be larger when eta is smaller)
  • max_depth – maximum depth of a tree

Fitting XGBoost models in Python

Fitting XGBoost models in Python is pretty easy. The trickier parts are preparing the data and choosing good values for the tuning parameters.

We’ll have more to say about that another day. For now, let’s just fit a model to our simple data set.

See if you can anticipate what the key steps/ingredients are going to be.

Fitting an XGBoost model

Load packages

import pandas as pd
import numpy as np
import xgboost as xgb

Load and prep data

Five_People = pd.read_csv('data/five-people.csv', 
  dtype = {'sex': 'category', 'daily_comp': 'category'})
  
Five_People
  sex  age daily_comp  score
0   M   13          Y      4
1   F   23          Y      3
2   M   55          N     -1
3   F   14          N      1
4   F   60          N     -2
Five_People.dtypes
sex           category
age              int64
daily_comp    category
score            int64
dtype: object
X = Five_People.iloc[:, 0:3] 
y = Five_People.iloc[:, 3:4] 

# convert categorical data to 1-hot encodings
X['sex'] = pd.get_dummies(X['sex']).iloc[:, 0:1]
X['daily_comp'] = pd.get_dummies(X['daily_comp']).iloc[:, 0:1]
X
   sex  age  daily_comp
0    0   13           0
1    1   23           0
2    0   55           1
3    1   14           1
4    1   60           1
X.dtypes
sex           uint8
age           int64
daily_comp    uint8
dtype: object

Fit with XGBRegressor()

For the moment, let’s just use the default values of the tuning parameters. You can see what those are in the output below.

model = xgb.XGBRegressor( )
model.fit(X,y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=12,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

Five_People['prediction'] = model.predict(X)
Five_People
  sex  age daily_comp  score  prediction
0   M   13          Y      4    3.998966
1   F   23          Y      3    3.000029
2   M   55          N     -1   -0.999993
3   F   14          N      1    0.999880
4   F   60          N     -2   -1.998883

That the model does this well should not surprise us since it is allowing for

  • 100 trees (n_estimators = 100)
  • of depth 6 (max_depth = 6)
  • with no penalty for adding leaves (gamma = 0)
  • modest penalty for large weights (reg_lambda = 1)

And we already know that one tree with 5 leaves can fit the training data perfectly.

Let’s try setting some of those tuning parameters.

model = xgb.XGBRegressor(tree_method = 'hist',
                         max_depth = 2, max_leaves = 3,
                         reg_lambda = 0, gamma = 0, 
                         grow_policy = 'lossguide',
                         n_estimators = 2, eta = 1)
model.fit(X,y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             eta=1, gamma=0, gpu_id=-1, grow_policy='lossguide',
             importance_type=None, interaction_constraints='', learning_rate=1,
             max_delta_step=0, max_depth=2, max_leaves=3, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=2, n_jobs=12,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=0, scale_pos_weight=1, subsample=1, tree_method='hist',
             validate_parameters=1, ...)

Five_People['prediction'] = model.predict(X)
Five_People
  sex  age daily_comp  score  prediction
0   M   13          Y      4         4.0
1   F   23          Y      3         3.0
2   M   55          N     -1        -1.0
3   F   14          N      1         1.0
4   F   60          N     -2        -2.0
import matplotlib.pyplot as plt

xgb.plot_tree(model, num_trees = 0)
plt.show()

xgb.plot_tree(model, num_trees = 1)
plt.show()

Notes:

  • num_trees is not an ideal name for the argument to plot_tree(). It specifies the index of one tree, not a number of trees.

  • It turns out that XGBoost (at least in default settings) starts its prediction with \(f_{-1}(x) = 0.5\) for all values of the predictor \(x\) (rather than minimizing the loss for a tree with one node). We can see this by looking at the two trees above.

\[ F(x) = f_{-1}(x) + f_0(x) + f_1(x) \]

\[ F(x) = 0.5 + f_0(x) + f_1(x) \]

You can adjust this starting value with base_score if you ever want to.
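For example, we could tell XGBoost to start at the mean response instead (just a sketch; the other tuning parameters would be whatever you were otherwise using):

model = xgb.XGBRegressor(base_score = float(Five_People['score'].mean()),
                         n_estimators = 2, eta = 1)
model.fit(X, y)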