This introduction is loosely based on the following resources:

Loading some packages

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import xgboost as xgb

Loading some data

library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
data(Boston)
Boston |> DT::datatable()

The variable names are very terse. Use ? Boston in R to get more info.

Boston = r.Boston
Boston.shape
(506, 14)
Boston.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int32  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int32  
 9   tax      506 non-null    float64
 10  ptratio  506 non-null    float64
 11  black    506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(12), int32(2)
memory usage: 51.5 KB
X = Boston.drop('medv', axis = 1)  # axis = 'columns' also works
y = Boston['medv']

Let’s fit using defaults for starters.

model = xgb.XGBRegressor()
model.fit(X,y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=12,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

That was easy enough. Let’s see how well the model fits.

model.score(X,y)
0.999982906820517
But what is the score? Is that a good score?
# We can compute the model score by hand to confirm that we know which metric is 
# being used.
yhat = model.predict(X)
1 - ((y - yhat)**2).sum() / ((y - y.mean())**2).sum()
0.999982906820517
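
For scikit-learn regressors, score() returns R², the coefficient of determination, which is exactly what the hand computation above reproduces. As a quick cross-check, sklearn.metrics.r2_score gives the same number:

from sklearn.metrics import r2_score
r2_score(y, yhat)  # should match model.score(X, y)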

Let’s compare \(\hat y\) to \(y\) with a scatter plot.

library(ggformula)
gf_point(py$yhat ~ py$y)

Hmm… That seems to be fitting very well – perhaps too well. Let’s start over and make a few changes.

  1. Let’s create a training set and a test set.
  2. Let’s experiment with some of the parameter settings.
from sklearn.model_selection import train_test_split

(X_train, X_test, y_train, y_test) = \
  train_test_split(X, y, test_size = 0.25, random_state = 12345)
  
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(379, 13), (379,), (127, 13), (127,)]
model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=12,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)
model.score(X_train, y_train)
0.9999967536450873
model.score(X_test, y_test)
0.8411521612537046
yhat_train = model.predict(X_train)
yhat_test = model.predict(X_test)
gf_point(py$yhat_test ~ py$y_test, color = ~"test") |>
  gf_point(py$yhat_train ~ py$y_train, color = ~ "train", alpha = 0.4)

Experiments with tuning parameters

What tuning parameters are available?

model.get_params()
{'objective': 'reg:squarederror', 'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'enable_categorical': False, 'gamma': 0, 'gpu_id': -1, 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.300000012, 'max_delta_step': 0, 'max_depth': 6, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 12, 'num_parallel_tree': 1, 'predictor': 'auto', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None}

What things can we tweak in hopes of reducing the overfitting?

model = xgb.XGBRegressor(max_depth = 4)
model.fit(X_train,y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=4, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=12,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)
model.score(X_train, y_train)
0.9995221712710443
model.score(X_test, y_test)
0.8153935400000378
model = xgb.XGBRegressor(max_depth = 4, tree_method = 'hist', max_leaves = 4, gamma = 2)
model.fit(X_train,y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=2, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=4, max_leaves=4, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100,
             n_jobs=12, num_parallel_tree=1, predictor='auto', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='hist', validate_parameters=1, verbosity=None)
model.score(X_train, y_train)
0.9800735622263299
model.score(X_test, y_test)
0.7996959080743612
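
Rather than adjusting one parameter at a time by hand, we could search a small grid with cross-validation. Here is a hedged sketch using scikit-learn's GridSearchCV; the grid values are arbitrary choices for illustration, not recommendations:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [2, 3, 4, 6],
    'learning_rate': [0.05, 0.1, 0.3],
    'n_estimators': [50, 100, 200],
}
search = GridSearchCV(xgb.XGBRegressor(), param_grid, cv = 5, scoring = 'r2')
search.fit(X_train, y_train)
# search.best_params_ and search.best_score_ report the winning combination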

A more systematic approach

From a blog post

Data cleaning and preparation is easily the most time-consuming and boring task in machine learning. All ML algorithms are really fussy, some want normalized or standardized features, some want encoded variables and some want both. Then, there is also the issue of missing values which is always there.

Dealing with them is no fun at all, not to mention the added bonus that comes with repeating the same cleaning operations on all training, validation and test sets. Fortunately, Scikit-learn’s Pipeline is a major productivity tool to facilitate this process, cleaning up code and collapsing all preprocessing and modeling steps into a single line of code.

Data

Let’s grab a new data set that has a few messier things to deal with. These data are from Kaggle (https://www.kaggle.com/competitions/home-data-for-ml-course/).

Ames = pd.read_csv('data/Ames-housing-train.csv')
# Ames_test = pd.read_csv('data/Ames-housing-test.csv')
Ames.shape
(1460, 81)

We will try to predict the sale price of houses from the other variables (features) available to us. So let’s create our usual X and y:

X = Ames.drop('SalePrice', axis = 1)
y = Ames['SalePrice']
  • Note: You might prefer to call these X_Ames and y_Ames, especially if you are working with multiple data sets.

We have a mix of numeric and categorical data. The categorical data is being read in as object dtype:

Ames.dtypes.value_counts()
object     43
int64      35
float64     3
dtype: int64

And we have some missing data

Ames.isnull().sum(axis = 0)
Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64
Doing this with a comprehension instead

Here is another way to get the information about missing data.

[v.isnull().sum() for n, v in Ames.iteritems()]
[0, 0, 0, 259, 0, 0, 1369, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 0, 0, 37, 37, 38, 37, 0, 38, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 690, 81, 81, 81, 0, 0, 81, 81, 0, 0, 0, 0, 0, 0, 0, 1453, 1179, 1406, 0, 0, 0, 0, 0, 0]
  • iteritems() produces an iterable over the columns of our pandas data frame. For each column we get the name of the column (n) and the values in the column (v). (In newer versions of pandas, iteritems() has been removed; items() does the same thing.)

  • v.isnull() returns True or False for each value in v (one of the columns of our data frame).

  • Since False = 0 and True = 1, adding gives us the number of True values.

Here is a fancier version:

[(n, v.isnull().sum()) for n, v in Ames.iteritems()]
[('Id', 0), ('MSSubClass', 0), ('MSZoning', 0), ('LotFrontage', 259), ('LotArea', 0), ('Street', 0), ('Alley', 1369), ('LotShape', 0), ('LandContour', 0), ('Utilities', 0), ('LotConfig', 0), ('LandSlope', 0), ('Neighborhood', 0), ('Condition1', 0), ('Condition2', 0), ('BldgType', 0), ('HouseStyle', 0), ('OverallQual', 0), ('OverallCond', 0), ('YearBuilt', 0), ('YearRemodAdd', 0), ('RoofStyle', 0), ('RoofMatl', 0), ('Exterior1st', 0), ('Exterior2nd', 0), ('MasVnrType', 8), ('MasVnrArea', 8), ('ExterQual', 0), ('ExterCond', 0), ('Foundation', 0), ('BsmtQual', 37), ('BsmtCond', 37), ('BsmtExposure', 38), ('BsmtFinType1', 37), ('BsmtFinSF1', 0), ('BsmtFinType2', 38), ('BsmtFinSF2', 0), ('BsmtUnfSF', 0), ('TotalBsmtSF', 0), ('Heating', 0), ('HeatingQC', 0), ('CentralAir', 0), ('Electrical', 1), ('1stFlrSF', 0), ('2ndFlrSF', 0), ('LowQualFinSF', 0), ('GrLivArea', 0), ('BsmtFullBath', 0), ('BsmtHalfBath', 0), ('FullBath', 0), ('HalfBath', 0), ('BedroomAbvGr', 0), ('KitchenAbvGr', 0), ('KitchenQual', 0), ('TotRmsAbvGrd', 0), ('Functional', 0), ('Fireplaces', 0), ('FireplaceQu', 690), ('GarageType', 81), ('GarageYrBlt', 81), ('GarageFinish', 81), ('GarageCars', 0), ('GarageArea', 0), ('GarageQual', 81), ('GarageCond', 81), ('PavedDrive', 0), ('WoodDeckSF', 0), ('OpenPorchSF', 0), ('EnclosedPorch', 0), ('3SsnPorch', 0), ('ScreenPorch', 0), ('PoolArea', 0), ('PoolQC', 1453), ('Fence', 1179), ('MiscFeature', 1406), ('MiscVal', 0), ('MoSold', 0), ('YrSold', 0), ('SaleType', 0), ('SaleCondition', 0), ('SalePrice', 0)]

And even fancier:

[(n, v.isnull().sum()) for n, v in Ames.iteritems() if v.isnull().sum() > 0]
[('LotFrontage', 259), ('Alley', 1369), ('MasVnrType', 8), ('MasVnrArea', 8), ('BsmtQual', 37), ('BsmtCond', 37), ('BsmtExposure', 38), ('BsmtFinType1', 37), ('BsmtFinType2', 38), ('Electrical', 1), ('FireplaceQu', 690), ('GarageType', 81), ('GarageYrBlt', 81), ('GarageFinish', 81), ('GarageQual', 81), ('GarageCond', 81), ('PoolQC', 1453), ('Fence', 1179), ('MiscFeature', 1406)]
[(n, v.sum()) for n, v in Ames.isnull().iteritems() if v.sum() > 0]
[('LotFrontage', 259), ('Alley', 1369), ('MasVnrType', 8), ('MasVnrArea', 8), ('BsmtQual', 37), ('BsmtCond', 37), ('BsmtExposure', 38), ('BsmtFinType1', 37), ('BsmtFinType2', 38), ('Electrical', 1), ('FireplaceQu', 690), ('GarageType', 81), ('GarageYrBlt', 81), ('GarageFinish', 81), ('GarageQual', 81), ('GarageCond', 81), ('PoolQC', 1453), ('Fence', 1179), ('MiscFeature', 1406)]

And let’s convert that into a data frame:

pd.DataFrame([(n, v.isnull().sum()) for n, v in Ames.iteritems() if v.isnull().sum() > 0])
               0     1
0    LotFrontage   259
1          Alley  1369
2     MasVnrType     8
3     MasVnrArea     8
4       BsmtQual    37
5       BsmtCond    37
6   BsmtExposure    38
7   BsmtFinType1    37
8   BsmtFinType2    38
9     Electrical     1
10   FireplaceQu   690
11    GarageType    81
12   GarageYrBlt    81
13  GarageFinish    81
14    GarageQual    81
15    GarageCond    81
16        PoolQC  1453
17         Fence  1179
18   MiscFeature  1406
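
For comparison, the same table can be built without a comprehension, using pandas methods directly (a minimal sketch; the name missing and the column label n_missing are my own choices):

missing = Ames.isnull().sum()       # missing count for every column
missing[missing > 0].to_frame('n_missing')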

As we have seen, missing data is not a big problem for XGBoost, but it is for some other algorithms. And it is just generally good to know what sort of missing data issues you might have.

In this case, missing seems to mean that the item doesn’t apply:

X['MiscFeature'].value_counts()
Shed    49
Gar2     2
Othr     2
TenC     1
Name: MiscFeature, dtype: int64
X['Fireplaces'].value_counts()
0    690
1    650
2    115
3      5
Name: Fireplaces, dtype: int64
X['FireplaceQu'].value_counts()
Gd    380
TA    313
Fa     33
Ex     24
Po     20
Name: FireplaceQu, dtype: int64
X['FireplaceQu'].isnull().value_counts()
False    770
True     690
Name: FireplaceQu, dtype: int64
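
We can check that reading directly: if a missing FireplaceQu really means “no fireplace”, then every house with a missing FireplaceQu should report zero fireplaces. A quick sketch:

# Fireplaces counts for the rows where FireplaceQu is missing;
# if the interpretation is right, these should all be 0.
X.loc[X['FireplaceQu'].isnull(), 'Fireplaces'].value_counts()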

Data Transformations

We’d like to perform two types of transformation on our data:

  • For categorical variables, we need to create (multiple) columns of 0’s and 1’s.

  • For numerical variables, let’s compute standardized versions (mean = 0, standard deviation = 1).

We could do this directly to our data (Ames or X), but instead, let’s create a pipeline that can perform these actions on any data set, since we’re likely to want to do this sort of thing frequently.

We start by importing some more stuff.

from sklearn.pipeline import Pipeline
import sklearn.preprocessing as pre
from sklearn.compose import ColumnTransformer

We can use ColumnTransformer() to perform our transformations on columns.

Let’s start by writing a couple of functions that detect whether a column is numeric or categorical. ColumnTransformer() can take a function that returns a list of column names or a list of booleans.

from pandas.api.types import is_numeric_dtype
[is_numeric_dtype(v) for n, v in X.iteritems()]
[True, True, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, True, True, True, False, False, False, False, True, True, True, True, True, True, True, True, True, True, False, True, False, True, False, False, True, False, True, True, False, False, False, True, True, True, True, True, True, False, False, False, True, True, True, False, False]
def numeric_cols(X):
    return [is_numeric_dtype(v) for n, v in X.iteritems()]
def cat_cols(X):
    return [not is_numeric_dtype(v) for n, v in X.iteritems()]
numeric_cols(X)
[True, True, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, True, True, True, False, False, False, False, True, True, True, True, True, True, True, True, True, True, False, True, False, True, False, False, True, False, True, True, False, False, False, True, True, True, True, True, True, False, False, False, True, True, True, False, False]
cat_cols(X)
[False, False, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, True, True, True, True, True, False, True, True, True, True, True, True, True, False, True, False, False, False, True, True, True, True, False, False, False, False, False, False, False, False, False, False, True, False, True, False, True, True, False, True, False, False, True, True, True, False, False, False, False, False, False, True, True, True, False, False, False, True, True]
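
scikit-learn also ships a ready-made column selector that could replace these hand-written functions. A sketch, assuming a reasonably recent version of scikit-learn:

from sklearn.compose import make_column_selector

# selectors are callables: given a data frame, they return the matching column names
num_selector = make_column_selector(dtype_include = np.number)
cat_selector = make_column_selector(dtype_include = object)
num_selector(X)[:5]  # first few numeric column names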

Now let’s create a data transformer using some of the preprocessing functions from scikit-learn.

data_transformer = ColumnTransformer(
  transformers = [
    ('rescale numeric', pre.StandardScaler(), numeric_cols),
    ('recode categorical', 
      pre.OneHotEncoder(handle_unknown = 'ignore', sparse = False), 
      cat_cols)
    ])
    
data_transformer.fit_transform(X).mean(axis = 0)
array([ 8.24302574e-17, -1.50412407e-16,             nan, -4.20278348e-17,
       -7.68030996e-17,  3.69263220e-16,  1.03298268e-15,  4.51891188e-15,
                   nan,  1.50564492e-17,  1.65316771e-16, -7.37613927e-17,
        2.06303772e-16,  7.68601317e-17, -3.26983494e-17,  1.92026760e-16,
       -1.44633164e-16,  4.10630434e-17,  9.99961149e-18,  2.77099500e-16,
       -3.01128985e-17, -1.10889228e-16,  2.23793586e-16,  4.38766223e-17,
        2.70103574e-16,             nan,  2.58545088e-17, -2.02273510e-17,
        2.19002898e-16,  3.35728401e-17,  1.37903387e-16, -4.00573790e-16,
        1.19919295e-16, -7.94588900e-16,  2.40960220e-16, -6.39518879e-17,
        3.56610099e-14,  6.84931507e-03,  4.45205479e-02,  1.09589041e-02,
        7.88356164e-01,  1.49315068e-01,  4.10958904e-03,  9.95890411e-01,
        3.42465753e-02,  2.80821918e-02,  9.37671233e-01,  3.31506849e-01,
        2.80821918e-02,  6.84931507e-03,  6.33561644e-01,  4.31506849e-02,
        3.42465753e-02,  2.46575342e-02,  8.97945205e-01,  9.99315068e-01,
        6.84931507e-04,  1.80136986e-01,  6.43835616e-02,  3.21917808e-02,
        2.73972603e-03,  7.20547945e-01,  9.46575342e-01,  4.45205479e-02,
        8.90410959e-03,  1.16438356e-02,  1.36986301e-03,  1.09589041e-02,
        3.97260274e-02,  1.91780822e-02,  1.02739726e-01,  3.49315068e-02,
        6.84931507e-02,  5.41095890e-02,  2.53424658e-02,  1.16438356e-02,
        3.35616438e-02,  1.54109589e-01,  6.16438356e-03,  5.00000000e-02,
        2.80821918e-02,  5.27397260e-02,  7.73972603e-02,  1.71232877e-02,
        5.06849315e-02,  4.04109589e-02,  5.89041096e-02,  1.71232877e-02,
        2.60273973e-02,  7.53424658e-03,  3.28767123e-02,  5.54794521e-02,
        8.63013699e-01,  5.47945205e-03,  1.30136986e-02,  7.53424658e-03,
        1.78082192e-02,  1.36986301e-03,  3.42465753e-03,  1.36986301e-03,
        4.10958904e-03,  9.89726027e-01,  6.84931507e-04,  1.36986301e-03,
        6.84931507e-04,  6.84931507e-04,  1.36986301e-03,  8.35616438e-01,
        2.12328767e-02,  3.56164384e-02,  2.94520548e-02,  7.80821918e-02,
        1.05479452e-01,  9.58904110e-03,  4.97260274e-01,  5.47945205e-03,
        7.53424658e-03,  3.04794521e-01,  2.53424658e-02,  4.45205479e-02,
        8.90410959e-03,  7.81506849e-01,  7.53424658e-03,  1.95890411e-01,
        4.79452055e-03,  1.36986301e-03,  6.84931507e-04,  9.82191781e-01,
        6.84931507e-04,  6.84931507e-04,  6.84931507e-04,  7.53424658e-03,
        3.42465753e-03,  4.10958904e-03,  1.36986301e-02,  6.84931507e-04,
        1.36986301e-03,  3.42465753e-02,  6.84931507e-04,  4.17808219e-02,
        1.52054795e-01,  6.84931507e-04,  1.50684932e-01,  7.39726027e-02,
        1.36986301e-03,  1.71232877e-02,  3.52739726e-01,  1.41095890e-01,
        1.78082192e-02,  1.36986301e-02,  2.05479452e-03,  4.79452055e-03,
        1.71232877e-02,  6.84931507e-04,  4.10958904e-02,  1.41780822e-01,
        6.84931507e-03,  1.46575342e-01,  6.84931507e-04,  9.72602740e-02,
        3.42465753e-03,  1.78082192e-02,  3.45205479e-01,  1.34931507e-01,
        2.60273973e-02,  1.02739726e-02,  3.04794521e-01,  5.91780822e-01,
        8.76712329e-02,  5.47945205e-03,  3.56164384e-02,  9.58904110e-03,
        3.34246575e-01,  6.20547945e-01,  2.05479452e-03,  1.91780822e-02,
        1.00000000e-01,  6.84931507e-04,  8.78082192e-01,  1.00000000e-01,
        4.34246575e-01,  4.43150685e-01,  1.64383562e-02,  4.10958904e-03,
        2.05479452e-03,  8.28767123e-02,  2.39726027e-02,  4.23287671e-01,
        4.44520548e-01,  2.53424658e-02,  3.08219178e-02,  4.45205479e-02,
        1.36986301e-03,  8.97945205e-01,  2.53424658e-02,  1.51369863e-01,
        9.17808219e-02,  7.80821918e-02,  6.52739726e-01,  2.60273973e-02,
        1.50684932e-01,  1.01369863e-01,  2.86301370e-01,  5.06849315e-02,
        9.10958904e-02,  2.94520548e-01,  2.53424658e-02,  1.30136986e-02,
        2.26027397e-02,  9.58904110e-03,  3.15068493e-02,  3.69863014e-02,
        8.60273973e-01,  2.60273973e-02,  6.84931507e-04,  9.78082192e-01,
        1.23287671e-02,  4.79452055e-03,  1.36986301e-03,  2.73972603e-03,
        5.07534247e-01,  3.35616438e-02,  1.65068493e-01,  6.84931507e-04,
        2.93150685e-01,  6.50684932e-02,  9.34931507e-01,  6.43835616e-02,
        1.84931507e-02,  2.05479452e-03,  6.84931507e-04,  9.13698630e-01,
        6.84931507e-04,  6.84931507e-02,  2.67123288e-02,  4.01369863e-01,
        5.03424658e-01,  9.58904110e-03,  3.42465753e-03,  2.12328767e-02,
        2.32876712e-02,  1.02739726e-02,  6.84931507e-04,  9.31506849e-01,
        1.64383562e-02,  2.26027397e-02,  2.60273973e-01,  1.36986301e-02,
        2.14383562e-01,  4.72602740e-01,  4.10958904e-03,  5.95890411e-01,
        1.30136986e-02,  6.02739726e-02,  6.16438356e-03,  2.65068493e-01,
        5.54794521e-02,  2.41095890e-01,  2.89041096e-01,  4.14383562e-01,
        5.54794521e-02,  2.05479452e-03,  3.28767123e-02,  9.58904110e-03,
        2.05479452e-03,  8.97945205e-01,  5.54794521e-02,  1.36986301e-03,
        2.39726027e-02,  6.16438356e-03,  4.79452055e-03,  9.08219178e-01,
        5.54794521e-02,  6.16438356e-02,  2.05479452e-02,  9.17808219e-01,
        1.36986301e-03,  1.36986301e-03,  2.05479452e-03,  9.95205479e-01,
        4.04109589e-02,  3.69863014e-02,  1.07534247e-01,  7.53424658e-03,
        8.07534247e-01,  1.36986301e-03,  1.36986301e-03,  3.35616438e-02,
        6.84931507e-04,  9.63013699e-01,  2.94520548e-02,  2.73972603e-03,
        1.36986301e-03,  6.16438356e-03,  3.42465753e-03,  3.42465753e-03,
        8.35616438e-02,  2.05479452e-03,  8.67808219e-01,  6.91780822e-02,
        2.73972603e-03,  8.21917808e-03,  1.36986301e-02,  8.20547945e-01,
        8.56164384e-02])

We can add this preprocessing to a pipeline that takes our “raw” data as input, transforms the columns, and then runs XGBoost:

from sklearn.pipeline import Pipeline
xgb_pipeline =  Pipeline(steps = [
    ('preprocess', data_transformer),
    ('XGB', xgb.XGBRegressor())
    ])
xgb_pipeline.fit(X, y)
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('rescale numeric',
                                                  StandardScaler(),
                                                  <function numeric_cols at 0x17eb01ab0>),
                                                 ('recode categorical',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  <function cat_cols at 0x17efc6680>)])),
                ('XGB',
                 XGBRegressor(base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree...
                              gamma=0, gpu_id=-1, importance_type=None,
                              interaction_constraints='',
                              learning_rate=0.300000012, max_delta_step=0,
                              max_depth=6, min_child_weight=1, missing=nan,
                              monotone_constraints='()', n_estimators=100,
                              n_jobs=12, num_parallel_tree=1, predictor='auto',
                              random_state=0, reg_alpha=0, reg_lambda=1,
                              scale_pos_weight=1, subsample=1,
                              tree_method='exact', validate_parameters=1,
                              verbosity=None))])
xgb_pipeline.score(X,y)
0.9996191793638435
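
As with the Boston data, an in-sample score this high tells us little about how the model generalizes. Here is a sketch of evaluating the same pipeline on a held-out split, reusing train_test_split from earlier; the names XA_train, etc. are my own:

(XA_train, XA_test, yA_train, yA_test) = \
  train_test_split(X, y, test_size = 0.25, random_state = 12345)

xgb_pipeline.fit(XA_train, yA_train)
[xgb_pipeline.score(XA_train, yA_train), xgb_pipeline.score(XA_test, yA_test)]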

Using XGBoost for classification

What needs to change?

  1. Conversions between the linear scale and the probability scale (just like logistic regression).
  2. A different loss function.
  3. The new loss function leads to new gradient and hessian calculations.
  4. Nothing else!

So we get classification almost “for free”.
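
As an illustration (not part of the original analysis), here is a hedged sketch that reuses the Ames pipeline for classification. The binary label y_binary, whether a house sold above the median price, is invented just for this example; xgb.XGBClassifier replaces xgb.XGBRegressor and everything else stays the same:

# made-up binary target: did the house sell above the median price?
y_binary = (y > y.median()).astype(int)

clf_pipeline = Pipeline(steps = [
    ('preprocess', data_transformer),
    ('XGB', xgb.XGBClassifier())
    ])
clf_pipeline.fit(X, y_binary)
clf_pipeline.score(X, y_binary)  # score() is accuracy for classifiers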

Linear -> Probability scale

\[ \log \mbox{odds} = \log (\frac{p}{1-p}) = y \] \[ \exp(y) = \frac{p}{1-p} \]

\[ \frac{ 1}{ 1 + \exp(-y)} = p \]

Loss function: Negative Log Likelihood

\[ l(y_i, \hat p_i) = -\log(\hat p_i^{y_i} \cdot (1-\hat p_i)^{1-y_i}) = - y_i \log \hat p_i - (1-y_i) \log(1-\hat p_i) \] This is often called the “log loss”.

Gradient/Hessian

Let’s calculate the gradient and hessian for the log loss.

\[ g_i = \frac {\partial l}{\partial \hat y_i} = \frac {\partial l}{\partial \hat p_i} \frac {\partial \hat p_i}{\partial \hat y_i} \]

\[ \frac {\partial l}{\partial \hat p_i} = \frac{y_i}{-\hat p_i} + \frac{1-y_i}{1-\hat p_i} = \frac{-y_i + \hat p_i y_i + \hat p_i - \hat p_i y_i}{\hat p_i (1 - \hat p_i)} = - \frac{y_i - \hat p_i}{\hat p_i (1-\hat p_i)} \] and

\[ \frac {\partial \hat y_i}{\partial \hat p_i} = \frac{1}{\hat p_i} + \frac{1}{1 - \hat p_i} = \frac{1 - \hat p_i + \hat p_i}{ \hat p_i (1-\hat p_i)} = \frac{1}{ \hat p_i (1-\hat p_i)}, \quad \mbox{so} \quad \frac{\partial \hat p_i}{\partial \hat y_i} = \hat p_i (1-\hat p_i) \]

So \[ g_i = \frac {\partial l}{\partial \hat p_i} \frac {\partial \hat p_i}{\partial \hat y_i} = - \frac{y_i - \hat p_i}{\hat p_i (1-\hat p_i)} \hat p_i (1-\hat p_i) = - (y_i - \hat p_i) = - \mbox{residual} \] \[ h_i = \frac{\partial g_i}{\partial \hat p_i} \frac {\partial \hat p_i}{\partial \hat y_i} = 1 \cdot \hat p_i (1 - \hat p_i) \]
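
These formulas are exactly what a custom objective must supply: a gradient and a hessian for each observation. Here is a hedged sketch using XGBoost's native API; the synthetic data and all variable names are my own, and the built-in 'binary:logistic' objective does the same thing more efficiently:

def log_loss_obj(preds, dtrain):
    # preds are on the linear (margin) scale; convert to probabilities
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))
    grad = p - y              # g_i = -(y_i - p_i)
    hess = p * (1.0 - p)      # h_i = p_i (1 - p_i)
    return grad, hess

# tiny synthetic example so the sketch runs end to end
rng = np.random.default_rng(0)
X_demo = rng.normal(size = (200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale = 0.5, size = 200) > 0).astype(float)
dtrain = xgb.DMatrix(X_demo, label = y_demo)
booster = xgb.train({'max_depth': 3}, dtrain, num_boost_round = 50, obj = log_loss_obj)
# note: with a custom objective, booster.predict() returns margins, not probabilities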

Preparing the data for XGBoost

The following things typically need to be done with our data before fitting our XGBoost model.

  1. Convert categorical variables into 1-hot vectors (0-1 vectors that have the value 1 for exactly one of the levels of the categorical variable).

  2. Rescale quantitative data.

    Many algorithms work better if the data are on a similar and modest scale. There are a number of normalizations that can be done. A common one is standardization: \(z = \frac{x - \overline{x}}{s_x}\).

  3. Remove predictors (features) that won’t be used in modeling.

  4. Separate the predictors (features) from the response (labels). In Python examples, these are often called X and y. That’s good for identifying their roles, but not so great for telling you what data they are associated with.

Using pipelines

For a given example, we could do these data transformations in a number of different ways. But sklearn provides a general framework for this that has some important advantages: