This introduction is loosely based on the following resources:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import xgboost as xgb
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
data(Boston)
Boston |> DT::datatable()
The variable names are very terse. Use ?Boston in R to get more info.
Boston = r.Boston
Boston.shape
(506, 14)
Boston.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 crim 506 non-null float64
1 zn 506 non-null float64
2 indus 506 non-null float64
3 chas 506 non-null int32
4 nox 506 non-null float64
5 rm 506 non-null float64
6 age 506 non-null float64
7 dis 506 non-null float64
8 rad 506 non-null int32
9 tax 506 non-null float64
10 ptratio 506 non-null float64
11 black 506 non-null float64
12 lstat 506 non-null float64
13 medv 506 non-null float64
dtypes: float64(12), int32(2)
memory usage: 51.5 KB
X = Boston.drop('medv', axis = 1) # axis = 'columns' also works
y = Boston['medv']
Let’s fit using defaults for starters.
model = xgb.XGBRegressor()
model.fit(X,y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=12,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
That was easy enough. Let’s see how well the model fits.
model.score(X,y)
0.999982906820517
# We can compute the model score by hand to confirm that we know which metric is
# being used.
yhat = model.predict(X)
1 - ((y - yhat)**2).sum() / ((y - y.mean())**2).sum()
0.999982906820517
Let’s compare \(\hat y\) to \(y\) with a scatter plot.
library(ggformula)
gf_point(py$yhat ~ py$y)
Hmm… That seems to be fitting very well – perhaps too well. Let’s start over and make a few changes.
from sklearn.model_selection import train_test_split
(X_train, X_test, y_train, y_test) = \
train_test_split(X, y, test_size = 0.25, random_state = 12345)
[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(379, 13), (379,), (127, 13), (127,)]
model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=12,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
model.score(X_train, y_train)
0.9999967536450873
model.score(X_test, y_test)
0.8411521612537046
yhat_train = model.predict(X_train)
yhat_test = model.predict(X_test)
gf_point(py$yhat_test ~ py$y_test, color = ~"test") |>
gf_point(py$yhat_train ~ py$y_train, color = ~ "train", alpha = 0.4)
What tuning parameters are available?
model.get_params()
{'objective': 'reg:squarederror', 'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'enable_categorical': False, 'gamma': 0, 'gpu_id': -1, 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.300000012, 'max_delta_step': 0, 'max_depth': 6, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 12, 'num_parallel_tree': 1, 'predictor': 'auto', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None}
What things can we tweak in hopes of reducing the overfitting?
model = xgb.XGBRegressor(max_depth = 4)
model.fit(X_train,y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=4, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=12,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
model.score(X_train, y_train)
0.9995221712710443
model.score(X_test, y_test)
0.8153935400000378
model = xgb.XGBRegressor(max_depth = 4, tree_method = 'hist', max_leaves = 4, gamma = 2)
model.fit(X_train,y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=2, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=4, max_leaves=4, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100,
n_jobs=12, num_parallel_tree=1, predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='hist', validate_parameters=1, verbosity=None)
model.score(X_train, y_train)
0.9800735622263299
model.score(X_test, y_test)
0.7996959080743612
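Testing settings one at a time like this gets tedious. As a rough sketch (not part of the original analysis), we could loop over a few candidate values and compare train and test scores; later we will let GridSearchCV() do this kind of exploration more systematically.
# Sketch: compare a few arbitrarily chosen max_depth values by hand.
for depth in [2, 3, 4, 6]:
    m = xgb.XGBRegressor(max_depth = depth, tree_method = 'hist')
    m.fit(X_train, y_train)
    print(depth, round(m.score(X_train, y_train), 3), round(m.score(X_test, y_test), 3))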
From a blog post
Data cleaning and preparation is easily the most time-consuming and boring task in machine learning. All ML algorithms are really fussy, some want normalized or standardized features, some want encoded variables and some want both. Then, there is also the issue of missing values which is always there.
Dealing with them is no fun at all, not to mention the added bonus that comes with repeating the same cleaning operations on all training, validation and test sets. Fortunately, Scikit-learn’s Pipeline is a major productivity tool to facilitate this process, cleaning up code and collapsing all preprocessing and modeling steps into a single line of code.
Let’s grab a new data set that has a few messier things to deal with. These data are from Kaggle (https://www.kaggle.com/competitions/home-data-for-ml-course/).
Ames = pd.read_csv('data/Ames-housing-train.csv')
# Ames_test = pd.read_csv('data/Ames-housing-test.csv')
Ames.shape
(1460, 81)
We will try to predict the sale price of houses from the other variables (features) available to us. So let’s create our usual X and y:
X = Ames.drop('SalePrice', axis = 1)
y = Ames['SalePrice']
Better names might be X_Ames and y_Ames, especially if you are working with multiple data sets.
We have a mix of numeric and categorical data. The categorical data is being read in as object dtype:
Ames.dtypes.value_counts()
object 43
int64 35
float64 3
dtype: int64
And we have some missing data:
Ames.isnull().sum(axis = 0)
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
...
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 81, dtype: int64
Here is another way to get the information about missing data.
[v.isnull().sum() for n, v in Ames.iteritems()]
[0, 0, 0, 259, 0, 0, 1369, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 0, 0, 37, 37, 38, 37, 0, 38, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 690, 81, 81, 81, 0, 0, 81, 81, 0, 0, 0, 0, 0, 0, 0, 1453, 1179, 1406, 0, 0, 0, 0, 0, 0]
iteritems() produces an iterable over the columns of our pandas data frame. For each column we get the name of the column (n) and the values in the column (v). (In newer versions of pandas, iteritems() has been replaced by items().)
v.isnull() returns True or False for each value in v (one of the columns of our data frame). Since False counts as 0 and True counts as 1, summing gives us the number of True values.
Here is a fancier version:
[(n, v.isnull().sum()) for n, v in Ames.iteritems()]
[('Id', 0), ('MSSubClass', 0), ('MSZoning', 0), ('LotFrontage', 259), ('LotArea', 0), ('Street', 0), ('Alley', 1369), ('LotShape', 0), ('LandContour', 0), ('Utilities', 0), ('LotConfig', 0), ('LandSlope', 0), ('Neighborhood', 0), ('Condition1', 0), ('Condition2', 0), ('BldgType', 0), ('HouseStyle', 0), ('OverallQual', 0), ('OverallCond', 0), ('YearBuilt', 0), ('YearRemodAdd', 0), ('RoofStyle', 0), ('RoofMatl', 0), ('Exterior1st', 0), ('Exterior2nd', 0), ('MasVnrType', 8), ('MasVnrArea', 8), ('ExterQual', 0), ('ExterCond', 0), ('Foundation', 0), ('BsmtQual', 37), ('BsmtCond', 37), ('BsmtExposure', 38), ('BsmtFinType1', 37), ('BsmtFinSF1', 0), ('BsmtFinType2', 38), ('BsmtFinSF2', 0), ('BsmtUnfSF', 0), ('TotalBsmtSF', 0), ('Heating', 0), ('HeatingQC', 0), ('CentralAir', 0), ('Electrical', 1), ('1stFlrSF', 0), ('2ndFlrSF', 0), ('LowQualFinSF', 0), ('GrLivArea', 0), ('BsmtFullBath', 0), ('BsmtHalfBath', 0), ('FullBath', 0), ('HalfBath', 0), ('BedroomAbvGr', 0), ('KitchenAbvGr', 0), ('KitchenQual', 0), ('TotRmsAbvGrd', 0), ('Functional', 0), ('Fireplaces', 0), ('FireplaceQu', 690), ('GarageType', 81), ('GarageYrBlt', 81), ('GarageFinish', 81), ('GarageCars', 0), ('GarageArea', 0), ('GarageQual', 81), ('GarageCond', 81), ('PavedDrive', 0), ('WoodDeckSF', 0), ('OpenPorchSF', 0), ('EnclosedPorch', 0), ('3SsnPorch', 0), ('ScreenPorch', 0), ('PoolArea', 0), ('PoolQC', 1453), ('Fence', 1179), ('MiscFeature', 1406), ('MiscVal', 0), ('MoSold', 0), ('YrSold', 0), ('SaleType', 0), ('SaleCondition', 0), ('SalePrice', 0)]
And even fancier:
[(n, v.isnull().sum()) for n, v in Ames.iteritems() if v.isnull().sum() > 0]
[('LotFrontage', 259), ('Alley', 1369), ('MasVnrType', 8), ('MasVnrArea', 8), ('BsmtQual', 37), ('BsmtCond', 37), ('BsmtExposure', 38), ('BsmtFinType1', 37), ('BsmtFinType2', 38), ('Electrical', 1), ('FireplaceQu', 690), ('GarageType', 81), ('GarageYrBlt', 81), ('GarageFinish', 81), ('GarageQual', 81), ('GarageCond', 81), ('PoolQC', 1453), ('Fence', 1179), ('MiscFeature', 1406)]
[(n, v.sum()) for n, v in Ames.isnull().iteritems() if v.sum() > 0]
[('LotFrontage', 259), ('Alley', 1369), ('MasVnrType', 8), ('MasVnrArea', 8), ('BsmtQual', 37), ('BsmtCond', 37), ('BsmtExposure', 38), ('BsmtFinType1', 37), ('BsmtFinType2', 38), ('Electrical', 1), ('FireplaceQu', 690), ('GarageType', 81), ('GarageYrBlt', 81), ('GarageFinish', 81), ('GarageQual', 81), ('GarageCond', 81), ('PoolQC', 1453), ('Fence', 1179), ('MiscFeature', 1406)]
And let’s convert that into a data frame:
pd.DataFrame([(n, v.isnull().sum()) for n, v in Ames.iteritems() if v.isnull().sum() > 0])
0 1
0 LotFrontage 259
1 Alley 1369
2 MasVnrType 8
3 MasVnrArea 8
4 BsmtQual 37
5 BsmtCond 37
6 BsmtExposure 38
7 BsmtFinType1 37
8 BsmtFinType2 38
9 Electrical 1
10 FireplaceQu 690
11 GarageType 81
12 GarageYrBlt 81
13 GarageFinish 81
14 GarageQual 81
15 GarageCond 81
16 PoolQC 1453
17 Fence 1179
18 MiscFeature 1406
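As an aside (a sketch, not from the original notes), the same table can be produced more directly with vectorized pandas operations:
# Count missing values per column and keep only the columns that have any.
missing_counts = Ames.isnull().sum()
missing_counts[missing_counts > 0].sort_values(ascending = False)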
As we have seen, missing data is not a big problem for XGBoost, but it is for some other algorithms. And it is just generally good to know what sort of missing data issues you might have.
In this case, missing seems to mean that the item doesn’t apply:
X['MiscFeature'].value_counts()
Shed 49
Gar2 2
Othr 2
TenC 1
Name: MiscFeature, dtype: int64
X['Fireplaces'].value_counts()
0 690
1 650
2 115
3 5
Name: Fireplaces, dtype: int64
X['FireplaceQu'].value_counts()
Gd 380
TA 313
Fa 33
Ex 24
Po 20
Name: FireplaceQu, dtype: int64
X['FireplaceQu'].isnull().value_counts()
False 770
True 690
Name: FireplaceQu, dtype: int64
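Since missing here means “no fireplace” (or “no pool”, etc.), one option is to make that explicit by treating missing as its own category. A minimal sketch (the recoded copy X2 is hypothetical and not used below):
# Recode NaN as an explicit 'None' category for one of these columns.
X2 = X.copy()
X2['FireplaceQu'] = X2['FireplaceQu'].fillna('None')
X2['FireplaceQu'].value_counts()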
We’d like to perform two types of transformation on our data:
For categorical variables, we need to create (multiple) columns of 0’s and 1’s.
For numerical variables, let’s compute standardized versions (mean = 0, standard deviation = 1).
We could do this directly to our data (Ames or X), but instead, let’s create a pipeline that can perform these actions on any data set, since we’re likely to want to do this sort of thing frequently.
We start by importing some more stuff.
from sklearn.pipeline import Pipeline
import sklearn.preprocessing as pre
from sklearn.compose import ColumnTransformer
We can use ColumnTransformer() to perform our transformations on columns.
Let’s start by writing a couple of functions that detect whether a column is numeric or categorical. ColumnTransformer() can take a function that returns a list of column names or a list of booleans.
from pandas.api.types import is_numeric_dtype
[is_numeric_dtype(v) for n, v in X.iteritems()]
[True, True, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, True, True, True, False, False, False, False, True, True, True, True, True, True, True, True, True, True, False, True, False, True, False, False, True, False, True, True, False, False, False, True, True, True, True, True, True, False, False, False, True, True, True, False, False]
def numeric_cols(X):
return [is_numeric_dtype(v) for n, v in X.iteritems()]
def cat_cols(X):
return [not is_numeric_dtype(v) for n, v in X.iteritems()]
numeric_cols(X)
[True, True, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, True, True, True, False, False, False, False, True, True, True, True, True, True, True, True, True, True, False, True, False, True, False, False, True, False, True, True, False, False, False, True, True, True, True, True, True, False, False, False, True, True, True, False, False]
cat_cols(X)
[False, False, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, True, True, True, True, True, False, True, True, True, True, True, True, True, False, True, False, False, False, True, True, True, True, False, False, False, False, False, False, False, False, False, False, True, False, True, False, True, True, False, True, False, False, True, True, True, False, False, False, False, False, False, True, True, True, False, False, False, True, True]
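As an alternative (a sketch, not used in what follows), scikit-learn provides make_column_selector(), which builds the same kind of column-selecting callable from dtype information:
# Built-in helper that selects columns by dtype when called on a data frame.
from sklearn.compose import make_column_selector
numeric_selector = make_column_selector(dtype_include = 'number')
cat_selector = make_column_selector(dtype_exclude = 'number')
numeric_selector(X)[:5]   # first few numeric column names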
Now let’s create a data transformer using some of the preprocessing functions from scikit-learn.
data_transformer = ColumnTransformer(
transformers = [
('rescale numeric', pre.StandardScaler(), numeric_cols),
('recode categorical',
pre.OneHotEncoder(handle_unknown = 'ignore', sparse = False),
cat_cols)
])
data_transformer.fit_transform(X).mean(axis = 0)
array([ 8.24302574e-17, -1.50412407e-16, nan, -4.20278348e-17,
-7.68030996e-17, 3.69263220e-16, 1.03298268e-15, 4.51891188e-15,
nan, 1.50564492e-17, 1.65316771e-16, -7.37613927e-17,
2.06303772e-16, 7.68601317e-17, -3.26983494e-17, 1.92026760e-16,
-1.44633164e-16, 4.10630434e-17, 9.99961149e-18, 2.77099500e-16,
-3.01128985e-17, -1.10889228e-16, 2.23793586e-16, 4.38766223e-17,
2.70103574e-16, nan, 2.58545088e-17, -2.02273510e-17,
2.19002898e-16, 3.35728401e-17, 1.37903387e-16, -4.00573790e-16,
1.19919295e-16, -7.94588900e-16, 2.40960220e-16, -6.39518879e-17,
3.56610099e-14, 6.84931507e-03, 4.45205479e-02, 1.09589041e-02,
7.88356164e-01, 1.49315068e-01, 4.10958904e-03, 9.95890411e-01,
3.42465753e-02, 2.80821918e-02, 9.37671233e-01, 3.31506849e-01,
2.80821918e-02, 6.84931507e-03, 6.33561644e-01, 4.31506849e-02,
3.42465753e-02, 2.46575342e-02, 8.97945205e-01, 9.99315068e-01,
6.84931507e-04, 1.80136986e-01, 6.43835616e-02, 3.21917808e-02,
2.73972603e-03, 7.20547945e-01, 9.46575342e-01, 4.45205479e-02,
8.90410959e-03, 1.16438356e-02, 1.36986301e-03, 1.09589041e-02,
3.97260274e-02, 1.91780822e-02, 1.02739726e-01, 3.49315068e-02,
6.84931507e-02, 5.41095890e-02, 2.53424658e-02, 1.16438356e-02,
3.35616438e-02, 1.54109589e-01, 6.16438356e-03, 5.00000000e-02,
2.80821918e-02, 5.27397260e-02, 7.73972603e-02, 1.71232877e-02,
5.06849315e-02, 4.04109589e-02, 5.89041096e-02, 1.71232877e-02,
2.60273973e-02, 7.53424658e-03, 3.28767123e-02, 5.54794521e-02,
8.63013699e-01, 5.47945205e-03, 1.30136986e-02, 7.53424658e-03,
1.78082192e-02, 1.36986301e-03, 3.42465753e-03, 1.36986301e-03,
4.10958904e-03, 9.89726027e-01, 6.84931507e-04, 1.36986301e-03,
6.84931507e-04, 6.84931507e-04, 1.36986301e-03, 8.35616438e-01,
2.12328767e-02, 3.56164384e-02, 2.94520548e-02, 7.80821918e-02,
1.05479452e-01, 9.58904110e-03, 4.97260274e-01, 5.47945205e-03,
7.53424658e-03, 3.04794521e-01, 2.53424658e-02, 4.45205479e-02,
8.90410959e-03, 7.81506849e-01, 7.53424658e-03, 1.95890411e-01,
4.79452055e-03, 1.36986301e-03, 6.84931507e-04, 9.82191781e-01,
6.84931507e-04, 6.84931507e-04, 6.84931507e-04, 7.53424658e-03,
3.42465753e-03, 4.10958904e-03, 1.36986301e-02, 6.84931507e-04,
1.36986301e-03, 3.42465753e-02, 6.84931507e-04, 4.17808219e-02,
1.52054795e-01, 6.84931507e-04, 1.50684932e-01, 7.39726027e-02,
1.36986301e-03, 1.71232877e-02, 3.52739726e-01, 1.41095890e-01,
1.78082192e-02, 1.36986301e-02, 2.05479452e-03, 4.79452055e-03,
1.71232877e-02, 6.84931507e-04, 4.10958904e-02, 1.41780822e-01,
6.84931507e-03, 1.46575342e-01, 6.84931507e-04, 9.72602740e-02,
3.42465753e-03, 1.78082192e-02, 3.45205479e-01, 1.34931507e-01,
2.60273973e-02, 1.02739726e-02, 3.04794521e-01, 5.91780822e-01,
8.76712329e-02, 5.47945205e-03, 3.56164384e-02, 9.58904110e-03,
3.34246575e-01, 6.20547945e-01, 2.05479452e-03, 1.91780822e-02,
1.00000000e-01, 6.84931507e-04, 8.78082192e-01, 1.00000000e-01,
4.34246575e-01, 4.43150685e-01, 1.64383562e-02, 4.10958904e-03,
2.05479452e-03, 8.28767123e-02, 2.39726027e-02, 4.23287671e-01,
4.44520548e-01, 2.53424658e-02, 3.08219178e-02, 4.45205479e-02,
1.36986301e-03, 8.97945205e-01, 2.53424658e-02, 1.51369863e-01,
9.17808219e-02, 7.80821918e-02, 6.52739726e-01, 2.60273973e-02,
1.50684932e-01, 1.01369863e-01, 2.86301370e-01, 5.06849315e-02,
9.10958904e-02, 2.94520548e-01, 2.53424658e-02, 1.30136986e-02,
2.26027397e-02, 9.58904110e-03, 3.15068493e-02, 3.69863014e-02,
8.60273973e-01, 2.60273973e-02, 6.84931507e-04, 9.78082192e-01,
1.23287671e-02, 4.79452055e-03, 1.36986301e-03, 2.73972603e-03,
5.07534247e-01, 3.35616438e-02, 1.65068493e-01, 6.84931507e-04,
2.93150685e-01, 6.50684932e-02, 9.34931507e-01, 6.43835616e-02,
1.84931507e-02, 2.05479452e-03, 6.84931507e-04, 9.13698630e-01,
6.84931507e-04, 6.84931507e-02, 2.67123288e-02, 4.01369863e-01,
5.03424658e-01, 9.58904110e-03, 3.42465753e-03, 2.12328767e-02,
2.32876712e-02, 1.02739726e-02, 6.84931507e-04, 9.31506849e-01,
1.64383562e-02, 2.26027397e-02, 2.60273973e-01, 1.36986301e-02,
2.14383562e-01, 4.72602740e-01, 4.10958904e-03, 5.95890411e-01,
1.30136986e-02, 6.02739726e-02, 6.16438356e-03, 2.65068493e-01,
5.54794521e-02, 2.41095890e-01, 2.89041096e-01, 4.14383562e-01,
5.54794521e-02, 2.05479452e-03, 3.28767123e-02, 9.58904110e-03,
2.05479452e-03, 8.97945205e-01, 5.54794521e-02, 1.36986301e-03,
2.39726027e-02, 6.16438356e-03, 4.79452055e-03, 9.08219178e-01,
5.54794521e-02, 6.16438356e-02, 2.05479452e-02, 9.17808219e-01,
1.36986301e-03, 1.36986301e-03, 2.05479452e-03, 9.95205479e-01,
4.04109589e-02, 3.69863014e-02, 1.07534247e-01, 7.53424658e-03,
8.07534247e-01, 1.36986301e-03, 1.36986301e-03, 3.35616438e-02,
6.84931507e-04, 9.63013699e-01, 2.94520548e-02, 2.73972603e-03,
1.36986301e-03, 6.16438356e-03, 3.42465753e-03, 3.42465753e-03,
8.35616438e-02, 2.05479452e-03, 8.67808219e-01, 6.91780822e-02,
2.73972603e-03, 8.21917808e-03, 1.36986301e-02, 8.20547945e-01,
8.56164384e-02])
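Notice the nan values in the output above: the numeric columns that contain missing values (LotFrontage, MasVnrArea, GarageYrBlt) still contain NaN after rescaling, so their column means are reported as nan. XGBoost can handle these, but many other models cannot. If we wanted to deal with them inside the pipeline, one option (a sketch, not used below) is to nest an imputation step inside the ColumnTransformer:
# Possible extension: impute missing numeric values before rescaling.
from sklearn.impute import SimpleImputer
numeric_transformer = Pipeline(steps = [
    ('impute', SimpleImputer(strategy = 'median')),   # fill NaN with the column median
    ('rescale', pre.StandardScaler())
])
data_transformer2 = ColumnTransformer(
    transformers = [
        ('numeric', numeric_transformer, numeric_cols),
        ('categorical',
         pre.OneHotEncoder(handle_unknown = 'ignore', sparse = False),
         cat_cols)
    ])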
We can add this preprocessing to a pipeline that takes our “raw” data as input, transforms the columns, and then runs XGBoost:
from sklearn.pipeline import Pipeline
xgb_pipeline = Pipeline(steps = [
('preprocess', data_transformer),
('XGB', xgb.XGBRegressor())
])
xgb_pipeline.fit(X, y)
Pipeline(steps=[('preprocess',
ColumnTransformer(transformers=[('rescale numeric',
StandardScaler(),
<function numeric_cols at 0x17eb01ab0>),
('recode categorical',
OneHotEncoder(handle_unknown='ignore',
sparse=False),
<function cat_cols at 0x17efc6680>)])),
('XGB',
XGBRegressor(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree...
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100,
n_jobs=12, num_parallel_tree=1, predictor='auto',
random_state=0, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1,
verbosity=None))])
xgb_pipeline.score(X,y)
0.9996191793638435
Rather than test parameter combinations by coding them up individually, we can use GridSearchCV() to do that work for us. We just need to provide the values for the parameters that we want to try out.
The XGB__ prefix is here because that’s how we named the XGBRegressor() step in our pipeline above. If other steps in the process also had parameters to explore, we could set those here as well.
param_grid = {
'XGB__gamma': [0, 1, 2],
'XGB__reg_lambda': [1, 2, 5],
'XGB__max_depth': [1, 3, 5],
'XGB__n_estimators': [50],
'XGB__tree_method': ['hist']
}
As the name suggests, GridSearchCV() does something else as well: it performs \(k\)-fold cross-validation. The code below will fit our model \(27 \times 5 = 135\) times (27 parameter combinations, each evaluated with 5-fold cross-validation) and save the results for us.
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(
estimator = xgb_pipeline,
param_grid = param_grid,
cv = 5).fit(X,y)
search.best_score_
0.8716965399696747
search.best_params_
{'XGB__gamma': 0, 'XGB__max_depth': 3, 'XGB__n_estimators': 50, 'XGB__reg_lambda': 2, 'XGB__tree_method': 'hist'}
search.best_estimator_
Pipeline(steps=[('preprocess',
ColumnTransformer(transformers=[('rescale numeric',
StandardScaler(),
<function numeric_cols at 0x17eb01ab0>),
('recode categorical',
OneHotEncoder(handle_unknown='ignore',
sparse=False),
<function cat_cols at 0x17efc6680>)])),
('XGB',
XGBRegressor(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree...
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50,
n_jobs=12, num_parallel_tree=1, predictor='auto',
random_state=0, reg_alpha=0, reg_lambda=2,
scale_pos_weight=1, subsample=1,
tree_method='hist', validate_parameters=1,
verbosity=None))])
type(search.cv_results_)
<class 'dict'>
Results = pd.DataFrame(search.cv_results_)
library(dplyr)
py$Results |> names()
[1] "mean_fit_time" "std_fit_time"
[3] "mean_score_time" "std_score_time"
[5] "param_XGB__gamma" "param_XGB__max_depth"
[7] "param_XGB__n_estimators" "param_XGB__reg_lambda"
[9] "param_XGB__tree_method" "params"
[11] "split0_test_score" "split1_test_score"
[13] "split2_test_score" "split3_test_score"
[15] "split4_test_score" "mean_test_score"
[17] "std_test_score" "rank_test_score"
dim(py$Results)
[1] 27 18
py$Results |>
arrange(- mean_test_score) |>
mutate(mean_test_score = round(mean_test_score, 4)) |>
dplyr::select(mean_test_score, matches('param_')) |>
DT::datatable()
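Since GridSearchCV() refits the best model on the full data set by default (refit = True), the search object itself behaves like a fitted pipeline; a quick sketch:
# Predictions and score come from the refitted best pipeline.
search.predict(X.head())   # predicted sale prices for the first few houses
search.score(X, y)         # R^2 of the refitted best model on the full data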
XGBoost handles classification with the same machinery as regression; only the loss function (and hence the gradient and hessian it supplies) changes. So we get classification almost “for free”. For binary classification, the raw prediction \(y\) is the log odds:
\[ \log \mbox{odds} = \log (\frac{p}{1-p}) = y \] \[ \exp(y) = \frac{p}{1-p} \]
\[ \frac{ 1}{ 1 + \exp(-y)} = p \]
\[ l(y_i, \hat p_i) = -\log\left(\hat p_i^{\,y_i} \cdot (1-\hat p_i)^{1-y_i}\right) = - y_i \log \hat p_i - (1-y_i) \log(1-\hat p_i) \] This is often called the “log loss”.
Let’s calculate the gradient and hessian for the log loss.
\[ g_i = \frac {\partial l}{\partial \hat y_i} = \frac {\partial l}{\partial \hat p_i} \frac {\partial \hat p_i}{\partial \hat y_i} \]
\[ \frac {\partial l}{\partial \hat p_i} = -\frac{y_i}{\hat p_i} + \frac{1-y_i}{1-\hat p_i} = \frac{-y_i (1 - \hat p_i) + (1 - y_i)\hat p_i}{\hat p_i (1 - \hat p_i)} = \frac{\hat p_i - y_i}{\hat p_i (1 - \hat p_i)} = - \frac{y_i - \hat p_i}{\hat p_i (1-\hat p_i)} \] and
\[ \frac {\partial \hat y_i}{\partial \hat p_i} = \frac{1}{\hat p_i} + \frac{1}{1 - \hat p_i} = \frac{1 - \hat p_i + \hat p_i}{ \hat p_i (1-\hat p_i)} = \frac{1}{ \hat p_i (1-\hat p_i)}, \quad \mbox{so} \quad \frac{\partial \hat p_i}{\partial \hat y_i} = \hat p_i (1-\hat p_i) \]
So \[ g_i = \frac {\partial l}{\partial \hat p_i} \frac {\partial \hat p_i}{\partial \hat y_i} = - \frac{y_i - \hat p_i}{\hat p_i (1-\hat p_i)} \cdot \hat p_i (1-\hat p_i) = - (y_i - \hat p_i) = - \mbox{residual} \] and \[ h_i = \frac{\partial g_i}{\partial \hat p_i} \frac {\partial \hat p_i}{\partial \hat y_i} = 1 \cdot \hat p_i (1 - \hat p_i) = \hat p_i (1 - \hat p_i) \]
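Here is a sketch (not from the original notes) of how these gradient and hessian formulas could be supplied to XGBoost’s native API as a custom objective; the booster only needs \(g_i\) and \(h_i\) to fit its trees:
def log_loss_obj(predt, dtrain):
    # predt holds raw scores (log odds); convert to probabilities first.
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-predt))
    grad = p - y             # g_i = -(y_i - p_i)
    hess = p * (1.0 - p)     # h_i = p_i (1 - p_i)
    return grad, hess
# Hypothetical usage with a numeric feature matrix X_num and 0/1 labels:
# dtrain = xgb.DMatrix(X_num, label = labels)
# booster = xgb.train({'max_depth': 3}, dtrain, num_boost_round = 50, obj = log_loss_obj)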
The following things typically need to be done with our data before fitting our XGBoost model.
Convert categorical variables into 1-hot vectors (0-1 vectors that have the value 1 for exactly one of the levels of the categorical variable).
Rescale quantitative data.
Many algorithms work better if the data are on a similar and modest scale. There are a number of normalizations that can be done. A common one is standardization: \(z = \frac{x - \overline{x}}{s_x}\).
Remove predictors (features) that won’t be used in modeling.
Separate the predictors (features) from the response (labels). In Python examples, these are often called X and y. That’s good for identifying their roles, but not so great for telling you what data they are associated with.
For a given example, we could do these data transformations in a number of different ways. But sklearn provides a general framework for this that has some important advantages:
The code is easier to read, both because it is shorter and because we are using a standard toolkit.
The pipeline can be reused on multiple data sets. In particular, we can be sure that the same transformations are happening for our training data and our test data.
There are tools for combining pipelines with cross-validation and grid search over parameters, making it very easy to explore multiple versions of a model (different tuning parameter settings) or even different model types on the same data.
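For example (a sketch; the *_Ames names and the split are assumptions, not part of the original notes), a fitted pipeline guarantees that test data is transformed exactly the way the training data was:
# The pipeline is fit on training data only; the same preprocessing is then
# applied automatically when scoring the held-out test data.
X_Ames_train, X_Ames_test, y_Ames_train, y_Ames_test = train_test_split(
    X, y, test_size = 0.25, random_state = 12345)
xgb_pipeline.fit(X_Ames_train, y_Ames_train)
xgb_pipeline.score(X_Ames_test, y_Ames_test)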