Load and prep data
import pandas as pd
import xgboost as xgb

Five_People = pd.read_csv('data/five-people.csv',
                          dtype = {'sex': 'category', 'daily_comp': 'category'})
Five_People
  sex  age daily_comp  score
0   M   13          Y      4
1   F   23          Y      3
2   M   55          N     -1
3   F   14          N      1
4   F   60          N     -2
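(If you don't have the CSV file, the same small data frame can be built inline; this sketch just transcribes the printed rows above.)

# the same five rows, entered directly
Five_People = pd.DataFrame({
    'sex': pd.Categorical(['M', 'F', 'M', 'F', 'F']),
    'age': [13, 23, 55, 14, 60],
    'daily_comp': pd.Categorical(['Y', 'Y', 'N', 'N', 'N']),
    'score': [4, 3, -1, 1, -2],
})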
Five_People.dtypes
sex           category
age              int64
daily_comp    category
score            int64
dtype: object
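Because sex and daily_comp are stored as categoricals, we can inspect their levels; the (alphabetical) level order also determines which dummy column comes first in the get_dummies() calls below:

# category levels default to alphabetical order
Five_People['sex'].cat.categories          # Index(['F', 'M'], dtype='object')
Five_People['daily_comp'].cat.categories   # Index(['N', 'Y'], dtype='object')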
X = Five_People.iloc[:, 0:3].copy()   # predictors: sex, age, daily_comp
y = Five_People.iloc[:, 3:4]          # response: score
# convert the binary categorical variables to 0/1 indicator columns;
# get_dummies() makes one column per level, and we keep just the first
# (an indicator for F and for N, respectively)
X['sex'] = pd.get_dummies(X['sex']).iloc[:, 0:1]
X['daily_comp'] = pd.get_dummies(X['daily_comp']).iloc[:, 0:1]
X
X
   sex  age  daily_comp
0    0   13           0
1    1   23           0
2    0   55           1
3    1   14           1
4    1   60           1
X.dtypes
sex           uint8
age           int64
daily_comp    uint8
dtype: object
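As an aside, pd.get_dummies() can encode all of the categorical columns in one call; note that with drop_first = True it keeps the complementary indicators (M and Y) rather than the F and N indicators used above. A sketch of this alternative:

# dummy-code every categorical column at once, dropping the first level
# of each to avoid redundant columns; numeric columns pass through as-is
X_alt = pd.get_dummies(Five_People.iloc[:, 0:3], drop_first = True)
X_alt   # columns: age, sex_M, daily_comp_Y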
Fit with XGBRegressor()
For the moment, let’s just use the default values of the tuning parameters. You can see what those are in the output below.
model = xgb.XGBRegressor()
model.fit(X, y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=12,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
Five_People['prediction'] = model.predict(X)
Five_People
  sex  age daily_comp  score  prediction
0   M   13          Y      4    3.998966
1   F   23          Y      3    3.000029
2   M   55          N     -1   -0.999993
3   F   14          N      1    0.999880
4   F   60          N     -2   -1.998883
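The fitted values agree with the observed scores to about three decimal places. A quick way to quantify that (a sketch using NumPy):

import numpy as np

# largest absolute difference between observed and fitted scores
np.abs(Five_People['score'] - Five_People['prediction']).max()   # about 0.001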
That the model does this well should not surprise us, since it allows for
- 100 trees (n_estimators = 100)
- of depth 6 (max_depth = 6)
- with no penalty for adding leaves (gamma = 0)
- a modest penalty for large weights (reg_lambda = 1)
And we already know that one tree with 5 leaves can fit the training data perfectly.
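We can check that directly. The settings below are one sketch of how to grow a single unpenalized tree; max_depth = 4 is comfortably more depth than 5 rows require.

# a single tree with no shrinkage and no penalties should interpolate
# the training data (ending up with 5 leaves, one per distinct score)
one_tree = xgb.XGBRegressor(n_estimators = 1,   # just one tree
                            max_depth = 4,      # plenty of room for 5 leaves
                            eta = 1,            # no shrinkage
                            reg_lambda = 0,     # no penalty on leaf weights
                            gamma = 0)          # no penalty for adding leaves
one_tree.fit(X, y)
one_tree.predict(X)   # should reproduce score: 4, 3, -1, 1, -2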
Let’s try setting some of those tuning parameters.
model = xgb.XGBRegressor(tree_method = 'hist',
                         max_depth = 2, max_leaves = 3,
                         reg_lambda = 0, gamma = 0,
                         grow_policy = 'lossguide',
                         n_estimators = 2, eta = 1)
model.fit(X, y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
eta=1, gamma=0, gpu_id=-1, grow_policy='lossguide',
importance_type=None, interaction_constraints='', learning_rate=1,
max_delta_step=0, max_depth=2, max_leaves=3, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=2, n_jobs=12,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=0, scale_pos_weight=1, subsample=1, tree_method='hist',
validate_parameters=1, ...)
Five_People['prediction'] = model.predict(X)
Five_People
  sex  age daily_comp  score  prediction
0   M   13          Y      4         4.0
1   F   23          Y      3         3.0
2   M   55          N     -1        -1.0
3   F   14          N      1         1.0
4   F   60          N     -2        -2.0
import matplotlib.pyplot as plt

xgb.plot_tree(model, num_trees = 0)   # plot the first tree
plt.show()
xgb.plot_tree(model, num_trees = 1)   # plot the second tree
plt.show()
Notes:
- num_trees is not an ideal name for the argument to plot_tree(). It specifies the index of one tree, not a number of trees.
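If the plots are hard to read (or graphviz isn't available for rendering them), the fitted trees can also be dumped as text; this is a sketch using the underlying Booster object:

# print each fitted tree as an indented set of split rules
for i, tree in enumerate(model.get_booster().get_dump()):
    print(f'--- tree {i} ---')
    print(tree)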
It turns out that XGBoost (at least with its default settings) starts its predictions at \(f_{-1}(x) = 0.5\) for all values of the predictor \(x\) (rather than minimizing the loss for a tree with one node). We can see this by looking at the two trees above.
\[
F(x) = f_{-1}(x) + f_0(x) + f_1(x)
\]
\[
F(x) = 0.5 + f_0(x) + f_1(x)
\]
You can adjust this starting value with base_score if you ever want to.
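For example, here is a minimal sketch that refits the tuned model from above but starts its predictions at 0 instead of 0.5:

# same tuning parameters as before, but with the starting value moved to 0
model0 = xgb.XGBRegressor(tree_method = 'hist',
                          max_depth = 2, max_leaves = 3,
                          reg_lambda = 0, gamma = 0,
                          grow_policy = 'lossguide',
                          n_estimators = 2, eta = 1,
                          base_score = 0)
model0.fit(X, y)
model0.predict(X)   # now F(x) = 0 + f_0(x) + f_1(x)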