[Machine Learning] XGBClassifier Default Parameters and Tuning Summary


The parameters below come from XGBClassifier in xgboost.sklearn.

I. Parameter meanings

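n_estimators: the number of weak learners (boosting rounds).
booster: the type of weak learner. The default is 'gbtree', which uses tree-based models. 'gblinear' uses a linear model as the weak learner instead.
learning_rate: the learning rate. The default is 0.3. Suggested candidate values: [0.01, 0.015, 0.025, 0.05, 0.1]
gamma: the minimum loss reduction required to split a leaf node further. The default is 0. The larger the value, the more conservative the model. Suggested candidate values: [0, 0.05 ~ 0.1, 0.3, 0.5, 0.7, 0.9, 1]
reg_alpha: the L1 regularization weight. Increasing it makes the model more conservative. Suggested candidate values: [0, 0.01~0.1, 1]
reg_lambda: the L2 regularization weight. Increasing it makes the model more conservative. Suggested candidate values: [0, 0.1, 0.5, 1]
max_depth: the maximum depth of a tree. The default is 6. A sensible setting helps prevent overfitting. Suggested candidate values: [3, 5, 6, 7, 9, 12]
min_child_weight: the minimum sum of instance weights (Hessians) required in a child node. It is often loosely described as the minimum number of samples in a leaf, which is exact only when every sample contributes a Hessian of 1 (for example, squared-error loss with unit weights). Suggested candidate values: [1, 3, 5, 7]
colsample_bytree: the column sampling ratio. When a tree is built, a subset of features is sampled, and colsample_bytree controls the fraction. The default is 1, meaning all features are used.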
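min_child_weight is compared against the sum of second-order gradients (Hessians) in a node, not against a raw sample count. The sketch below illustrates the difference for binary logistic loss with unit sample weights; the helper names are hypothetical and nothing here calls xgboost itself.

```python
import math


def logistic_hessian(raw_score: float) -> float:
    """Second derivative of log loss w.r.t. the raw score: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p * (1.0 - p)


def hessian_sum(raw_scores):
    """Sum of Hessians over the samples in one node (unit sample weights)."""
    return sum(logistic_hessian(s) for s in raw_scores)


# Two nodes with the same number of samples (4) but different confidence.
uncertain = [0.0, 0.0, 0.0, 0.0]  # predicted probability 0.5 for each sample
confident = [4.0, 4.0, 4.0, 4.0]  # predicted probability close to 1

print(hessian_sum(uncertain))  # 4 * 0.25 = 1.0
print(hessian_sum(confident))  # far below 1
```

Under these assumptions, a node of four uncertain samples just reaches a min_child_weight of 1, while a node of four confidently predicted samples falls well short of it, even though both hold the same number of samples.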

II. Tuning steps

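To cut tuning time, I use a greedy-style approach: tune one parameter at a time and fix each at its best value before moving on.

The most thorough approach is to use GridSearch to pick the best combination of parameters jointly, at the cost of many more model fits.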

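1. Choose a relatively high learning rate, for example 0.1, to shorten the time each tuning run takes.
2. Then tune max_depth, min_child_weight, gamma, subsample, and colsample_bytree.
Suitable candidate values for these parameters:
max_depth: [3, 5, 6, 7, 9, 12, 15, 17, 25]
min_child_weight: [1, 3, 5, 7]
gamma: [0, 0.05 ~ 0.1, 0.3, 0.5, 0.7, 0.9, 1]
subsample: [0.6, 0.7, 0.8, 0.9, 1]
colsample_bytree: [0.6, 0.7, 0.8, 0.9, 1]
3. Tune the regularization parameters reg_lambda and reg_alpha. Suitable candidate values:
reg_alpha (L1 regularization term): [0, 0.01~0.1, 1]
reg_lambda (L2 regularization term): [0, 0.1, 0.5, 1]
4. Lower the learning rate and continue tuning. Suitable candidate values for the learning rate: [0.01, 0.015, 0.025, 0.05, 0.1]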
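The staged, one-parameter-at-a-time procedure above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: greedy_tune and toy_score are hypothetical names, the gamma and reg_alpha ranges are replaced by their endpoints, and toy_score is a stand-in for a real cross-validated metric.

```python
def greedy_tune(base_params, stages, score_fn):
    """Tune parameters one at a time, keeping each best value before moving on.

    stages:   list of dicts mapping parameter name -> candidate values
    score_fn: callable taking a params dict and returning a score (higher is better)
    """
    best = dict(base_params)
    best_score = score_fn(best)
    for stage in stages:
        for name, candidates in stage.items():
            for value in candidates:
                trial = {**best, name: value}
                score = score_fn(trial)
                if score > best_score:  # keep a value only if it strictly improves
                    best, best_score = trial, score
    return best, best_score


stages = [
    {  # step 2: tree-structure and sampling parameters
        "max_depth": [3, 5, 6, 7, 9, 12, 15, 17, 25],
        "min_child_weight": [1, 3, 5, 7],
        "gamma": [0, 0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 1],
        "subsample": [0.6, 0.7, 0.8, 0.9, 1],
        "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1],
    },
    {  # step 3: regularization
        "reg_alpha": [0, 0.01, 0.1, 1],
        "reg_lambda": [0, 0.1, 0.5, 1],
    },
    {  # step 4: lower the learning rate
        "learning_rate": [0.01, 0.015, 0.025, 0.05, 0.1],
    },
]


def toy_score(p):
    """Stand-in metric with a known optimum, used only to exercise the loop."""
    return (
        -abs(p.get("max_depth", 6) - 5)
        - abs(p.get("subsample", 1) - 0.8)
        - abs(p.get("learning_rate", 0.3) - 0.05)
    )


# Step 1: start from a relatively high learning rate.
best, best_score = greedy_tune({"learning_rate": 0.1}, stages, toy_score)
print(best, best_score)
```

In practice score_fn would presumably fit an XGBClassifier with the given parameters and return a cross-validated score. Because each parameter is tuned with the others held fixed, the result can miss interactions that a joint grid search would find.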

III. How to view the default parameters of XGBClassifier in xgboost

from xgboost.sklearn import XGBClassifier
import numpy as np
x= np.array([[1,1,1], [1,1,0]])
y = np.array([1,0])
c = XGBClassifier()
c.fit(x,y)
print(c)

The output is (the exact fields vary with the installed xgboost version):
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

IV. Searching for parameters with sklearn's GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector


def train(params, train_x, train_y):
    # train_x: a DataFrame; apart from the three categorical features below, all columns are numeric.
    # Categorical features
    categorical_features = ["video_language", "user_area_code", "shelf_area_code"]
    # Print the number of distinct values in each categorical feature.
    print(f"video_language ={train_x.video_language.value_counts().shape}")
    print(f"user_area_code ={train_x.user_area_code.value_counts().shape}")
    print(f"shelf_area_code={train_x.shelf_area_code.value_counts().shape}")
    categorical_transformer = OneHotEncoder(handle_unknown="ignore")
    # Feature preprocessing: pass numeric columns through, one-hot encode the categorical ones.
    # Note: a categorical column stored with a numeric dtype would also be passed through.
    preprocessor = ColumnTransformer(
        transformers=[
            ("pass", "passthrough", make_column_selector(dtype_include=np.number)),
            ("cat", categorical_transformer, categorical_features),
        ]
    )
    pipe = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", XGBClassifier(**params))]
    )
    # Grid search over the model's hyperparameters
    param_grid = {
        "classifier__n_estimators": [50, 100, 150, 200, 300],  # number of trees
        "classifier__learning_rate": [0.05, 0.1, 0.2, 0.3],  # learning rate (eta in the native API)
        "classifier__max_depth": [3, 4, 5, 6, 7],  # maximum tree depth
        "classifier__colsample_bytree": [0.4, 0.6, 0.8, 1],  # fraction of columns sampled per tree
        "classifier__min_child_weight": [1, 2, 3, 4],  # minimum sum of instance weights in a child
    }
    # Build the grid search with 5-fold cross-validation.
    search = GridSearchCV(pipe, param_grid, n_jobs=2, scoring="roc_auc", cv=5)
    # Alternative (needs RandomizedSearchCV imported from sklearn.model_selection):
    # search = RandomizedSearchCV(pipe, param_grid, n_jobs=2, scoring="roc_auc", cv=5)
    search.fit(train_x, train_y)  # best_params_ is only available after fitting
    print(search.best_params_)
    return search
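To get a feel for what this grid costs, the number of model fits can be counted before running the search. The sketch below assumes the learning-rate list is meant to hold four values (0.05, 0.1, 0.2, 0.3) and uses plain parameter names without the pipeline prefix.

```python
import math

param_grid = {
    "n_estimators": [50, 100, 150, 200, 300],
    "learning_rate": [0.05, 0.1, 0.2, 0.3],
    "max_depth": [3, 4, 5, 6, 7],
    "colsample_bytree": [0.4, 0.6, 0.8, 1],
    "min_child_weight": [1, 2, 3, 4],
}
cv_folds = 5

# Exhaustive grid search: every combination is fitted once per fold.
n_combinations = math.prod(len(values) for values in param_grid.values())
n_fits = n_combinations * cv_folds

# One-parameter-at-a-time tuning: each candidate value is tried once.
greedy_candidates = sum(len(values) for values in param_grid.values())
greedy_fits = greedy_candidates * cv_folds

print(n_combinations, n_fits)          # 1600 combinations, 8000 fits
print(greedy_candidates, greedy_fits)  # 22 candidates, 110 fits
```

The gap between the two counts is the reason the one-at-a-time procedure is attractive when tuning time is limited, and also why a randomized search over the same grid can be a reasonable middle ground.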