Early stopping in LightGBM does not work when RMSLE is the evaluation metric

ron*_*000 2 python machine-learning non-linear-regression lightgbm early-stopping

I am trying to train a LightGBM model in Python with rmsle as the evaluation metric, but I run into a problem when I try to enable early stopping.

Here is my code:

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

df_train = pd.read_csv('train_data.csv')
X_train = df_train.drop('target', axis=1)
y_train = np.log(df_train['target'])

sample_params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'random_state': 42,
    'metric': 'rmsle',
    'lambda_l1': 5,
    'lambda_l2': 5,
    'num_leaves': 5,
    'bagging_freq': 5,
    'max_depth': 5,
    'max_bin': 5,
    'min_child_samples': 5,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'learning_rate': 0.1,
}

X_train_tr, X_train_val, y_train_tr, y_train_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

def train_lightgbm(X_train_tr, y_train_tr, X_train_val, y_train_val, params, num_boost_round, early_stopping_rounds, verbose_eval):
    d_train = lgb.Dataset(X_train_tr, y_train_tr)
    d_val = lgb.Dataset(X_train_val, y_train_val)
    model = lgb.train(
        params=params,
        train_set=d_train,
        num_boost_round=num_boost_round,
        valid_sets=d_val,
        early_stopping_rounds=early_stopping_rounds,
        verbose_eval=verbose_eval,
    )
    return model

model = train_lightgbm(
        X_train_tr, 
        y_train_tr, 
        X_train_val, 
        y_train_val, 
        params=sample_params,
        num_boost_round=500,
        early_stopping_rounds=True,
        verbose_eval=1
)

df_test = pd.read_csv('test_data.csv')
X_test = df_test.drop('target', axis=1)
y_test = np.log(df_test['target'])

df_train['prediction'] = np.exp(model.predict(X_train))
df_test['prediction'] = np.exp(model.predict(X_test))

def rmsle(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    return np.sqrt(np.mean(np.power(np.log1p(y_true + 1) - np.log1p(y_pred + 1), 2)))

metric = rmsle(y_test, df_test['prediction'])
print('Test Metric Value:', round(metric, 4))


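If I set early_stopping_rounds=False in the call to train_lightgbm, the code runs without problems.

However, if I set early_stopping_rounds=True, it raises the following:

ValueError: For early stopping, at least one dataset and eval metric is required for evaluation

If I run a similar script but use 'metric': 'rmse' instead of 'rmsle' in sample_params, it runs without problems even with early_stopping_rounds=True.

What do I need to add so that LightGBM recognizes my dataset and evaluation metric? Thank you!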

Mar*_*ani 7

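rmsle is not supported as a built-in metric in LightGBM (check the list of available metrics here).

To use it as a custom metric, you have to define a custom evaluation function: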

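def rmsle_lgbm(y_pred, data):
    # data is the lgb.Dataset being evaluated; its labels are the true targets
    y_true = np.array(data.get_label())
    score = np.sqrt(np.mean(np.power(np.log1p(y_true) - np.log1p(y_pred), 2)))
    # Return (metric name, metric value, is_higher_better); lower RMSLE is better
    return 'rmsle', score, False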
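As a quick sanity check, the custom metric can be exercised without training anything. This is only a sketch: FakeDataset is a hypothetical stand-in that mimics the single method the function needs (get_label()), and the label values are made up.

```python
import numpy as np

def rmsle_lgbm(y_pred, data):
    """LightGBM-style custom metric: (name, value, is_higher_better)."""
    y_true = np.array(data.get_label())
    score = np.sqrt(np.mean(np.power(np.log1p(y_true) - np.log1p(y_pred), 2)))
    return 'rmsle', score, False

class FakeDataset:
    """Hypothetical stand-in for lgb.Dataset; only get_label() is provided."""
    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label

labels = [1.0, 10.0, 100.0]  # made-up targets

# Perfect predictions should give an error of exactly zero.
name, perfect, higher_better = rmsle_lgbm(np.array(labels), FakeDataset(labels))

# Predictions that are twice the target should give a positive error.
_, doubled, _ = rmsle_lgbm(2 * np.array(labels), FakeDataset(labels))

print(name, perfect, higher_better)
print(doubled)
```

The third element of the returned tuple tells LightGBM whether larger values are better; for an error metric such as RMSLE it should be False, otherwise early stopping would track the wrong direction.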

Redefine your parameter dictionary like this:

params = {
....
'objective': 'regression',
'metric': 'custom', # <=============
....
}

Then train:

model = lgb.train(
    params=params,
    train_set=d_train,
    num_boost_round=num_boost_round,
    valid_sets=d_val,
    early_stopping_rounds=early_stopping_rounds,
    verbose_eval=verbose_eval,
    feval=rmsle_lgbm,  # <=============
)

PS: np.log(y + 1) = np.log1p(y), so np.log1p(y + 1) in your rmsle function looks like a mistake.
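The identity in the PS is easy to verify numerically. This small check uses the standard library's math.log1p, which computes the same function as np.log1p for a scalar; the value of y is arbitrary.

```python
import math

y = 9.0  # arbitrary example value

correct = math.log1p(y)      # log(1 + 9)  = log(10)
same = math.log(y + 1)       # log(9 + 1)  = log(10), identical by definition
shifted = math.log1p(y + 1)  # log(1 + 10) = log(11), one unit too many

print(math.isclose(correct, same))
print(shifted - correct)
```

Because the extra + 1 is applied to both the true and the predicted values, it does not cancel out: it changes the size of the logarithmic differences, so the reported score is no longer the standard RMSLE.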