Mar*_*rov 30 python machine-learning xgboost
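The problem is that my training data does not fit into RAM because of its size. So I need a method that first builds one tree on the whole training set, computes the residuals, builds another tree, and so on (as gradient boosted trees do). Obviously, if I call model = xgb.train(param, batch_dtrain, 2) in a loop it will not help, because in that case it just rebuilds the whole model for each batch.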
Ala*_*ain 32
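Disclaimer: I'm new to xgboost as well, but I think I figured this out.
Try saving your model after you train on the first batch. Then, on successive runs, pass the file path of the saved model to the xgb.train method.
Here's a small experiment I ran to convince myself that it works:
First, split the Boston dataset into training and testing sets. Then split the training set into halves. Fit a model on the first half and get a score that serves as a benchmark. Then fit two models on the second half; one of them gets the additional parameter xgb_model. If passing the extra parameter made no difference, we would expect the two scores to be similar. Fortunately, the model that continues from the saved one scores much better than the one trained on the second half alone.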
import xgboost as xgb
from sklearn.model_selection import train_test_split as ttsplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
X = boston['data']
y = boston['target']

# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(X_train,
                                                     y_train,
                                                     test_size=0.5,
                                                     random_state=0)

xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}

# train on the first half and save the model to disk
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')

# ================= train two versions of the model =====================#
# v1 starts from scratch, v2 continues from the saved model
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')

print(mse(model_1.predict(xg_test), y_test))     # benchmark
print(mse(model_2_v1.predict(xg_test), y_test))  # "before"
print(mse(model_2_v2.predict(xg_test), y_test))  # "after"

# 23.0475232194
# 39.6776876084
# 27.2053239482
Let me know if anything is unclear!
Reference: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/training.py
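To see why continuing from an existing model differs from training from scratch, here is a toy sketch in plain Python. It is not XGBoost's algorithm: it uses one-dimensional regression stumps, and the names fit_stump, boost and mse, the learning rate and the data are all made up for illustration. The point is only that a continued run fits its new trees to the residuals of the trees it already has, computed on the new batch.

```python
# Toy illustration only: 1-D regression stumps, not XGBoost's actual algorithm.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict(model, x):
    """Sum the contributions of all stumps in the ensemble."""
    return sum(lv if x <= t else rv for t, lv, rv in model)


def boost(xs, ys, rounds, model=None, lr=0.5):
    """Add `rounds` stumps; if `model` is given, continue from a copy of it."""
    model = list(model) if model else []
    for _ in range(rounds):
        # Residuals are computed against ALL trees built so far,
        # including those inherited from an earlier batch.
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        t, lv, rv = fit_stump(xs, residuals)
        model.append((t, lr * lv, lr * rv))
    return model


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: y = x^2, split into two interleaved batches.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
batch_1 = (xs[0::2], ys[0::2])
batch_2 = (xs[1::2], ys[1::2])

model_1 = boost(*batch_1, rounds=10)                    # first batch only
continued = boost(*batch_2, rounds=10, model=model_1)   # continue on second batch

print(len(model_1), len(continued))  # 10 20
print(mse(model_1, *batch_2), mse(continued, *batch_2))
```

The continued model keeps the first ten stumps unchanged and appends ten more, which is roughly the role the xgb_model argument plays in the experiment above.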
pau*_*rry 12
There is now (version 0.6?) a process_type parameter (set to 'update' below) that might help. Here's an experiment with it:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error as mse

boston = load_boston()
features = boston.feature_names
X = boston.data
y = boston.target

X = pd.DataFrame(X, columns=features)
y = pd.Series(y, index=X.index)

# split data into training and testing sets
rs = ShuffleSplit(test_size=0.3, n_splits=1, random_state=0)
train_idx, test_idx = next(rs.split(X))  # n_splits=1, so take the only split

# split the training indices into two halves
train_split = round(len(train_idx) / 2)
train1_idx = train_idx[:train_split]
train2_idx = train_idx[train_split:]

X_train = X.loc[train_idx]
X_train_1 = X.loc[train1_idx]
X_train_2 = X.loc[train2_idx]
X_test = X.loc[test_idx]
y_train = y.loc[train_idx]
y_train_1 = y.loc[train1_idx]
y_train_2 = y.loc[train2_idx]
y_test = y.loc[test_idx]

xg_train_0 = xgb.DMatrix(X_train, label=y_train)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)

params = {'objective': 'reg:linear', 'verbose': False}

model_0 = xgb.train(params, xg_train_0, 30)  # full training set
model_1 = xgb.train(params, xg_train_1, 30)  # first half only
model_1.save_model('model_1.model')

model_2_v1 = xgb.train(params, xg_train_2, 30)                     # second half, from scratch
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model=model_1)  # second half, continuing model_1

# update the existing trees of model_1 with the second half
params.update({'process_type': 'update',
               'updater': 'refresh',
               'refresh_leaf': True})
model_2_v2_update = xgb.train(params, xg_train_2, 30, xgb_model=model_1)

print('full train\t', mse(model_0.predict(xg_test), y_test))  # benchmark
print('model 1 \t', mse(model_1.predict(xg_test), y_test))
print('model 2 \t', mse(model_2_v1.predict(xg_test), y_test))  # "before"
print('model 1+2\t', mse(model_2_v2.predict(xg_test), y_test))  # "after"
print('model 1+update2\t', mse(model_2_v2_update.predict(xg_test), y_test))  # "after"
Output:
full train       17.8364309709
model 1          24.2542132108
model 2          25.6967017352
model 1+2        22.8846455135
model 1+update2  14.2816257268
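As far as I understand the XGBoost parameter documentation, process_type='update' with the refresh updater keeps the structure of the existing trees and recomputes their statistics (and, with refresh_leaf, their leaf values) from the data passed in, instead of growing new trees. The toy sketch below mimics that idea for one tree with one fixed split. The dict layout, the function names and the plain-mean leaf rule are assumptions for illustration, not XGBoost's implementation.

```python
# Toy illustration: "refreshing" the leaf values of a fixed tree structure.
# A tree here is a dict with one split threshold and two leaf values.

def predict_tree(tree, x):
    return tree["left"] if x <= tree["threshold"] else tree["right"]


def refresh_leaves(tree, xs, ys):
    """Keep the split; recompute each leaf as the mean target of the rows reaching it."""
    left = [y for x, y in zip(xs, ys) if x <= tree["threshold"]]
    right = [y for x, y in zip(xs, ys) if x > tree["threshold"]]
    return {
        "threshold": tree["threshold"],  # structure is untouched
        # A leaf that receives no rows keeps its old value.
        "left": sum(left) / len(left) if left else tree["left"],
        "right": sum(right) / len(right) if right else tree["right"],
    }


# Tree learned on an earlier batch (values made up for the example).
old_tree = {"threshold": 5.0, "left": 1.0, "right": 9.0}

# New batch whose targets have drifted.
new_xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
new_ys = [2.0, 2.0, 2.0, 12.0, 12.0, 12.0]

new_tree = refresh_leaves(old_tree, new_xs, new_ys)
print(new_tree)  # {'threshold': 5.0, 'left': 2.0, 'right': 12.0}
```

The split point learned on the first batch survives, while the predicted values now reflect the second batch. In this simplified picture, that is the difference between the "model 1+2" and "model 1+update2" runs.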
Mob*_*tal 10
It looks like you don't need anything other than calling xgb.train(....) again, but passing in the model from the previous batch:
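# python
import xgboost as xgb

params = {}  # your params here
ith_batch = 0
n_batches = 100
model = None
while ith_batch < n_batches:
    # getBatchData is a placeholder for your own batch loader;
    # it should return an xgb.DMatrix for the given batch
    d_train = getBatchData(ith_batch)
    # continue training from the model built on the previous batches
    model = xgb.train(params, d_train, xgb_model=model)
    ith_batch += 1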
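This is based on https://xgboost.readthedocs.io/en/latest/python/python_api.html
I created a gist of a Jupyter notebook to demonstrate that an xgboost model can be trained incrementally. I used the Boston dataset to train the model. I ran 3 experiments: one-shot learning, iterative one-shot learning, and iterative incremental learning. In the incremental training, I passed the Boston data to the model in batches of size 50.
The gist of the gist is that you have to iterate over the data multiple times for the model to converge to the accuracy attained by one-shot (all data) learning.
Here is the corresponding code for iterative incremental learning with xgboost.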
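The loop above leaves getBatchData undefined. One possible way to produce batches from a file that does not fit in memory is to stream it row by row and group the rows into fixed-size chunks. The sketch below uses only the standard library; the helper name iter_batches, the CSV layout (label in the last column) and the in-memory stand-in file are assumptions for illustration.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size, label_index=-1):
    """Yield (features, labels) lists holding at most `batch_size` rows each."""
    rows = iter(rows)
    while True:
        # Pull the next chunk; only this chunk is held in memory.
        chunk = list(islice(rows, batch_size))
        if not chunk:
            return
        labels = [float(row[label_index]) for row in chunk]
        features = [
            [float(v) for i, v in enumerate(row) if i != label_index % len(row)]
            for row in chunk
        ]
        yield features, labels


# Stand-in for a large CSV file on disk; csv.reader streams it row by row.
fake_file = io.StringIO("1,2,10\n3,4,20\n5,6,30\n7,8,40\n9,10,50\n")
batches = list(iter_batches(csv.reader(fake_file), batch_size=2))

print(len(batches))  # 3
print(batches[0])    # ([[1.0, 2.0], [3.0, 4.0]], [10.0, 20.0])
```

In a real run you would iterate over the generator instead of collecting it into a list, convert each (features, labels) pair to an array, wrap it in xgb.DMatrix, and pass it to xgb.train together with xgb_model=model as in the loop above.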
import sklearn.metrics
import xgboost as xgb

# x_tr, y_tr, x_te, y_te: train/test arrays prepared beforehand
batch_size = 50
iterations = 25
model = None
for i in range(iterations):
    # one pass over the training data, batch by batch
    for start in range(0, len(x_tr), batch_size):
        model = xgb.train({
            'learning_rate': 0.007,
            'updater': 'refresh',
            'process_type': 'update',
            'refresh_leaf': True,
            #'reg_lambda': 3,  # L2
            'reg_alpha': 3,  # L1
            'silent': False,
        }, dtrain=xgb.DMatrix(x_tr[start:start+batch_size], y_tr[start:start+batch_size]), xgb_model=model)

        y_pr = model.predict(xgb.DMatrix(x_te))
        #print('    MSE itr@{}: {}'.format(int(start/batch_size), sklearn.metrics.mean_squared_error(y_te, y_pr)))
    print('MSE itr@{}: {}'.format(i, sklearn.metrics.mean_squared_error(y_te, y_pr)))

y_pr = model.predict(xgb.DMatrix(x_te))
print('MSE at the end: {}'.format(sklearn.metrics.mean_squared_error(y_te, y_pr)))
XGBoost version: 0.6