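I want to use catboost, a project that Yandex recently released to the open-source community. However, my project uses Python 3. I know Python 3 is forbidden by the Yandex emperor. Does catboost support Python 3?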
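I want to keep track of the categorical feature indices inside a sklearn pipeline so that I can pass them to CatBoostClassifier.
I start with a set of categorical features before the pipeline's fit(). The pipeline itself changes the structure of the data and drops features in a feature selection step.
How can I know in advance which categorical features will be dropped from or added to the pipeline? I need the updated list of indices when I call fit(). The problem is that my dataset may change after the transformations.
Here is an example of my dataframe: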
import numpy as np
import pandas as pd

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', np.nan, 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, np.nan, 2, 3, 5, 4],
                     'salary': [90., 24, np.nan, 27, 32, 59, 36, 27],
                     'gender': ['male', 'male', 'male', 'male', 'male', 'male', 'male', 'male'],
                     'happy': [0, 1, 1, 0, 1, 1, 0, 0]})
categorical_features = ['pet', 'gender']
numerical_features = ['children', 'salary']
target = 'happy'
print(data)
pet children salary gender happy
0 cat 4.0 90.0 male …
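One way to approach the index-tracking question above is to remap the categorical columns through the boolean mask that a feature-selection step reports. The sketch below is a minimal illustration, not a confirmed solution. `remap_cat_indices` is a hypothetical helper, the mask is assumed to come from something like a fitted scikit-learn selector's `get_support()`, and the pipeline is assumed to keep the remaining columns in their original order.

```python
def remap_cat_indices(columns, support_mask, categorical):
    """Return (kept_column_names, categorical_indices) after a selection step.

    columns      -- column names going into the selector, in order
    support_mask -- one boolean per column, True if the column is kept
    categorical  -- names of the categorical columns before selection
    """
    if len(columns) != len(support_mask):
        raise ValueError("columns and support_mask must have the same length")
    # Names that survive the selection, in their original order.
    kept = [name for name, keep in zip(columns, support_mask) if keep]
    cat_set = set(categorical)
    # Positions of the surviving categorical columns in the reduced data.
    cat_idx = [i for i, name in enumerate(kept) if name in cat_set]
    return kept, cat_idx


columns = ['pet', 'children', 'salary', 'gender']
# Hypothetical mask: the selector dropped the constant 'gender' column.
mask = [True, True, True, False]
kept, cat_idx = remap_cat_indices(columns, mask, ['pet', 'gender'])
print(kept, cat_idx)
```

The resulting indices could then be passed as `cat_features` when fitting CatBoostClassifier on the transformed data.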
I installed CatBoost successfully:
pip install catboost
But when I try to run the sample Python script in Jupyter Notebook, I get an error:
import numpy as np
from catboost import CatBoostClassifier
ImportError: No module named '_catboost'
ImportError: DLL load failed: ?? ?????? ????????? ??????.
Link to the CatBoost website: https://catboost.yandex/
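I am trying to fit a binary classification model with CatBoost. With the code below, I expected verbose=False to suppress the iteration log, but it does not. Is there a way to avoid printing the iterations?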
model = CatBoostClassifier(iterations=300, depth=6, learning_rate=0.1,
                           loss_function='Logloss',
                           rsm=0.95,
                           border_count=64,
                           eval_metric='AUC',
                           l2_leaf_reg=3.5,
                           one_hot_max_size=30,
                           use_best_model=True,
                           verbose=False,
                           random_seed=502)
model.fit(X_train, y_train,
          eval_set=(X_test_filtered, y_test_num),
          verbose=False,
          plot=True)
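CatBoost exposes several overlapping logging switches (`verbose`, `silent`, `logging_level`, `verbose_eval`), and its documentation says to set only one of them. One variant worth trying is to drop `verbose` from both calls and pass `logging_level='Silent'` to the constructor. Whether this silences the output in the asker's version is an assumption, and `params` is just an illustrative name.

```python
# Constructor arguments from the question, with verbose=False replaced by
# logging_level='Silent' so that exactly one logging switch is set.
params = dict(
    iterations=300,
    depth=6,
    learning_rate=0.1,
    loss_function='Logloss',
    rsm=0.95,
    border_count=64,
    eval_metric='AUC',
    l2_leaf_reg=3.5,
    one_hot_max_size=30,
    use_best_model=True,
    logging_level='Silent',
    random_seed=502,
)

# Sanity check: only one of the mutually exclusive logging options is present.
logging_switches = {'verbose', 'silent', 'logging_level', 'verbose_eval'}
assert len(params.keys() & logging_switches) == 1

# Intended use (requires catboost and the asker's data; not executed here):
# model = CatBoostClassifier(**params)
# model.fit(X_train, y_train, eval_set=(X_test_filtered, y_test_num), plot=True)
```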
I need to install catboost, but pip install catboost does not work.
There is no catboost package in Anaconda, so pip is the only option.
The error message is:
Could not find a version that satisfies the requirement catboost <for version: >
No matching distribution found for catboost.
The Python version is 3.6.3.
I tried:
pip install catboost==0.12.2
pip install catboost==0.12.1.1
pip install catboost==0.12.1
pip install catboost==0.12.0
and
pip install catboost==0.11.0
pip install catboost==0.10.2
None of these worked.
Why does this problem occur, and is there another way to install catboost?
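pip reports "No matching distribution found" when none of the files published for a package fit the running interpreter. A frequent cause is a 32-bit Python or an outdated pip, though this is only an assumption here because the question does not show either. The snippet below uses only the standard library to print the facts that decide which wheels pip will accept. `interpreter_summary` is a hypothetical helper name.

```python
import platform
import struct
import sys


def interpreter_summary():
    """Collect the interpreter properties that wheel selection depends on."""
    # Pointer size in bytes times 8 gives the interpreter's bitness (32 or 64).
    bits = struct.calcsize("P") * 8
    return {
        "python_version": platform.python_version(),
        "bits": bits,
        "machine": platform.machine(),
        "platform": sys.platform,
    }


summary = interpreter_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
```

Running `python -m pip --version` alongside this shows which pip is paired with which interpreter.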
I am running a CatBoost model in Python. It is quite simple, basically:
from catboost import CatBoostClassifier, Pool, cv
catboost_model = CatBoostClassifier(
    cat_features=["categorical_variable_1", "categorical_variable_2"],
    loss_function="Logloss",
    eval_metric="AUC",
    iterations=200,
)
Now I want to look at the feature importances. With the XGBoost classifier, I can build a dataframe of feature importances like this:
from datetime import datetime
import pandas as pd

importances = xgb_model.get_fscore()
feat_list = []
date = datetime.today()
for feature, importance in importances.items():
    # one row per feature: [date, feature, importance]
    feat_list.append([date, feature, importance])
feat_df = pd.DataFrame(feat_list, columns=['date', 'feature', 'importance'])
Now I want to do the same with CatBoost. I started with:
catboost_model.get_feature_importance(
    Pool(X_train, y_train, cat_features=["categorical_variable_1", "categorical_variable_2"]))
But I don't know how to go on from here (it should be simple, but I'm lost). Can someone help me?
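A minimal sketch of the same table for CatBoost, assuming the fitted model exposes `feature_names_` and that `get_feature_importance()` returns one score per feature in that order. The helper `importance_frame`, the feature name `x3` and the sample numbers are hypothetical, used only so the snippet runs without a trained model.

```python
from datetime import datetime

import pandas as pd


def importance_frame(feature_names, importances, date=None):
    """Build a date/feature/importance table, most important feature first."""
    if len(feature_names) != len(importances):
        raise ValueError("need exactly one importance per feature name")
    date = date or datetime.today()
    df = pd.DataFrame({
        "date": date,  # scalar, broadcast to every row
        "feature": list(feature_names),
        "importance": [float(v) for v in importances],
    })
    return df.sort_values("importance", ascending=False).reset_index(drop=True)


# With a fitted model (requires catboost; not executed here):
# feat_df = importance_frame(
#     catboost_model.feature_names_,
#     catboost_model.get_feature_importance(Pool(X_train, y_train, cat_features=[...])),
# )

# Made-up numbers, for illustration only.
feat_df = importance_frame(
    ["categorical_variable_1", "categorical_variable_2", "x3"],
    [12.5, 60.0, 27.5],
)
print(feat_df)
```

`get_feature_importance` also accepts `prettified=True`, which should return the scores already paired with feature names, so the helper may be unnecessary if the date column is added afterwards.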
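Similar question:
Catboost tutorial
In this question I have a binary classification problem. After modeling, we have the test predictions y_pred, and we already have the true test labels y_true.
I want to get a custom evaluation metric defined by the following equation:
profit = 400 * truePositive - 200 * falseNegative - 100 * falsePositive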
Also, since a higher profit is better, I want to maximize this function rather than minimize it.
How can I get this eval_metric in catboost?
import sklearn.metrics

def get_profit(y_true, y_pred):
    # for binary labels, confusion_matrix(...).ravel() yields tn, fp, fn, tp
    tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()
    profit = 400 * tp - 200 * fn - 100 * fp
    return profit

scoring = sklearn.metrics.make_scorer(get_profit, greater_is_better=True)
How do I implement a custom eval metric in catboost?
My update so far:
class ProfitMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return True

    def …
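CatBoost's Python package accepts a user-defined metric object with three methods: `is_max_optimal`, `evaluate(approxes, target, weight)` and `get_final_error(error, weight)`. The sketch below fills in the profit formula under stated assumptions: `approxes[0]` holds raw scores (log-odds) for the positive class, so a score above 0 corresponds to a predicted probability above 0.5, and sample weights are ignored. It is an illustration of the interface, not the asker's finished class.

```python
class ProfitMetric:
    """Profit = 400*TP - 200*FN - 100*FP, to be maximized."""

    def is_max_optimal(self):
        # Higher profit is better.
        return True

    def evaluate(self, approxes, target, weight):
        # Binary classification: exactly one vector of raw scores is expected.
        assert len(approxes) == 1
        scores = approxes[0]
        tp = fp = fn = 0
        for i in range(len(scores)):
            # Assumption: raw score > 0 means predicted probability > 0.5.
            predicted = 1 if scores[i] > 0 else 0
            actual = int(target[i])
            if predicted == 1 and actual == 1:
                tp += 1
            elif predicted == 1 and actual == 0:
                fp += 1
            elif predicted == 0 and actual == 1:
                fn += 1
        profit = 400 * tp - 200 * fn - 100 * fp
        # Return (metric sum, weight sum); the weight sum is fixed at 1.0 here.
        return profit, 1.0

    def get_final_error(self, error, weight):
        # The total profit is already the final value, so no averaging is done.
        return error


# Standalone check with made-up scores: one TP, one FN, one FP, one TN.
metric = ProfitMetric()
profit, weight_sum = metric.evaluate([[2.0, -1.0, 3.0, -2.0]], [1, 1, 0, 0], None)
print(metric.get_final_error(profit, weight_sum))
```

Such an object would be passed as `CatBoostClassifier(eval_metric=ProfitMetric())`. It only changes what is reported and used for best-model selection; the training loss stays whatever `loss_function` specifies.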
I use this code to test CatBoostClassifier.
import numpy as np
from catboost import CatBoostClassifier, Pool
# initialize data
train_data = np.random.randint(0, 100, size=(100, 10))
train_labels = np.random.randint(0, 2, size=(100))
test_data = Pool(train_data, train_labels)  # What is Pool? When should Pool be used?
# test_data = np.random.randint(0, 100, size=(20, 10))  # Usually we use a NumPy array, not a Pool
model = CatBoostClassifier(iterations=2,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)
# train the model
model.fit(train_data, train_labels)
# make the prediction using the resulting model
preds_class = model.predict(test_data)
preds_proba = model.predict_proba(test_data)
print("class = ", …
I am using:
python: 3.12
OS: Windows 11 Home
I tried to install catboost==1.2.2
and I get this error:
C:\Windows\System32>py -3 -m pip install catboost==1.2.2
Collecting catboost==1.2.2
  Downloading catboost-1.2.2.tar.gz (60.1 MB)
     ---------------------------------------- 60.1/60.1 MB 5.1 MB/s eta 0:00:00
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [135 lines of output]
      Collecting setuptools>=64.0
        Using cached setuptools-68.2.2-py3-none-any.whl (807 kB)
      Collecting wheel
        Using cached wheel-0.41.3-py3-none-any.whl (65 kB)
      Collecting jupyterlab
        Downloading jupyterlab-4.0.8-py3-none-any.whl (9.2 MB)
           ---------------------------------------- 9.2/9.2 …
This is my attempt at applying BayesSearch to CatBoost:
from catboost import CatBoostClassifier
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold

# Classifier
bayes_cv_tuner = BayesSearchCV(
    estimator=CatBoostClassifier(
        silent=True
    ),
    search_spaces={
        'depth': (2, 16),
        'l2_leaf_reg': (1, 500),
        'bagging_temperature': (1e-9, 1000, 'log-uniform'),
        'border_count': (1, 255),
        'rsm': (0.01, 1.0, 'uniform'),
        'random_strength': (1e-9, 10, 'log-uniform'),
        'scale_pos_weight': (0.01, 1.0, 'uniform'),
    },
    scoring='roc_auc',
    cv=StratifiedKFold(
        n_splits=2,
        shuffle=True,
        random_state=72
    ),
    n_jobs=1,
    n_iter=100,
    verbose=1,
    refit=True,
    random_state=72
)
Tracking the results:
def status_print(optim_result):
    """Status callback during bayesian …
catboost ×10
python ×9
python-3.x ×4
scikit-learn ×3
pandas ×2
yandex ×2
anaconda ×1
bayesian ×1
package ×1
pip ×1
python-3.12 ×1
xgboost ×1