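I want to use catboost, the project Yandex recently released to the open-source community. However, my project uses Python 3, and I understand that Python 3 is banned within Yandex. Does catboost support Python 3?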
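I want to keep track of the categorical feature indices inside an sklearn pipeline so that I can pass them to CatBoostClassifier.
Before the pipeline's fit() I start with a known set of categorical features. The pipeline itself changes the structure of the data and drops features in the feature selection step.
How can I know in advance which categorical features will be removed or added by the pipeline? I need the updated list of indices when I call fit(). The problem is that my dataset may change after the transformations.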
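One way to think about this, as a minimal sketch rather than a confirmed answer: the indices can only be known once the earlier pipeline steps have been fitted, so they can be recomputed from the column names that survive selection. The two helpers below are hypothetical names, and the boolean mask is assumed to have the shape that a fitted sklearn selector returns from `get_support()`.

```python
def surviving_columns(column_names, support_mask):
    """Keep the names whose mask entry is True.

    support_mask is assumed to look like the boolean mask returned by a
    fitted sklearn selector's get_support().
    """
    if len(column_names) != len(support_mask):
        raise ValueError("column_names and support_mask must have equal length")
    return [name for name, keep in zip(column_names, support_mask) if keep]


def categorical_indices(column_names, categorical_names):
    """Return the positions of the categorical columns that are still present."""
    wanted = set(categorical_names)
    return [i for i, name in enumerate(column_names) if name in wanted]


# Hypothetical example: the selection step dropped the constant 'gender' column.
columns_in = ["pet", "gender", "children", "salary"]
mask = [True, False, True, True]

columns_out = surviving_columns(columns_in, mask)
cat_idx = categorical_indices(columns_out, ["pet", "gender"])
print(columns_out, cat_idx)
```

The resulting `cat_idx` is what would be handed to CatBoost as `cat_features`. In practice this means fitting the preprocessing steps first and the classifier afterwards, instead of a single fit() call on the whole pipeline.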
Here is an example of my data frame:
import numpy as np
import pandas as pd

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', np.nan, 'dog', 'cat', 'fish'],
'children': [4., 6, 3, np.nan, 2, 3, 5, 4],
'salary': [90., 24, np.nan, 27, 32, 59, 36, 27],
'gender': ['male', 'male', 'male', 'male', 'male', 'male', 'male', 'male'],
'happy': [0, 1, 1, 0, 1, 1, 0, 0]})
categorical_features = ['pet', 'gender']
numerical_features = ['children', 'salary']
target = 'happy'
print(data)
pet children salary gender happy
0 cat 4.0 90.0 male …
I installed CatBoost successfully with
pip install catboost
But when I tried the example Python script in a Jupyter Notebook, I got these errors:
import numpy as np
from catboost import CatBoostClassifier
ImportError: No module named '_catboost'
ImportError: DLL load failed: ?? ?????? ????????? ??????.
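Link to the CatBoost website: https://catboost.yandex/
I am trying to fit a binary classification model with CatBoost. With the code below I expected verbose=False to suppress the per-iteration log, but it does not. Is there a way to avoid printing the iterations?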
model=CatBoostClassifier(iterations=300, depth=6, learning_rate=0.1,
loss_function='Logloss',
rsm = 0.95,
border_count = 64,
eval_metric = 'AUC',
l2_leaf_reg= 3.5,
one_hot_max_size=30,
use_best_model = True,
verbose=False,
random_seed = 502)
model.fit(X_train, y_train,
eval_set=(X_test_filtered, y_test_num),
verbose = False,
plot=True)
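A possible workaround, sketched under assumptions: CatBoost also exposes a `logging_level` parameter, and to my knowledge it accepts only one of its logging-related options (`verbose`, `silent`, `logging_level`) at a time. The helper name `with_silent_logging` is hypothetical. It strips any existing logging options from a parameter dictionary and sets `logging_level="Silent"`.

```python
# Logging-related option names that should not be combined in one call.
LOGGING_KEYS = ("verbose", "silent", "logging_level", "verbose_eval")


def with_silent_logging(params):
    """Return a copy of params carrying exactly one logging option."""
    cleaned = {key: value for key, value in params.items() if key not in LOGGING_KEYS}
    cleaned["logging_level"] = "Silent"
    return cleaned


# Parameter values copied from the question above.
params = {
    "iterations": 300,
    "depth": 6,
    "learning_rate": 0.1,
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "verbose": False,
    "random_seed": 502,
}
quiet = with_silent_logging(params)
print(quiet)

# Assumed usage with the objects from the question:
# model = CatBoostClassifier(**quiet)
# model.fit(X_train, y_train, eval_set=(X_test_filtered, y_test_num))
```

Note that `plot=True` draws its own interactive chart in a notebook, which is separate from the text log and may need to be dropped as well if no output at all is wanted.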
I need to install catboost, but pip install catboost does not work.
Anaconda does not provide a catboost package, so pip is the only option.
The error message is:
Could not find a version that satisfies the requirement catboost <for version: >
No matching distribution found for catboost.
The Python version is 3.6.3.
I tried:
pip install catboost==0.12.2
pip install catboost==0.12.1.1
pip install catboost==0.12.1
pip install catboost==0.12.0
and
pip install catboost==0.11.0
pip install catboost==0.10.2
None of these worked.
Why does this happen, and is there another way to install catboost?
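"No matching distribution found" usually means pip found no prebuilt wheel that fits the running interpreter. One common cause, which is an assumption here and not confirmed by the question, is a 32-bit Python, since catboost wheels target 64-bit interpreters. A small stdlib-only check, with the hypothetical helper name `interpreter_summary`:

```python
import platform
import struct


def interpreter_summary():
    """Collect the interpreter facts pip uses when matching wheels."""
    return {
        "python": platform.python_version(),
        # Pointer size in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
        "bits": struct.calcsize("P") * 8,
        "machine": platform.machine(),
        "implementation": platform.python_implementation(),
    }


info = interpreter_summary()
print(info)
if info["bits"] != 64:
    print("This is a 32-bit interpreter; a 64-bit Python may be needed for catboost.")
```

If the interpreter turns out to be 64-bit, an outdated pip or an unsupported Python version would be the next things to check.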
I am running a CatBoost model in Python, which is quite simple. Basically:
from catboost import CatBoostClassifier, Pool, cv
catboost_model = CatBoostClassifier(
cat_features=["categorical_variable_1", "categorical_variable_2"],
loss_function="Logloss",
eval_metric="AUC",
iterations=200,
)
Now I want to look at the feature importances. With the XGBoost classifier, I could build a dataframe of feature importances like this:
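from datetime import datetime
import pandas as pd

# xgb_model is assumed to be a trained xgboost Booster
importances = xgb_model.get_fscore()
feat_list = []
date = datetime.today()
for feature, importance in importances.items():
    # one row per feature: timestamp, feature name, importance score
    feat_list.append([date, feature, importance])
feat_df = pd.DataFrame(feat_list, columns=['date', 'feature', 'importance'])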
Now I want to do the same thing with CatBoost. I started with:
catboost_model.get_feature_importance(
Pool(X_train, y_train, cat_features=["categorical_variable_1", "categorical_variable_2"]))
But I do not know how to go on from here (it should be simple, but I am lost). Can anyone help?
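A minimal sketch of the missing step, assuming `get_feature_importance` returns one score per feature in the model's feature order and that the names are available from the fitted model's `feature_names_` attribute. The helper name `importance_rows` is hypothetical. It builds the same [date, feature, importance] rows as the XGBoost snippet.

```python
from datetime import datetime


def importance_rows(feature_names, importances, date=None):
    """Pair names with scores and return rows sorted by importance, largest first."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have equal length")
    date = date or datetime.today()
    rows = [[date, name, float(score)] for name, score in zip(feature_names, importances)]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Assumed usage with the fitted model and Pool from the question:
# importances = catboost_model.get_feature_importance(
#     Pool(X_train, y_train, cat_features=["categorical_variable_1", "categorical_variable_2"]))
# feat_df = pd.DataFrame(importance_rows(catboost_model.feature_names_, importances),
#                        columns=['date', 'feature', 'importance'])
```

Keeping the row-building step separate from the CatBoost call makes it easy to check on plain lists before wiring it to a real model.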
Similar question:
Catboost tutorial
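I have a binary classification problem. After modelling, I have the model's test predictions y_pred and the true test labels y_true.
I want a custom evaluation metric defined by the following equation:
profit = 400 * truePositive - 200*falseNegative - 100*falsePositive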
Also, since higher profit is better, I want to maximize this function, not minimize it.
How do I get this eval_metric in catboost?
import sklearn.metrics

def get_profit(y_true, y_pred):
    # confusion_matrix(...).ravel() yields tn, fp, fn, tp for binary labels
    tn, fp, fn, tp = sklearn.metrics.confusion_matrix(y_true, y_pred).ravel()
    profit = 400*tp - 200*fn - 100*fp
    return profit

scoring = sklearn.metrics.make_scorer(get_profit, greater_is_better=True)
How can a custom eval metric be implemented in catboost?
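For reference, a sketch of what such a metric could look like, assuming CatBoost's custom-metric interface of `is_max_optimal`, `evaluate(approxes, target, weight)` and `get_final_error`. The class name `ProfitMetricSketch` is hypothetical. It assumes a Logloss model, where `approxes[0]` holds raw scores and a score above 0 corresponds to a predicted probability above 0.5. Sample weights are ignored.

```python
class ProfitMetricSketch:
    """profit = 400*TP - 200*FN - 100*FP, to be maximized."""

    def is_max_optimal(self):
        # Higher profit is better.
        return True

    def evaluate(self, approxes, target, weight):
        # approxes: list with one container of raw scores for binary classification.
        assert len(approxes) == 1
        approx = approxes[0]
        assert len(target) == len(approx)

        profit = 0.0
        for i in range(len(approx)):
            predicted_positive = approx[i] > 0.0  # raw score > 0 <=> probability > 0.5
            actual_positive = target[i] > 0.5
            if predicted_positive and actual_positive:
                profit += 400.0  # true positive
            elif actual_positive:
                profit -= 200.0  # false negative
            elif predicted_positive:
                profit -= 100.0  # false positive
        # Return (error_sum, weight_sum); the weight is a dummy value here.
        return profit, 1.0

    def get_final_error(self, error, weight):
        # Report the total profit unchanged.
        return error


# Assumed usage:
# model = CatBoostClassifier(loss_function="Logloss", eval_metric=ProfitMetricSketch())
```

Returning the total from `get_final_error` without dividing by the weight keeps the reported value in the same units as the profit equation.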
My update so far:
class ProfitMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)
    def is_max_optimal(self):
        return True
    def …
I use this code to test CatBoostClassifier.
import numpy as np
from catboost import CatBoostClassifier, Pool
# initialize data
train_data = np.random.randint(0, 100, size=(100, 10))
train_labels = np.random.randint(0, 2, size=(100))
test_data = Pool(train_data, train_labels) #What is Pool?When to use Pool?
# test_data = np.random.randint(0,100, size=(20, 10)) #Usually we will use numpy array,will not use Pool
model = CatBoostClassifier(iterations=2,
depth=2,
learning_rate=1,
loss_function='Logloss',
verbose=True)
# train the model
model.fit(train_data, train_labels)
# make the prediction using the resulting model
preds_class = model.predict(test_data)
preds_proba = model.predict_proba(test_data)
print("class = ", …
I am using:
python: 3.12
OS: Windows 11 Home
I tried to install catboost==1.2.2
and got this error:
C:\Windows\System32>py -3 -m pip install catboost==1.2.2
Collecting catboost==1.2.2
  Downloading catboost-1.2.2.tar.gz (60.1 MB)
     ---------------------------------------- 60.1/60.1 MB 5.1 MB/s eta 0:00:00
  Installing build dependencies ... error
  error: subprocess-exited-with-error

  × pip subprocess to install build dependencies did not run successfully.
  │ exit code: 1
  ╰─> [135 lines of output]
      Collecting setuptools>=64.0
        Using cached setuptools-68.2.2-py3-none-any.whl (807 kB)
      Collecting wheel
        Using cached wheel-0.41.3-py3-none-any.whl (65 kB)
      Collecting jupyterlab
        Downloading jupyterlab-4.0.8-py3-none-any.whl (9.2 MB)
           ---------------------------------------- 9.2/9.2 …
Here is my attempt at applying BayesSearch to CatBoost:
from catboost import CatBoostClassifier
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold
# Classifier
bayes_cv_tuner = BayesSearchCV(
estimator = CatBoostClassifier(
silent=True
),
search_spaces = {
'depth':(2,16),
'l2_leaf_reg':(1, 500),
'bagging_temperature':(1e-9, 1000, 'log-uniform'),
'border_count':(1,255),
'rsm':(0.01, 1.0, 'uniform'),
'random_strength':(1e-9, 10, 'log-uniform'),
'scale_pos_weight':(0.01, 1.0, 'uniform'),
},
scoring = 'roc_auc',
cv = StratifiedKFold(
n_splits=2,
shuffle=True,
random_state=72
),
n_jobs = 1,
n_iter = 100,
verbose = 1,
refit = True,
random_state = 72
)
Tracking the results:
def status_print(optim_result):
    """Status callback during bayesian …