Sta*_*rák 5 python numpy machine-learning pandas scikit-learn
I have a dataset with 122 columns that looks like this:
train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0 0 0 0 0 1
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0 0 0 0 0 0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0 0 0 0 0 0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 255 255 255 255 65535 255
4 100007 0 Cash loans M N Y 0 121500.0
I have imputed all the NaNs and now want to use CatBoost as follows:
import numpy as np
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Get variables for a model
x = train.drop(["TARGET"], axis=1)
y = train["TARGET"]
# Do train data splitting
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
cat_features = np.where(x.dtypes != float)[0]
cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42, use_best_model=True, eval_metric='Accuracy', loss_function='Logloss')
cat.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)
pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is :{:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))
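This raises:
CatBoostError: Invalid type for cat_feature[534,6]=118975.5 : cat_features must be integer or string, real number values and NaN values should be converted to string.
All NaNs have been imputed as mentioned (I checked), and the code is meant to select only the non-float columns as cat_features.
Can someone please help me solve this mystery?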
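You are trying to use a column with dtype float as a categorical column. To fix the error, convert it to int (np.int was removed in NumPy 1.24, so use the builtin int):
train["a"] = train["a"].astype(int)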
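However, in your case 118975.5 does not look like a valid category, so you may want to double-check whether that column should be treated as categorical at all.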
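One possible way a float column can end up in cat_features, sketched here as an assumption because the question does not show train.dtypes: only float64 compares equal to the builtin float, so a check like x.dtypes != float treats float32 columns (and every integer column) as categorical. The toy frame below borrows column names from the question with made-up values and picks categorical columns by excluding everything numeric instead.

```python
import numpy as np
import pandas as pd

# Toy frame: column names borrowed from the question, values made up.
df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "CNT_CHILDREN": np.array([0, 1, 2], dtype="uint8"),
    "AMT_INCOME_TOTAL": np.array([202500.0, 67500.0, 118975.5], dtype="float32"),
    "AMT_CREDIT": np.array([406597.5, 135000.0, 312682.5], dtype="float64"),
})

# Only float64 compares equal to the builtin float; float32 does not.
float32_is_float = np.dtype("float32") == float
float64_is_float = np.dtype("float64") == float

# Treat only non-numeric columns as categorical. Whether integer flag columns
# should also be categorical is a modelling decision left to the reader.
cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
float_cols = [c for c in df.columns if pd.api.types.is_float_dtype(df[c])]

print(float32_is_float, float64_is_float)
print(cat_cols)
print(float_cols)
```

The resulting list of column names could then be passed as cat_features, since the answer's own example passes cat_features by name.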
Here is a small example that reproduces the error, along with the fix:
from catboost import CatBoostRegressor
import numpy as np
import pandas as pd

train_data = [[1, 4],
              [4.0, 5]]
train = pd.DataFrame(train_data, columns=["a", "b"])
# train["a"] = train["a"].astype(int)  # Uncommenting this line fixes the "Invalid type for cat_feature" error
train_labels = [10, 20]
model = CatBoostRegressor(iterations=2,
                          cat_features=["a"])
model.fit(train, train_labels)
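If a float column really is meant to be categorical, there are two conversions to choose from: int, as in the answer, or string, as the error message itself suggests. The pandas-only sketch below reuses the toy data from the example above and shows what each conversion produces. Note that casting to int truncates fractional values (118975.5 would become 118975), which can merge distinct categories, while casting to str keeps them distinct.

```python
import pandas as pd

# Same toy data as the reproduction: the 4.0 makes column "a" float64.
train = pd.DataFrame([[1, 4], [4.0, 5]], columns=["a", "b"])
original_dtype = str(train["a"].dtype)

# Option 1: cast to int (truncates any fractional part).
as_int = train["a"].astype(int)

# Option 2: cast to string (keeps every distinct float value distinct).
as_str = train["a"].astype(str)

print(original_dtype)
print(list(as_int))
print(list(as_str))
```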
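This is not exactly a solution, but I think 'cat_feature[534,6]=118975.5' is telling you that the problem is in the seventh column (index 6, counting from zero).
I am facing a similar problem right now.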
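To follow up on that hint, one could look up which column and value the indices point at. This is a minimal sketch that assumes the pair in the message is (row position, column position), as the answer above reads it. describe_cell is a hypothetical helper and the toy frame is made up. On the real data the call would be something like describe_cell(X_train, 534, 6).

```python
import pandas as pd

def describe_cell(frame, row, col):
    """Hypothetical helper: (column name, dtype, value) at a positional index."""
    name = frame.columns[col]
    return name, str(frame[name].dtype), frame.iloc[row, col]

# Made-up stand-in for X_train.
toy = pd.DataFrame({"a": [1, 2], "b": [0.5, 118975.5]})

name, dtype, value = describe_cell(toy, 1, 1)
print(name, dtype, value)
```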