Multi-class, multi-label, ordinal classification with sklearn

Lem*_*mor 6 python ordinal scikit-learn multilabel-classification multiclass-classification

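I was wondering how to run a multi-class, multi-label, ordinal classification with sklearn. I want to predict a ranking of target groups, ranging from the group that is most prevalent at a certain location (1) to the one that is least prevalent (7). I don't seem to be able to get it right. Could you please help me out?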


# Random Forest Classification

# Import
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Import dataset
dataset = pd.read_excel('alle_probs_edit.v2.xlsx')
X = dataset.iloc[:,4:-1].values
Y = dataset.iloc[:,-1].values

# Split in Train and Test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42 )

# Scale the features (put all variables on the same scale); whether this is necessary depends on the chosen method
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# Create classifier
classifier =  RandomForestClassifier(criterion = 'entropy')

# Choose some parameter combinations to try
parameters = {'bootstrap': [True, False],
 'max_depth': [50],
 'max_features': ['sqrt'],  # 'auto' was removed in scikit-learn 1.3; for classifiers it was an alias of 'sqrt'
 'min_samples_leaf': [1, 2, 3, 4],
 'min_samples_split': [9, 10, 11, 12, 13],
 'n_estimators': [500,1000,1500]}

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(classifier, parameters, scoring=acc_scorer, cv = 3, n_jobs = -1)
grid_obj = grid_obj.fit(X_train, Y_train)

# Set the classifier to the best combination of parameters
classifier = grid_obj.best_estimator_

# Fit the best algorithm to the data
classifier.fit(X_train, Y_train)

# Predict the test data
Y_pred = classifier.predict(X_test)

#Confusion Matrix
cm = pd.DataFrame(confusion_matrix(Y_test, Y_pred))

#Accuracy
accuracy1 = accuracy_score(Y_test, Y_pred)
print("Accuracy1: %.2f%%" % (accuracy1 * 100.0))

# k-Fold Cross Validation
accuracy1 = cross_val_score(estimator = classifier, X = X_train, y = Y_train, cv = 10)
kfold = accuracy1.mean()
accuracy1.std()

K_7*_*K_7 6

This may not be the precise answer you are looking for, but this article outlines a technique as follows:

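"We can take advantage of the ordered class value by transforming a k-class ordinal regression problem to a k-1 binary classification problem, we convert an ordinal attribute A* with ordinal value V1, V2, V3, … Vk into k-1 binary attributes, one for each of the original attribute's first k − 1 values. The ith binary attribute represents the test A* > Vi"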

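Essentially, you aggregate multiple binary classifiers (predict target > 1, target > 2, target > 3, target > 4) to be able to predict whether a target is 1, 2, 3, 4 or 5. The author creates an OrdinalClassifier class that stores the binary classifiers in a Python dictionary.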
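To make the decomposition concrete, here is a minimal sketch of how ordinal labels could be turned into the k-1 binary targets described above. The label values and the variable names (`y`, `classes`, `binary_targets`) are illustrative assumptions, not taken from the article.

```python
import numpy as np

# Hypothetical ordinal labels with k = 5 ordered classes (1 < 2 < 3 < 4 < 5)
y = np.array([1, 3, 2, 5, 4, 1])

# Sorted unique class values V1 .. Vk
classes = np.sort(np.unique(y))

# One binary column per threshold V1 .. V(k-1); column i answers "is y > V(i+1)?"
binary_targets = np.column_stack([(y > v).astype(int) for v in classes[:-1]])

# Because the columns are nested (y > 3 implies y > 2, and so on),
# the row sum is the position of the label within `classes`.
recovered = classes[binary_targets.sum(axis=1)]
```

Each column of `binary_targets` is the training target for one binary classifier, so a label such as 3 becomes the row `[1, 1, 0, 0]`.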

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score


class OrdinalClassifier():

    def __init__(self, clf):
        self.clf = clf
        self.clfs = {}

    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0]-1):
                # for each k - 1 ordinal value we fit a binary classification problem
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                clf.fit(X, binary_y)
                self.clfs[i] = clf

    def predict_proba(self, X):
        clfs_predict = {k: self.clfs[k].predict_proba(X) for k in self.clfs}
        predicted = []
        for i, y in enumerate(self.unique_class):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[i][:,1])
            elif i in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                predicted.append(clfs_predict[i-1][:,1] - clfs_predict[i][:,1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[i-1][:,1])
        return np.vstack(predicted).T

    def predict(self, X):
        # note: returns positions in self.unique_class (0 .. k-1), not the original labels
        return np.argmax(self.predict_proba(X), axis=1)

    def score(self, X, y, sample_weight=None):
        # map y to class positions so it is comparable with the output of predict
        _, indexed_y = np.unique(y, return_inverse=True)
        return accuracy_score(indexed_y, self.predict(X), sample_weight=sample_weight)

This technique originates from A Simple Approach to Ordinal Classification.
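For a quick end-to-end check, the following is a compact, function-based sketch of the same idea. It is not the article author's class: the helper names (`fit_ordinal`, `predict_proba_ordinal`, `predict_ordinal`), the synthetic data, and the choice of `LogisticRegression` as the base estimator are all assumptions made for illustration. Unlike `predict` in the class above, `predict_ordinal` maps the argmax back to the original labels.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


def fit_ordinal(base_clf, X, y):
    """Fit one binary classifier per threshold V1 .. V(k-1) (hypothetical helper)."""
    classes = np.sort(np.unique(y))
    clfs = []
    for threshold in classes[:-1]:
        clf = clone(base_clf)
        clf.fit(X, (y > threshold).astype(int))
        clfs.append(clf)
    return classes, clfs


def predict_proba_ordinal(clfs, X):
    """Turn the k-1 estimates of Pr(y > Vi) into k class probabilities."""
    # Column i holds Pr(y > V(i+1)) from the i-th binary classifier
    greater = np.column_stack([clf.predict_proba(X)[:, 1] for clf in clfs])
    n = X.shape[0]
    # Pad with a leading column of ones and a trailing column of zeros (Pr(y > Vk) = 0)
    padded = np.hstack([np.ones((n, 1)), greater, np.zeros((n, 1))])
    # Adjacent differences give 1 - Pr(y > V1), Pr(y > V(i-1)) - Pr(y > Vi), Pr(y > V(k-1))
    return padded[:, :-1] - padded[:, 1:]


def predict_ordinal(classes, clfs, X):
    """Return the original class label with the highest estimated probability."""
    return classes[np.argmax(predict_proba_ordinal(clfs, X), axis=1)]


# Synthetic data (assumption): an ordinal target 1..4 driven by the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.digitize(X[:, 0] + 0.3 * rng.normal(size=300), bins=[-0.8, 0.0, 0.8]) + 1

classes, clfs = fit_ordinal(LogisticRegression(), X, y)
proba = predict_proba_ordinal(clfs, X)
y_pred = predict_ordinal(classes, clfs, X)
```

Because the differences telescope, each row of `proba` sums to 1, but the binary classifiers are fitted independently, so an individual entry can come out slightly negative when their estimates are not monotone.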
