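I am running several text classification experiments, and now I need to compute the AUC-ROC for each task. For binary classification, I already have this working with the following code: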
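from sklearn import linear_model, model_selection
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import auc, roc_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
import matplotlib.pyplot as plt

scaler = StandardScaler(with_mean=False)
enc = LabelEncoder()
y = enc.fit_transform(labels)
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', scaler),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])
# instances is a list of dictionaries
# Use the positive-class probability rather than hard labels,
# so that the ROC curve is built from more than one threshold
y_score = model_selection.cross_val_predict(
    pipe, instances, y, cv=10, method='predict_proba')[:, 1]

# Visualisation of ROC-AUC
fpr, tpr, thresholds = roc_curve(y, y_score)
roc_auc = auc(fpr, tpr)  # separate name, so the imported auc() is not overwritten
print('auc =', roc_auc)
plt.figure()
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',
         label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.2])
plt.ylim([-0.1, 1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()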
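But now I need to do the same for a multiclass classification task. I read somewhere that I need to binarize the labels, but I really don't see how to compute ROC for multiclass classification. Any tips?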
python roc scikit-learn text-classification multiclass-classification
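One possible way to get a single multiclass number, sketched here under assumptions: let `cross_val_predict` return class probabilities and pass them to `roc_auc_score` with `multi_class='ovr'`, which handles the one-vs-rest binarization internally. The toy `labels` and `instances`, the `tok_` feature names, `k=3`, and `cv=3` are hypothetical stand-ins so that the snippet runs on its own; they are not the asker's data or settings.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: 30 instances, 3 classes, one telltale token per class
labels = [('teens', 'twenties', 'thirties')[i % 3] for i in range(30)]
instances = [{'tok_' + lab: 1, 'common': 1} for lab in labels]

y = LabelEncoder().fit_transform(labels)

pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', SelectKBest(mutual_info_classif, k=3)),
                 ('logistregress', LogisticRegression())])

# One row per instance, one probability column per class,
# each row predicted by a model that never saw that instance
proba = cross_val_predict(pipe, instances, y, cv=3, method='predict_proba')

# Macro-averaged one-vs-rest AUC over the three classes
macro_auc = roc_auc_score(y, proba, multi_class='ovr', average='macro')
print('macro one-vs-rest AUC =', macro_auc)
```

With real data, `k=200` and `cv=10` from the question can be put back; the toy values are only small because the toy feature space is small.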
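For a text classification project (age), I am building a subset of my data. I have made 3 lists of file names, sorted by age group. I want to shuffle each of these lists and then append 5000 file names from each shuffled list to a new list. The result should be a data subset of 15000 files (5000 10s, 5000 20s, 5000 30s). Below you can see what I have written so far. But I know that random.shuffle returns None, and a NoneType object is not iterable. How can I solve this?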
def seed():
    return 0.47231099848

teens = [list of files]
tweens = [list of files]
thirthies = [list of files]
data = []
for categorie in random.shuffle([teens, tweens, thirthies], seed):
    data.append(teens[:5000])
    data.append(tweens[:5000])
    data.append(thirthies[:5000])
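A minimal sketch of one way around the `None` return value: shuffle a copy of each list in place with a seeded `random.Random` instance, then slice the copy. The helper name `sample_subset`, its `per_group` parameter, and the generated file names are hypothetical; the seed value is the one from the question. Note that the second argument to `random.shuffle` used in the question was removed in Python 3.11, which is another reason to seed a generator object instead.

```python
import random

SEED = 0.47231099848  # seed value taken from the question


def sample_subset(groups, per_group, seed=SEED):
    """Return a flat list with `per_group` randomly chosen items from each group."""
    rng = random.Random(seed)      # private, reproducible generator
    data = []
    for files in groups:
        shuffled = list(files)     # copy, so the original list keeps its order
        rng.shuffle(shuffled)      # shuffles in place and returns None
        data.extend(shuffled[:per_group])  # extend keeps the result flat
    return data


# Hypothetical placeholder file names; the real lists hold many more entries
teens = ['teen_%d.txt' % i for i in range(20)]
tweens = ['tween_%d.txt' % i for i in range(20)]
thirthies = ['thirty_%d.txt' % i for i in range(20)]

data = sample_subset([teens, tweens, thirthies], per_group=5)
print(len(data))
```

For the real data, `per_group=5000` gives the 15000-file subset. Using `extend` instead of `append` means `data` holds file names directly rather than three nested lists.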
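I am using scikit-learn for text classification experiments. Now I would like to get the names of the best-performing, selected features. I tried some answers to similar questions, but nothing works. The last lines of code are an example of what I tried. For example, when I print feature_names, I get this error: sklearn.exceptions.NotFittedError: This SelectKBest instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
Is there any solution?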
scaler = StandardScaler(with_mean=False)
enc = LabelEncoder()
y = enc.fit_transform(labels)
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])
feature_names = pipe.named_steps['mutual_info']
X.columns[features.transform(np.arange(len(X.columns)))]
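I want to evaluate my classification model with a ROC curve. I am struggling to compute multiclass ROC curves for a cross-validated dataset. Because of the cross-validation, there is no fixed split into a training set and a test set. Below you can see the code I have already tried.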
scaler = StandardScaler(with_mean=False)
enc = LabelEncoder()
y = enc.fit_transform(labels)
vec = DictVectorizer()
feat_sel = SelectKBest(mutual_info_classif, k=200)
n_classes = 3
# Pipeline for computing of ROC curves
clf = OneVsRestClassifier(LogisticRegression(solver='newton-cg', multi_class='multinomial'))
clf = clf.label_binarizer_
pipe = Pipeline([('vectorizer', vec),
                 ('scaler', scaler),
                 ('Logreg', clf),
                 ('mutual_info',feat_sel)])
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Plot …
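A few things in the attempt above look likely to fail: `label_binarizer_` only exists after `OneVsRestClassifier` has been fitted, the classifier sits before the feature selector in the pipeline, and `y` is one-dimensional, so `y[:, i]` cannot be indexed. A sketch of one way to get per-class curves, under assumptions: put the classifier last, request out-of-fold probabilities, and binarize the true labels with `label_binarize`. The toy data, `k=3`, and `cv=3` are hypothetical, and a plain `LogisticRegression` is swapped in for the one-vs-rest wrapper.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize

# Hypothetical toy data: 30 instances, 3 classes, one telltale token per class
labels = [('teens', 'twenties', 'thirties')[i % 3] for i in range(30)]
instances = [{'tok_' + lab: 1, 'common': 1} for lab in labels]

y = LabelEncoder().fit_transform(labels)
n_classes = 3

# The classifier has to be the final step of the pipeline
pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', SelectKBest(mutual_info_classif, k=3)),
                 ('Logreg', LogisticRegression())])

# Out-of-fold class probabilities, shape (n_samples, n_classes)
y_score = cross_val_predict(pipe, instances, y, cv=3, method='predict_proba')

# One indicator column per class, shape (n_samples, n_classes)
y_bin = label_binarize(y, classes=np.arange(n_classes))

fpr, tpr, roc_auc = {}, {}, {}
for i in range(n_classes):
    # Class i against the rest
    fpr[i], tpr[i], _ = roc_curve(y_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

print(roc_auc)
```

Each `fpr[i]`, `tpr[i]` pair can then be passed to `plt.plot` to draw one curve per class.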