Scikit-learn OneVsRestClassifier: getting binary indicators of different sizes for true and predicted labels

bev*_*age 5 machine-learning python-3.x scikit-learn

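I am trying to get metrics for a classifier built with scikit-learn's OneVsRestClassifier for a multilabel classification problem. However, I can't get the metrics library to work, because the binary indicator arrays I'm trying to compare for the true and predicted labels have different sizes. The code is below, mostly taken from "use scikit-learn to classify into multiple categories":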

import numpy as np
import collections
import csv
import os
import sys
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
import sklearn.metrics as metrics

np.set_printoptions(threshold=sys.maxsize)

csv_read_args = ({'mode': 'rb'} if sys.version_info[0] < 3 else
                 {'mode': 'rt', 'newline': '', 'encoding': 'latin1'})

with open(os.path.abspath('somefilepath'), **csv_read_args) as myfile:
    reader = csv.reader(myfile)
    next(reader)
    a, b = [], []
    # feed generator expression into a zero-length deque to consume it
    generator = ((a.append(row[2]), b.append(row[1].split(";"))) for row in reader)
    collections.deque(generator, maxlen=0)

X_train = np.array(a)
y_train_text = b

with open(os.path.abspath('some filepath'), **csv_read_args) as myfile:
    reader = csv.reader(myfile)
    next(reader)
    c, d = [], []
    generator = ((c.append(row[2]), d.append(row[1].split(";"))) for row in reader)
    collections.deque(generator, maxlen=0)

X_test = np.array(c)

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

mlb = MultiLabelBinarizer()
true = mlb.fit_transform(d)

print(true.shape)
print(predicted.shape)

print(metrics.f1_score(true, predicted, average="micro"))

On the last line, I get an error: ValueError: Multi-label binary indicator input with different number of labels

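Why do my true and predicted indicators have different numbers of labels? Is it because my training dataset may contain labels that do not appear in the test dataset, or vice versa? If so, how do I account for that?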
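
To make the suspicion above concrete, here is a minimal sketch on toy data (the label lists are made up for illustration and are not from my CSV files). It suggests that fitting a second MultiLabelBinarizer on the test labels builds its columns from the test labels only, so the column count can differ from the one used in training. Reusing the already fitted binarizer, or passing a fixed `classes` list, appears to keep the shapes aligned; which of the two is appropriate depends on how labels missing from training should be scored.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels: "c" occurs only in training, "d" and "e" only in test.
y_train_text = [["a", "b"], ["c"], ["a"]]
y_test_text = [["a"], ["b", "d"], ["e"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train_text)          # columns: a, b, c

# What the code in the question does: refit a new binarizer on the test labels.
refit = MultiLabelBinarizer().fit_transform(y_test_text)   # columns: a, b, d, e
print(Y_train.shape, refit.shape)                  # different numbers of columns

# Option 1: reuse the fitted binarizer. Labels unseen in training are
# dropped (scikit-learn emits a warning, silenced here for brevity).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    Y_test = mlb.transform(y_test_text)
print(Y_test.shape)                                # same columns as Y_train

# Option 2: fix the full label set up front so both splits share columns.
all_classes = sorted({lab for row in y_train_text + y_test_text for lab in row})
mlb_all = MultiLabelBinarizer(classes=all_classes)
Y_train_all = mlb_all.fit_transform(y_train_text)
Y_test_all = mlb_all.transform(y_test_text)
print(Y_train_all.shape, Y_test_all.shape)
```

With option 1, the predictions from the pipeline and the binarized test labels share the training columns, so `metrics.f1_score` should accept them. With option 2, the columns for labels absent from training are all zeros in the training matrix, so the classifier has nothing to learn for them.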