Dee*_*kar 5 python machine-learning scikit-learn
Update
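I am working on machine-learning text classification with an SVC using a linear kernel. All of the code works except the last line, print (svm_model_linear.predict_proba(test)). I am building a classifier with three classes: cycling, football and badminton. I have some people's Facebook statuses labelled with these classes, and I trained and tested the classifier using train_test_split. I now have some unlabelled statuses that I want to classify, but the last line of code gives me an error.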
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 700)
X = cv.fit_transform(corpus).toarray()
print X
y = dataset.iloc[:, 1].values
print y
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
from sklearn.svm import SVC
svm_model_linear = SVC(kernel = 'linear', C = 1, probability = True).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)
# model accuracy for X_test
accuracy = svm_model_linear.score(X_test, y_test)
# creating a confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, svm_predictions)
Classification of the unlabelled data starts here
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

data = pd.read_csv('sentence.csv', delimiter = '\t', quoting = 3)
test = []
for j in range(0, 5):
    # keep letters only, lowercase, tokenize, then drop stopwords and stem
    review = re.sub('[^a-zA-Z]', ' ', data['Sentence'][j])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    test.append(review)
pred = cv.fit_transform(test).toarray()
print (svm_model_linear.predict_proba(test))
Error
print (svm_model_linear.predict_proba(test))
Traceback (most recent call last):
  File "<ipython-input-7-5fa676a0fc00>", line 1, in <module>
    print (svm_model_linear.predict_proba(test))
  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 594, in _predict_proba
    X = self._validate_for_predict(X)
  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/svm/base.py", line 439, in _validate_for_predict
    X = check_array(X, accept_sparse='csr', dtype=np.float64, order="C")
  File "/home/letsperf/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 402, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: X.shape[1] = 15 should be equal to 700, the number of features at training time
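Scikit-learn estimators do not work on strings, only on numeric data. Your training step succeeded because you used CountVectorizer to convert the corpus from strings to numbers. You have not done the same for the test data.
You need to call cv.transform(test) on the test data so that it has the same form as the X used to train the model. Only then will the prediction succeed and make sense.
Also make sure you use the same cv object that converted the original training corpus into numeric form.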
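The flow described above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the training statuses, labels and new statuses are made up, the names train_corpus, train_labels and new_statuses are hypothetical, and the stemming and stopword step from the question is skipped for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical labelled statuses (made-up data, four per class).
train_corpus = [
    "cycling up the hill this morning",
    "bought a new bike for cycling",
    "long cycling ride on the weekend",
    "cycling race through the city",
    "football match tonight",
    "scored a goal in football",
    "football league final today",
    "watching football with friends",
    "badminton doubles at the club",
    "new badminton racket and shuttlecock",
    "badminton tournament this evening",
    "played badminton after work",
]
train_labels = ["cycling"] * 4 + ["football"] * 4 + ["badminton"] * 4

# Fit the vectorizer ONCE, on the training corpus only.
cv = CountVectorizer(max_features=700)
X_train = cv.fit_transform(train_corpus)

clf = SVC(kernel="linear", C=1, probability=True, random_state=0)
clf.fit(X_train, train_labels)

# Hypothetical unlabelled statuses: reuse the SAME fitted vectorizer
# and call transform(), not fit_transform().
new_statuses = ["cycling to work today", "football game with friends"]
X_new = cv.transform(new_statuses)

# Pass the transformed matrix, not the raw strings, to the classifier.
proba = clf.predict_proba(X_new)
print(clf.classes_)
print(proba)
```

Because X_new is produced by the vectorizer that was fitted on the training corpus, it has the same number of columns as X_train, which is what the classifier checks at prediction time. With so few samples the probabilities themselves are not meaningful; only the shapes matter here.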
Update:
You should not call fit_transform() on the test data; always call only transform(), as I suggested above. What you are currently doing is:
pred = cv.fit_transform(test).toarray()
This discards the previously learned vocabulary and re-fits the count vectorizer, which changes the shape of pred. Change it to:
pred = cv.transform(test).toarray()
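To see why re-fitting changes the number of columns, here is a toy, pure-Python stand-in for a count vectorizer. TinyCountVectorizer is a hypothetical class written only for this illustration; it is not scikit-learn's implementation (it splits on whitespace and has no max_features), but it follows the same fit/transform split.

```python
class TinyCountVectorizer:
    """Toy stand-in for a count vectorizer (illustration only)."""

    def fit(self, docs):
        # Learn the vocabulary: one column per distinct word.
        words = sorted({w for doc in docs for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count words using the ALREADY learned vocabulary;
        # words never seen during fit() are ignored.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                idx = self.vocabulary_.get(w)
                if idx is not None:
                    row[idx] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


train_docs = ["football match tonight", "badminton match today", "cycling today"]
new_docs = ["football tonight", "tennis match"]

cv = TinyCountVectorizer()
X_train = cv.fit_transform(train_docs)

# Correct: reuse the fitted vocabulary, so the column count matches X_train.
good = cv.transform(new_docs)

# Wrong: fitting on the new documents learns a different vocabulary.
# (A fresh instance is used here so that cv stays intact; calling
# cv.fit_transform(new_docs) would overwrite cv's vocabulary the same way.)
bad = TinyCountVectorizer().fit_transform(new_docs)

print(len(X_train[0]), len(good[0]), len(bad[0]))
```

In the question the same thing happens at a larger scale: the five new sentences contain only 15 distinct terms, so the re-fitted vectorizer produces 15 columns instead of the 700 the model was trained on.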