fin*_*ity 4 python machine-learning svm scikit-learn
I am trying to build a simple SVM document classifier with scikit-learn, using the following code:
import os
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cross_validation
from sklearn.datasets import load_svmlight_file
clf=svm.SVC()
path="C:\\Python27"
f1=[]
f2=[]
data2=['omg this is not a ship lol']
f=open(path+'\\mydata\\ACQ\\acqtot','r')
f=f.read()
f1=f.split(';',1085)
for i in range(0,1086):
    f2.append('acq')
f1.append('shipping ship')
f2.append('crude')
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1)
counter = CountVectorizer(min_df=1)
x_train=vectorizer.fit_transform(f1)
x_test=vectorizer.fit_transform(data2)
num_sample,num_features=x_train.shape
test_sample,test_features=x_test.shape
print("#samples: %d, #features: %d" % (num_sample, num_features)) #samples: 5, #features: 25
print("#samples: %d, #features: %d" % (test_sample, test_features))#samples: 2, #features: 37
y=['acq','crude']
#print x_test.n_features
clf.fit(x_train,f2)
#den= clf.score(x_test,y)
clf.predict(x_test)
It gives the following error:
(n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 6 should be equal to 9451, the number of features at training time
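What I don't understand is why it expects the number of features to be the same. If I give the machine completely new text data to predict on, it is obviously not possible for every document to have the same number of features as the data used to train it. In this case, do we have to explicitly set the number of features of the test data to 9451?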
emi*_*ara 14
To make sure you have the same feature representation, you should not fit_transform your test data, but only transform it.
x_train=vectorizer.fit_transform(f1)
x_test=vectorizer.transform(data2)
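To see why this works, here is a minimal sketch on a toy corpus. The names train_docs and test_docs and their contents are made up for illustration and stand in for f1 and data2 from the question. fit_transform learns the vocabulary from the training documents, and transform maps new documents onto that same vocabulary, so both matrices end up with the same number of columns. Words that were never seen during training are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, standing in for f1 and data2 in the question.
train_docs = ["the company acquired shares", "crude oil shipping ship"]
test_docs = ["omg this is not a ship lol"]

vectorizer = TfidfVectorizer(min_df=1)

# fit_transform learns the vocabulary (one column per training word).
x_train = vectorizer.fit_transform(train_docs)

# transform reuses the learned vocabulary; unseen words are dropped.
x_test = vectorizer.transform(test_docs)

# Both matrices have the same number of columns.
print(x_train.shape, x_test.shape)
```

Calling fit_transform on the test data instead would build a second, unrelated vocabulary from the test documents alone, which is what produces the shape mismatch in the error above.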
A similar transformation into a consistent representation should be applied to your labels.
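One possible reading of this, sketched under the assumption that the labels are encoded with scikit-learn's LabelEncoder (the answer does not name a specific tool): fit the encoder on the training labels and only transform the test labels, so that each class name maps to the same integer in both sets. The label lists below are hypothetical. Note that SVC also accepts string labels directly, so this step is optional for the code in the question.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists, standing in for f2 and y in the question.
train_labels = ["acq", "acq", "crude"]
test_labels = ["crude", "acq"]

encoder = LabelEncoder()

# fit_transform learns the mapping from class name to integer.
y_train = encoder.fit_transform(train_labels)

# transform reuses that mapping for the test labels.
y_test = encoder.transform(test_labels)

print(encoder.classes_, y_train, y_test)
```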