Why does my SVM's performance drop after scaling the training and test data?

Question


P.C*_*.C. 3 python machine-learning svm scikit-learn text-classification

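I'm using scikit-learn for sentiment analysis of text. My features right now are just word frequency counts.

When I do the following, the average F-measure is about 59%: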

from sklearn import svm
# 'auto' is the spelling used by older scikit-learn releases; newer ones call it 'balanced'.
clf = svm.LinearSVC(class_weight='auto')
clf.fit(Xfeatures, YLabels)
# ...
predictedLabels = clf.predict(XTestFeatures)


But when I scale my feature vectors with StandardScaler(), the average F-measure drops to 49%:

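from sklearn import svm
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # not shown in the original snippet; assumed from the text
clf = svm.LinearSVC(class_weight='auto')
Xfeatures = scaler.fit_transform(Xfeatures)
clf.fit(Xfeatures, YLabels)
# ...
XTestFeatures = scaler.transform(XTestFeatures)
predictedLabels = clf.predict(XTestFeatures)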


Scaling is supposed to improve SVM performance, but here it seems to hurt it. Why does this happen? How can I do it right?

Answer 1

Fre*_*Foo 5

Scaling by mean and variance is not a good strategy for term frequencies. Suppose you have two term histograms over three terms (call them 0, 1, 2):

>>> import numpy as np
>>> X = np.array([[100, 10, 50], [1, 0, 2]], dtype=np.float64)


and you scale them; then you get

>>> from sklearn.preprocessing import scale
>>> scale(X)
array([[ 1.,  1.,  1.],
       [-1., -1., -1.]])


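The scaling has made it impossible to tell that term 2 occurs more frequently than term 0 in X[1]. In fact, you can no longer even tell that term 1 does not occur in X[1] at all.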
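As a quick check that is not part of the original answer, the same result can be reproduced by hand. The sketch below assumes only NumPy and standardizes each column (each term) with its own mean and standard deviation; with just two rows, every column collapses to +1 and -1 regardless of the original counts.

```python
import numpy as np

X = np.array([[100, 10, 50], [1, 0, 2]], dtype=np.float64)

# Standardization works column by column (one column per term):
# subtract the column mean, then divide by the column standard deviation.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_scaled = (X - col_mean) / col_std

# In the raw counts, term 2 is the most frequent term in X[1] and term 1 is absent.
print(X[1])
# After scaling, all three entries of X[1] are identical, so both facts are lost.
print(X_scaled[1])
```

Each column is rescaled independently of the others, which is why the comparison between terms inside one document does not survive.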

Of course, this is a very extreme example, but similar effects occur in larger sets. What you should do instead is normalize the histograms:

>>> from sklearn.preprocessing import normalize
>>> normalize(X)
array([[ 0.89087081,  0.08908708,  0.4454354 ],
       [ 0.4472136 ,  0.        ,  0.89442719]])


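This preserves the relative frequencies of the terms, which is what you're interested in; a linear sentiment classifier cares about whether there are more positive terms than negative ones, not about the actual frequencies or a scaled variant of them.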
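Applied to a workflow like the one in the question, this could look roughly like the sketch below. It is only an illustration: the toy documents, labels, and variable names are hypothetical, and Normalizer is used here as the pipeline-friendly counterpart of normalize(X).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, for illustration only.
train_docs = [
    "good great good film",
    "great acting, good plot",
    "bad awful film",
    "awful plot, bad bad acting",
]
train_labels = [1, 1, 0, 0]
test_docs = ["good good good good great film", "bad film"]

# Normalizer rescales each document (row) to unit L2 norm; unlike
# StandardScaler, it never mixes information across documents.
model = make_pipeline(CountVectorizer(), Normalizer(norm="l2"), LinearSVC())
model.fit(train_docs, train_labels)
predicted = model.predict(test_docs)

# Inspect the normalized rows that the classifier actually sees.
counts = model.named_steps["countvectorizer"].transform(test_docs)
unit_rows = model.named_steps["normalizer"].transform(counts).toarray()
row_norms = np.linalg.norm(unit_rows, axis=1)
print(predicted, row_norms)
```

If the count matrix already exists, as Xfeatures does in the question, Normalizer().transform(Xfeatures) on both the training and test matrices should be enough; Normalizer is stateless, so nothing is learned from the training set.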

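(Scaling is recommended for domains where the scale of individual features doesn't really matter, typically because the features are measured in different units.)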
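A small sketch of that situation, with made-up measurements (the heights and weights below are hypothetical): the same data is expressed once with height in centimetres and once in metres, and standardization gives the same features either way.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical measurements: column 0 is height, column 1 is weight in kg.
heights_cm = np.array([150.0, 165.0, 180.0, 195.0])
weights_kg = np.array([50.0, 62.0, 80.0, 91.0])

X_cm = np.column_stack([heights_cm, weights_kg])
X_m = np.column_stack([heights_cm / 100.0, weights_kg])  # same data, height in metres

# After standardization the arbitrary choice of unit no longer matters.
same = np.allclose(scale(X_cm), scale(X_m))
print(same)
```

Here the raw magnitude of a column carries no meaning of its own, which is the opposite of term counts, where a larger count really does mean a more frequent term.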

Archived:	11 years, 1 month ago
Views:	2831
Last recorded:	11 years, 1 month ago