The role of class_weight in the loss functions of LinearSVC and LogisticRegression

Question

JRu*_*Run 5 svm scikit-learn logistic-regression

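I am trying to figure out what exactly the loss function formula is, and how I can compute it by hand when class_weight='auto', for svm.SVC, svm.LinearSVC and linear_model.LogisticRegression.

For balanced data, say you have a trained classifier: clf_c. The logistic loss should be (am I correct?):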

import numpy as np

def logistic_loss(x, y, w, b, b0):
    '''
    x: nxp data matrix where n is number of data points and p is number of features.
    y: nx1 vector of true labels (-1 or 1).
    w: nx1 vector of weights (vector of 1./n for balanced data).
    b: px1 vector of feature weights.
    b0: intercept.
    '''
    s = y
    if 0 in np.unique(y):
        # Labels were given as (0, 1): map them to (-1, 1).
        print('yes')
        s = 2. * y - 1
    l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
    return l


I realized that LogisticRegression's predict_log_proba() gives you exactly that when the data is balanced:

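b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y)) / len(y)
# Mean negative log-probability of the true class, compared with the hand-computed loss.
# np.isclose is used because exact float equality can fail on rounding.
np.isclose(-(clf_c.predict_log_proba(x)[np.arange(len(x)), np.floor((y+1)/2).astype(np.int8)]).mean(), logistic_loss(x, y, w, b, b0))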


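Note that np.floor((y+1)/2).astype(np.int8) simply maps y = (-1, 1) to y = (0, 1).

But this does not work when the data is imbalanced.

Moreover, you would expect the classifier (here, LogisticRegression) to perform similarly, in terms of loss function value, when the data is balanced and class_weight=None as when the data is imbalanced and class_weight='auto'. I need a way to compute the loss function (without the regularization term) in both cases and compare them.

In short, what exactly does class_weight = 'auto' mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?

Any help is much appreciated. I tried going through the source code, but I am not a programmer and I am stuck. Thanks a lot in advance.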

Answer 1

ldi*_*rer 6

`class_weight` heuristics

I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, namely:

class_weight = {-1 : (y == 1).sum() / (y == -1).sum(), 
                1 : 1.}


If we normalize it so that the weights sum to 1, it is the same as your second proposition.
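As a quick check of that claim, here is a minimal sketch that normalizes both propositions on a made-up label vector (the 20/80 split is only an assumption for illustration) and compares the results:

```python
import numpy as np

# Hypothetical imbalanced labels: 20 negatives, 80 positives.
y = np.array([-1] * 20 + [1] * 80)
n_neg, n_pos = (y == -1).sum(), (y == 1).sum()

# First proposition: ratio of class counts for -1, and 1 for +1.
first = np.array([n_pos / n_neg, 1.0])
# Second proposition: inverse class counts.
second = np.array([1.0 / n_neg, 1.0 / n_pos])

# Rescale each pair so that the two weights sum to 1.
first_normalized = first / first.sum()
second_normalized = second / second.sum()

print(first_normalized)
print(second_normalized)
```

Both normalized vectors put 0.8 on the rare class and 0.2 on the frequent one, so the two propositions differ only by a constant factor.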

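Anyway, to understand what class_weight="auto" does, see this question: what is the difference between class weight = none and auto in svm scikit learn.

I am copying it here for later comparison:

This means that each class you have (in classes) gets a weight equal to 1 divided by the number of times that class appears in your data (y), so classes that appear more often will get lower weights. This is then further divided by the mean of all the inverse class frequencies.

Note how this is not completely obvious ;).

This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.

The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.

From the docs:

The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

np.bincount(y) is an array whose element i is the count of samples of class i.

Here is a bit of code to compare the two: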

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight

n_classes = 3
n_samples = 1000

X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10, 
    n_classes=n_classes, weights=[0.05, 0.4, 0.55])

print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)

print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))


Output:

Count of samples per class:  [ 57 396 547]
Balanced weights:  [ 5.84795322  0.84175084  0.60938452]
'auto' weights:  [ 2.40356854  0.3459682   0.25046327]

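The two sets of weights can also be reproduced by hand from the class counts alone. The sketch below assumes the counts printed above and follows the two descriptions literally; it does not call scikit-learn, so it is a reading of the quoted definitions rather than the library's own code:

```python
import numpy as np

# Class counts taken from the output above.
counts = np.array([57, 396, 547])
n_samples = counts.sum()
n_classes = len(counts)

# 'balanced': n_samples / (n_classes * np.bincount(y))
balanced = n_samples / (n_classes * counts)

# 'auto', following the quoted description: inverse class counts,
# divided by the mean of those inverse counts.
inverse_counts = 1.0 / counts
auto = inverse_counts / inverse_counts.mean()

print("Balanced weights: ", balanced)
print("'auto' weights: ", auto)
# The ratio is the same for every class.
print("Ratio: ", balanced / auto)
```

The ratio is constant across classes, so the two heuristics rank and scale the classes identically and differ only in overall magnitude, which acts like a change in C.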

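Loss function

Now the real question is: how are these weights used to train the classifier?

Unfortunately, I do not have a thorough answer here.

For SVC and LinearSVC the docstring is pretty clear:

Set the parameter C of class i to class_weight[i]*C for SVC.

So a higher weight means less regularization for that class and a higher incentive for the SVM to classify it correctly.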
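To make the per-class C concrete, here is a minimal sketch of the data term of a soft-margin SVM with the plain hinge loss. The function name, the toy decision values and the choice of hinge rather than squared hinge are assumptions for illustration; this is not the libsvm or liblinear code.

```python
import numpy as np

def weighted_hinge_loss(scores, y, class_weight, C=1.0):
    """Data term of a soft-margin SVM with a per-class C (no regularization term).

    scores: decision values f(x_i), shape (n,)
    y: labels in {-1, 1}, shape (n,)
    class_weight: dict mapping each label to its weight
    """
    # Each sample gets C scaled by the weight of its own class.
    per_sample_C = C * np.array([class_weight[int(label)] for label in y])
    hinge = np.maximum(0.0, 1.0 - y * scores)
    return np.sum(per_sample_C * hinge)

# Hypothetical decision values and labels, for illustration only.
scores = np.array([0.5, -2.0, 0.2, -0.3])
y = np.array([1, -1, -1, 1])

print(weighted_hinge_loss(scores, y, {-1: 1.0, 1: 1.0}))
print(weighted_hinge_loss(scores, y, {-1: 1.0, 1: 5.0}))
```

Raising the weight of class +1 makes margin violations on that class more expensive, which is the incentive described above.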

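I do not know how they work with logistic regression. I will try to look into it, but most of the code is in liblinear or libsvm, and I am not too familiar with those.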
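One plausible reading, by analogy with the per-class C above, is that each sample's logistic loss term is multiplied by the weight of its class. The sketch below computes the loss under that assumption; it has not been checked against the liblinear source, and the function name and toy data are hypothetical.

```python
import numpy as np

def weighted_logistic_loss(x, y, b, b0, class_weight):
    """Class-weighted logistic loss, without the regularization term.

    x: (n, p) data matrix; y: (n,) labels in {-1, 1};
    b: (p,) feature weights; b0: intercept;
    class_weight: dict mapping each label to its weight.
    """
    sample_weight = np.array([class_weight[int(label)] for label in y])
    margins = y * (x @ b + b0)
    # log(1 + exp(-m)), computed stably with logaddexp.
    return np.sum(sample_weight * np.logaddexp(0.0, -margins))

# Hypothetical toy data: three samples of class -1, one of class +1.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, -1, 1])
b, b0 = np.array([1.0]), -1.5

# 'balanced' weights: n_samples / (n_classes * count of the class).
balanced = {-1: 4 / (2 * 3), 1: 4 / (2 * 1)}

print(weighted_logistic_loss(x, y, b, b0, {-1: 1.0, 1: 1.0}))
print(weighted_logistic_loss(x, y, b, b0, balanced))
```

With unit weights this reduces to the unweighted sum of log(1 + exp(-y f(x))); dividing by the sum of the sample weights would give a weighted mean comparable to the 1/n weighting used in the question.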

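Note, however, that the weights in class_weight do not directly influence methods such as predict_proba. They change its output only because the classifier was trained to optimize a different loss function.
I am not sure this is clear, so here is a snippet to explain what I mean (you need to run the first one for the imports and variable definitions):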

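from sklearn.linear_model import LogisticRegression

# On versions where "auto" has been removed, use class_weight="balanced" instead.
lr = LogisticRegression(class_weight="auto")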
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))

new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))

# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()

# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))

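The reason copying coef_ and intercept_ is enough can be seen in the binary case, where the predicted probability is just a sigmoid of the linear score. The sketch below is a hand-written version of that formula for a two-class model, with made-up data and parameters; it is not scikit-learn's implementation, and the multiclass case normalizes differently.

```python
import numpy as np

def binary_predict_proba(X, coef, intercept):
    """Probabilities [P(class 0), P(class 1)] of a binary logistic model."""
    scores = X @ coef + intercept
    p1 = 1.0 / (1.0 + np.exp(-scores))
    return np.column_stack([1.0 - p1, p1])

# Hypothetical data and parameters, for illustration only.
X = np.array([[0.0, 1.0], [2.0, -1.0]])
coef = np.array([0.5, -0.25])
intercept = 0.1

proba = binary_predict_proba(X, coef, intercept)
print(proba)
```

No class weight appears anywhere in this computation: the weights only matter through the values of coef and intercept found during fitting.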

Hope this helps.

