How to set weights for multi-class classification with imbalanced data in XGBoost?

Question

Abh*_*jan 5 xgboost multiclass-classification

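I know you can set scale_pos_weight for an imbalanced dataset. However, how do you handle a multi-class problem on an imbalanced dataset? I have gone through https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost/18823, but I don't quite understand how to set the weight parameter in DMatrix.

Could someone explain this in detail?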

Answer 1

小智 5

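For imbalanced datasets, I pass per-sample weights to XGBoost: an array with one weight per training row, assigned according to the class that row belongs to.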

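import numpy as np

def CreateBalancedSampleWeights(y_train, largest_class_weight_coef):
    # Sorted class labels and the number of samples in each class
    classes, class_samples = np.unique(y_train, return_counts=True)
    total_samples = class_samples.sum()
    n_classes = len(classes)
    # Balanced weights: rarer classes get proportionally larger weights
    weights = total_samples / (n_classes * class_samples * 1.0)
    class_weight_dict = {key: value for (key, value) in zip(classes, weights)}
    # Scale the weight of the most frequent class by the given coefficient
    largest_class = classes[np.argmax(class_samples)]
    class_weight_dict[largest_class] *= largest_class_weight_coef
    # One weight per training sample, looked up by its class label
    sample_weights = [class_weight_dict[y] for y in y_train]
    return sample_weights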
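
To make the weighting formula concrete, here is a small sketch with made-up class counts (75, 20 and 5 samples; these numbers are illustrative and not from the question). Each class gets the weight total / (n_classes * count), so rarer classes receive larger weights and every class ends up contributing the same total weight. The extra coefficient for the largest class is left out here.

```python
import numpy as np

# Hypothetical label vector: 75 samples of class 0, 20 of class 1, 5 of class 2.
y = np.repeat([0, 1, 2], [75, 20, 5])

classes, counts = np.unique(y, return_counts=True)
class_weights = counts.sum() / (len(classes) * counts)  # total / (n_classes * count)

# Map each sample to the weight of its class.
sample_weights = class_weights[np.searchsorted(classes, y)]

# Total weight carried by each class; these come out equal.
per_class_total = counts * class_weights

for c, n, w in zip(classes, counts, class_weights):
    print(f"class {c}: {n} samples, weight {w:.3f}")
```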

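Just pass the target column and the occurrence rate of the most frequent class to CreateBalancedSampleWeights (for example, 0.75 if the most frequent class accounts for 75 out of 100 samples):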

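    from xgboost import XGBClassifier

    # Share of the most frequent class, e.g. 0.75 for 75 out of 100 samples
    largest_class_weight_coef = df_copy['Category'].value_counts(normalize=True).max()

    # Pass y_train as a NumPy array of class labels
    weight = CreateBalancedSampleWeights(y_train, largest_class_weight_coef)

    # Sample weights are passed to fit(), not to the constructor.
    # X_train is assumed to be the feature matrix matching y_train.
    xg = XGBClassifier(n_estimators=1000, max_depth=20)
    xg.fit(X_train, y_train, sample_weight=weight)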

That's it :)

Archived: 8 years, 1 month ago
Views: 11,116
Last activity: 2 years, 7 months ago