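Tag: boosting

Generating code for sklearn's GradientBoostingClassifier

I want to generate code (Python for now, but ultimately C) from a trained gradient boosting classifier (from sklearn). As I understand it, the model takes an initial predictor and then adds the predictions of sequentially trained regression trees, each scaled by the learning rate. The class chosen is the one with the highest output value.

Here is the code I have so far: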

def recursep_gbm(left, right, threshold, features, node, depth, value, out_name, scale):
    # Helpers for indentation
    tabs = lambda n: (' ' * n * 4)[:-1]
    def print_depth():
        if depth: print(tabs(depth), end=' ')
    def print_depth_b():
        if depth:
            print(tabs(depth), end=' ')
            if (depth-1): print(tabs(depth-1), end=' ')

    # A threshold of -2 marks a leaf node in sklearn's tree arrays
    if (threshold[node] != -2):
        print_depth()
        print("if " + features[node] + " <= " + str(threshold[node]) + ":")
        if left[node] != -1:
            recursep_gbm(left, right, threshold, features, left[node], depth+1, value, out_name, scale)
        print_depth()
        print("else:")
        if …

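The additive scheme described in the question can be sketched in plain Python. This is only an illustration under stated assumptions: the toy trees are stored in arrays that mimic the layout of sklearn's `tree_` attributes (`children_left`, `children_right`, `feature`, `threshold`, `value`), and the helper names `tree_value`, `raw_scores`, `predict` and `stump`, as well as every number, are made up. It assumes one regression tree per class per stage, as in the multiclass case; in the binary case sklearn keeps, as far as I know, a single score whose sign decides the class.

```python
LEAF = -1  # marker for "no child", as in sklearn's children_left / children_right

def tree_value(tree, x):
    """Walk one regression tree and return the value of the leaf that x reaches."""
    node = 0
    while tree["left"][node] != LEAF:
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["left"][node]
        else:
            node = tree["right"][node]
    return tree["value"][node]

def raw_scores(init, stages, learning_rate, x):
    """Start from the initial per-class scores and add the scaled tree outputs."""
    scores = list(init)
    for stage in stages:                  # one stage per boosting iteration
        for k, tree in enumerate(stage):  # one tree per class within a stage
            scores[k] += learning_rate * tree_value(tree, x)
    return scores

def predict(init, stages, learning_rate, x):
    """Return the index of the class with the highest raw score."""
    scores = raw_scores(init, stages, learning_rate, x)
    return max(range(len(scores)), key=scores.__getitem__)

def stump(feature, threshold, left_value, right_value):
    """Hypothetical depth-1 tree: node 0 is the split, nodes 1 and 2 are leaves."""
    return {
        "left": [1, LEAF, LEAF],
        "right": [2, LEAF, LEAF],
        "feature": [feature, -2, -2],
        "threshold": [threshold, -2.0, -2.0],
        "value": [0.0, left_value, right_value],
    }

init = [0.0, 0.0]  # made-up initial prediction for two classes
stages = [
    [stump(0, 1.5, 1.0, -1.0), stump(0, 1.5, -1.0, 1.0)],
    [stump(1, 0.5, 0.5, -0.5), stump(1, 0.5, -0.5, 0.5)],
]

print(predict(init, stages, 0.1, [1.0, 0.0]))
print(predict(init, stages, 0.1, [2.0, 1.0]))
```

A code generator would replace the `while` loop in `tree_value` with nested `if`/`else` statements per tree, but the accumulation in `raw_scores` stays the same, which makes it a convenient reference when checking generated C against the Python model.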

python machine-learning scikit-learn boosting

Pok*_*son

2017-05-23

5
Votes

1
Answers

2932
Views

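Predicting class probabilities with Gradient Boosting Trees in Spark using the tree output

As far as is known, GBT in Spark currently gives you only predicted labels.

I was thinking of trying to calculate the predicted probability for a class (say, over all the instances that fall under a certain leaf).

Code to build the GBT: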

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository

//Parsing the data
val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache() 
val test = splits(1)

// Train a GradientBoostedTrees model.
// The defaultParams for …

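One possible route to a probability, different from the per-leaf idea in the question, is to turn the model's raw margin into a probability with a logistic link. The sketch below is plain Python and rests on assumptions: the per-tree outputs and weights are invented numbers standing in for what `model.trees` and `model.treeWeights` would provide in MLlib, and the factor of 2 in the link reflects my understanding of how Spark defines its log loss, so it should be checked against the Spark version in use.

```python
import math

def gbt_margin(tree_predictions, tree_weights):
    """Weighted sum of the per-tree outputs for a single instance."""
    return sum(w * p for w, p in zip(tree_weights, tree_predictions))

def margin_to_probability(margin):
    """Logistic link with a factor of 2 (assumed to match Spark's log loss)."""
    return 1.0 / (1.0 + math.exp(-2.0 * margin))

# Hypothetical outputs of three trees for one instance, with their weights
tree_predictions = [0.8, -0.1, 0.3]
tree_weights = [1.0, 0.1, 0.1]

margin = gbt_margin(tree_predictions, tree_weights)
probability = margin_to_probability(margin)
label = 1.0 if margin > 0.0 else 0.0  # thresholding the margin at zero

print(margin, probability, label)
```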

tree probability prediction boosting apache-spark-mllib

PAR*_*DER

5
Votes

1
Answers

5118
Views

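Catboost: what are reasonable values for l2_leaf_reg?

Running catboost on a large dataset (~1M rows, 500 columns), I get: Training has stopped (degenerate solution on iteration 0, probably too small l2-regularization, try to increase it).

How do I guess what the l2 regularization value should be? Is it related to the mean of y, the number of variables, or the tree depth?

Thanks!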

machine-learning boosting catboost

Guy*_*ini

5
Votes

1
Answers

5108
Views

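Handling imbalanced data in GradientBoostingClassifier using weighted classes?

I have a highly imbalanced dataset on which I need to build a model for a classification problem. The dataset has about 30000 samples, of which about 1000 are labeled 1 and the rest 0. I build the model with the following lines:

from sklearn.ensemble import GradientBoostingClassifier

X_train = training_set
y_train = target_value
my_classifier = GradientBoostingClassifier(loss='deviance', learning_rate=0.005)
my_model = my_classifier.fit(X_train, y_train)

Since the data is imbalanced, simply building the model as in the code above is not right, so I tried to use class weights as follows:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

Now, I do not know how to use class_weights (which basically contains the values 0.5 and 9.10) to train and build the model with GradientBoostingClassifier.

Any ideas? How can I handle this imbalanced data using weighted classes or other techniques?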
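One way to apply per-class weights to an estimator that has no `class_weight` argument is to expand them into one weight per training row and pass those at fit time. The sketch below does the expansion in plain Python with the "balanced" formula `n_samples / (n_classes * count)`; the helper names and the tiny label list are hypothetical, and the final `fit` call is only shown as a comment because it needs scikit-learn and the asker's data.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

def per_sample_weights(y, class_weights):
    """Expand class weights into one weight per training row."""
    return [class_weights[label] for label in y]

# Hypothetical labels: 29 negatives and 1 positive
y_train = [0] * 29 + [1] * 1
class_weights = balanced_class_weights(y_train)
sample_weights = per_sample_weights(y_train, class_weights)

print(class_weights)

# With scikit-learn installed, the weights would then be passed at fit time:
# my_model = my_classifier.fit(X_train, y_train, sample_weight=sample_weights)
```

With these weights each class contributes the same total weight to the loss, which is the effect the "balanced" heuristic aims for.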

python machine-learning training-data scikit-learn boosting

Spe*_*edo

2019-06-08

5
Votes

1
Answers

5063
Views

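Feature importance "gain" in XGBoost

I want to understand how feature importance in xgboost is calculated by "gain". From https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7:

"Gain" is the improvement in accuracy that a feature brings to the branches it is on. The idea is that before adding a new split on a feature X to the branch there were some wrongly classified elements; after adding the split on this feature there are two new branches, and each of them is more accurate (one branch says that if your observation is on this branch it should be classified as 1, and the other branch says the exact opposite).

In scikit-learn, feature importance is calculated from the reduction in Gini impurity / information gain at each node after splitting on a variable, i.e. weighted impurity average of the node - weighted impurity average of the left child node - weighted impurity average of the right child node (see also: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting)

I wonder whether xgboost also uses this approach, based on information gain or accuracy, as stated in the quote above. I tried digging into the xgboost code and found this method (irrelevant parts already cut out):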

def get_score(self, fmap='', importance_type='gain'):
    trees = self.get_dump(fmap, with_stats=True)

    importance_type += '='
    fmap = {}
    gmap = {}
    for tree in trees:
        for line in tree.split('\n'):
            # look for the opening square bracket
            arr = line.split('[')
            # if no opening bracket (leaf node), ignore this line
            if len(arr) == 1:
                continue

            # look for the closing bracket, extract only info …

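The scikit-learn style impurity decrease mentioned in the question can be worked through by hand for a single split. The sketch below uses Gini impurity and made-up class counts; the function names are hypothetical. It says nothing about how xgboost computes its own gain, which is what the question is asking about.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity of the node minus the weighted impurities of its children."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    return (n_parent / n_total) * (
        gini(parent)
        - (n_left / n_parent) * gini(left)
        - (n_right / n_parent) * gini(right)
    )

# Made-up class counts: the split sends most of class 0 left and most of class 1 right
parent = [50, 50]
left = [40, 10]
right = [10, 40]
decrease = impurity_decrease(parent, left, right, n_total=100)

print(decrease)
```

Summing this quantity over every node that splits on a given feature, and normalising across features, gives the kind of impurity-based importance the question attributes to scikit-learn.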

python scikit-learn boosting xgboost information-gain

nel*_*lng

2019-08-06

5
Votes

1
Answers

5857
Views

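What is the use of base_score in xgboost's multiclass mode?

I am trying to explore how Xgboost works for binary classification as well as for multiclass. In the binary case, I observed that base_score is treated as the starting probability, and that it also has a major impact on the calculation of Gain and Cover.

In the multiclass case, I cannot figure out the significance of the base_score parameter, since I get the same Gain and Cover values for different (any) base_score values.

I also cannot work out why there is a factor of 2 when calculating Cover for multiclass, i.e. 2*p*(1-p).

Can someone help me with these two parts?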
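The two observations in the question can be reproduced with a few lines of arithmetic. This is a sketch under assumptions, not a description of xgboost internals: it assumes the same base_score is added to the raw score of every class before a softmax, and that Cover is the sum of per-row second-order gradients, taken here as `p*(1-p)` for the binary case and `2*p*(1-p)` for the multiclass case as quoted in the question. All function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def binary_hessian(p):
    """Second-order gradient of the logistic loss at probability p."""
    return p * (1.0 - p)

def multiclass_hessian(p):
    """Per-class second-order gradient with the factor of 2 from the question."""
    return 2.0 * p * (1.0 - p)

# The same constant added to every class cancels inside the softmax
probs_a = softmax([0.5, 0.5, 0.5])
probs_b = softmax([0.9, 0.9, 0.9])

# In the binary case the starting probability changes the hessian directly
h_half = binary_hessian(0.5)
h_high = binary_hessian(0.9)

print(probs_a, probs_b, h_half, h_high, multiclass_hessian(probs_a[0]))
```

Because the softmax is invariant to adding a constant to all scores, the starting probabilities are 1/K for any base_score under this assumption, which would explain identical Gain and Cover values in the multiclass case.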

statistics machine-learning boosting xgboost multiclass-classification

jay*_*hor

5
Votes

1
Answers

1237
Views

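How to use the "is_unbalance" and "scale_pos_weight" parameters in LightGBM for an imbalanced binary classification project (80:20)

I currently have an imbalanced dataset, as shown in the figure below:

I then use the "is_unbalance" parameter when training the LightGBM model, setting it to True. The figures below show how I used this parameter.

Example using the native API:

Example using the scikit-learn API:

My questions are:

Is the way I apply the is_unbalance parameter correct?

How do I use scale_pos_weight instead of is_unbalance?

Or should I balance the dataset with a SMOTE technique such as SMOTE-ENN or SMOTE+TOMEK?

Thanks!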
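A common heuristic for `scale_pos_weight` is the ratio of negative to positive samples, used in place of `is_unbalance` rather than together with it. The sketch below only builds the two parameter dictionaries; the class counts are hypothetical, the helper name is made up, and it assumes the positive class is the minority one.

```python
def scale_pos_weight_for(n_negative, n_positive):
    """Ratio of negatives to positives, a common starting point for scale_pos_weight."""
    return n_negative / n_positive

# Hypothetical 80:20 split with the positive class in the minority
n_negative, n_positive = 8000, 2000

# Option 1: let LightGBM rebalance automatically
params_is_unbalance = {"objective": "binary", "is_unbalance": True}

# Option 2: set the positive-class weight explicitly (do not combine with is_unbalance)
params_scale_pos_weight = {
    "objective": "binary",
    "scale_pos_weight": scale_pos_weight_for(n_negative, n_positive),
}

print(params_scale_pos_weight)

# With lightgbm installed, either dictionary would be passed to training, e.g.:
# booster = lgb.train(params_scale_pos_weight, train_set)
# clf = lgb.LGBMClassifier(**params_scale_pos_weight)
```

Treating the ratio as a starting value and tuning it against a validation metric such as F1 or PR-AUC is usually more informative than fixing it in advance.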

python classification boosting lightgbm imbalanced-data

Min*_*Lim

2022-04-13

4
Votes

1
Answers

10k
Views

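Formatting the numbers in XGBoost's plot_importance()

I trained an XGBoost model and used plot_importance() to plot the most important features of the trained model. However, the numbers in the plot have several decimal places, which clutter the plot and do not fit inside it.

I have searched for plot formatting options, but I only found how to format the axes (I tried formatting the X axis, hoping it would also format the corresponding values).

I am working in a Jupyter Notebook (if that makes any difference). The code is as follows:

import xgboost as xgb

xg_reg = xgb.XGBClassifier(
    objective = 'binary:logistic',
    colsample_bytree = 0.4,
    learning_rate = 0.01,
    max_depth = 15,
    alpha = 0.1,
    n_estimators = 5,
    subsample = 0.5,
    scale_pos_weight = 4
)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

ax = xgb.plot_importance(xg_reg, max_num_features=3, importance_type='gain', show_values=True)
fig = ax.figure
fig.set_size_inches(10, 3)

Am I missing something? Is there any formatting function or parameter to pass?

I would like to be able to format the feature importance scores, or at least drop the decimal part (e.g. "25" instead of "25.66521"). The current plot is attached below.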

xgboost_feature_importance_scores
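One workaround is to round the scores before plotting, since `plot_importance` also accepts a plain dictionary of feature scores. The sketch below shows only the rounding step in plain Python; the helper name and the sample scores are made up, and the xgboost calls appear as comments because they need the trained `xg_reg` model from the question.

```python
def rounded_scores(scores, ndigits=0):
    """Return a copy of the importance dictionary with rounded values."""
    return {feature: round(value, ndigits) for feature, value in scores.items()}

# Hypothetical importance scores as returned by a booster
raw = {"f0": 25.66521, "f1": 3.14159}
clean = rounded_scores(raw, ndigits=1)

print(clean)

# With xgboost installed, the same idea would look roughly like:
# scores = xg_reg.get_booster().get_score(importance_type='gain')
# ax = xgb.plot_importance(rounded_scores(scores, 1), max_num_features=3, show_values=True)
```

When a dictionary is passed, the plotted values are taken from it as given, so the importance type has to be chosen in the `get_score` call.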

python plot matplotlib boosting xgboost

Gie*_*itė

2019-05-11

3
Votes

1
Answers

9489
Views

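Can I use XGBoost to boost other models (e.g. Naive Bayes, Random Forest)?

I am working on a fraud analytics project and need some help. Previously, I used SAS Enterprise Miner to learn more about boosting/ensemble techniques, and I learned that boosting can help improve a model's performance.

Currently, my group has completed the following models in Python: Naive Bayes, Random Forest, and Neural Network. We want to use XGBoost to improve the F1 score. I am not sure whether this is possible, since I have only come across tutorials on how to run XGBoost or Naive Bayes on their own.

I am looking for a tutorial that shows how to create a Naive Bayes model and then apply boosting to it. After that, we could compare the metrics with and without boosting to see whether they improve. I am new to machine learning, so I may have the concept wrong.

I thought about replacing the values in XGBoost, but I am not sure which one to change, or whether it can even work this way.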

Naive Bayes

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size = 0.2, random_state=0)

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

nb = GaussianNB()
nb.fit(X_train, y_train)
nb_pred = nb.predict(X_test)

XGBoost

from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size = 0.2, random_state=0)
model = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.9, gamma=0, learning_rate=0.1, max_delta_step=0, …

python machine-learning boosting xgboost

Jan*_*ane

2019-10-27

1
Votes

1
Answers

63
Views

Gradient boosting variable importance

I have fitted my gradient boosting model and am trying to print out the variable importance. I used the same code that I used with random forest. When I run varImp(), I keep getting an error. The error is below.

Error in code$varImp(object$finalModel, ...) :
  could not find function "relative.influence"

#Split into testing and training
set.seed(7)
Data_Splitting <- createDataPartition(clean_data$Output,p=2/3,list=FALSE)
training = clean_data[Data_Splitting,]
testing = clean_data[-Data_Splitting,]

#Random Forest training part
set.seed(7)
gbm_train <- train(Output~., data=training, method = "gbm", trControl = trainControl(method="cv",number=4,classProbs = T,summaryFunction = twoClassSummary),metric="ROC")

#Plot of variable importance
varImp(gbm_train)
summary.gbm(gbm_train)
plot(varImp(gbm_train))
print(gbm)

#Random Forest Testing phase
gbm_predict = predict(gbm_train,newdata=testing,type="prob")

variables gradient boosting

Dus*_*ith

0
Votes

1
Answers

1445
Views

Tag statistics

boosting ×10

python ×6

machine-learning ×5

xgboost ×4

scikit-learn ×3

apache-spark-mllib ×1

catboost ×1

classification ×1

gradient ×1

imbalanced-data ×1

information-gain ×1

lightgbm ×1

matplotlib ×1

multiclass-classification ×1

plot ×1

prediction ×1

probability ×1

statistics ×1

training-data ×1

tree ×1

variables ×1
