Using scikit Random Forest sample_weights

Question

ADJ*_*ADJ 8 python random-forest scikit-learn

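I have been trying to figure out how scikit's Random Forest sample_weight works, and I cannot explain some of the results I'm seeing. Fundamentally, I need it to balance a classification problem with unbalanced classes. In particular, I expected that if I used a sample_weight array of all 1's, I would get the same result as with sample_weight=None. I also expected that any array of equal weights (i.e. all 1's, all 10's, or all 0.8's...) would give the same result. Perhaps my intuition about weights is wrong in this case. Here is the code: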

import numpy as np
from sklearn import ensemble, metrics, datasets

# Create a synthetic dataset with unbalanced classes
X, y = datasets.make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=4,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=[0.9],
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=0)

# Note: no random_state is set on the forest, so each fit is randomized
model = ensemble.RandomForestClassifier()

w0 = 1  # weight associated with 0's
w1 = 1  # weight associated with 1's

# I should split train and validation, but for the sake of understanding sample_weight I'll skip this step
model.fit(X, y, sample_weight=np.array([w0 if r == 0 else w1 for r in y]))
preds = model.predict(X)
probas = model.predict_proba(X)
ACC = metrics.accuracy_score(y, preds)
precision, recall, thresholds = metrics.precision_recall_curve(y, probas[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y, probas[:, 1])
ROC = metrics.auc(fpr, tpr)
cm = metrics.confusion_matrix(y, preds)
print("ACCURACY:", ACC)
print("ROC:", ROC)
print("F1 Score:", metrics.f1_score(y, preds))
print("TP:", cm[1, 1], cm[1, 1] / cm.sum())
print("FP:", cm[0, 1], cm[0, 1] / cm.sum())
# Precision = TP / (TP + FP), recall = TP / (TP + FN); the original multiplied
# FP and FN by 1.1, presumably a typo for 1.0 to force float division
print("Precision:", cm[1, 1] / (cm[1, 1] + cm[0, 1]))
print("Recall:", cm[1, 1] / (cm[1, 1] + cm[1, 0]))


For instance, I get F1 = 0.9456 when w0 = w1 = 1, F1 = 0.9569 when w0 = w1 = 10, and F1 = 0.9474 with sample_weight=None.

Thanks,

G

Answer 1

eri*_*mjl 7

With the Random Forest algorithm, as the name implies, there is some "randomness" involved.

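You are getting different F1 scores because the Random Forest algorithm (RFA) builds each decision tree from a random subset of your data and then averages over all of the trees. I am therefore not surprised that you get similar (but not identical) F1 scores on each of your runs.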
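One way to check this is to fix the forest's seed and compare fits with and without uniform weights. The sketch below is only an illustration: the dataset size, `n_estimators=50`, and the helper name `fit_predict_proba` are assumptions made for the example, not part of the original question. With the same `random_state`, passing all-ones weights should match `sample_weight=None`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small unbalanced dataset (sizes chosen only to keep the example fast).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=4,
                           weights=[0.9], random_state=0)


def fit_predict_proba(sample_weight):
    """Hypothetical helper: fit a seeded forest and return class probabilities."""
    # The same seed on every call means the bootstrap samples and feature
    # subsets are drawn identically; only the weights differ between fits.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X, y, sample_weight=sample_weight)
    return model.predict_proba(X)


proba_none = fit_predict_proba(None)
proba_ones = fit_predict_proba(np.ones(len(y)))

# Compare the seeded fits; without random_state the two would generally differ.
print("None vs. all ones identical:", np.array_equal(proba_none, proba_ones))
```

Removing `random_state=0` from the helper reproduces the situation in the question, where run-to-run variation hides the effect of the weights.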

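I have tried balancing the weights before. You may want to try balancing the weights according to the size of each class in the population. For example, if you had two classes like this: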

Class A: 5 members
Class B: 2 members


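You may want to balance the weights by assigning 2/7 to each member of Class A and 5/7 to each member of Class B. That is just an idea as a starting point, though. How you weight your classes will depend on the problem you have.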
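The weighting described above can be sketched for the two-class case as follows. The label encoding (0 for Class A, 1 for Class B) and the variable names are assumptions made for illustration. For comparison, the sketch also calls scikit-learn's `compute_sample_weight("balanced", y)`, which uses a different scale but, for two classes, the same ratio between the class weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels matching the example: 5 members of Class A (0), 2 of Class B (1).
y = np.array([0, 0, 0, 0, 0, 1, 1])

counts = np.bincount(y)               # members per class: [5, 2]
other_share = 1.0 - counts / len(y)   # share of the other class: [2/7, 5/7]
manual_weights = other_share[y]       # one weight per sample

# scikit-learn's "balanced" heuristic: n_samples / (n_classes * class_count).
balanced_weights = compute_sample_weight("balanced", y)

print("manual:  ", manual_weights)
print("balanced:", balanced_weights)
```

Either array can be passed as `sample_weight` to `fit`. In both schemes the total weight of each class comes out equal, which is the point of the balancing.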

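Once I set the seed for the random forest, things started to make sense. (3 upvotes)
If you want to set class weights, you should use the optional `class_weight` parameter when initializing `RandomForestClassifier`. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (2 upvotes)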
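A minimal sketch of the `class_weight` suggestion, assuming a small synthetic dataset and `n_estimators=50` purely for illustration. `class_weight="balanced"` asks the forest to weight classes inversely to their frequency, so no per-sample weight array is needed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Small unbalanced dataset (roughly 90% class 0); sizes are illustrative only.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=4,
                           weights=[0.9], random_state=0)

# class_weight is set at construction time; random_state makes runs repeatable.
model = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                               random_state=0)
model.fit(X, y)
preds = model.predict(X)

# Recall on the minority class (label 1), measured on the training data only.
minority_recall = recall_score(y, preds)
print("Minority-class recall:", minority_recall)
```

A dictionary such as `{0: 1, 1: 9}` can be passed instead of `"balanced"` when specific per-class weights are wanted.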

Archived:	12 years ago
Views:	6258
Last activity:	6 years, 8 months ago