min_samples_split must be at least 2 or in (0, 1], got 1

YNr*_*YNr 1 python classification scikit-learn

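min_samples_split must be at least 2 or in (0, 1], got 1

I defined a binary classifier as shown below. When I call it with the "gbc" method (gradient boosting classifier), I get the error above on the last line. featuresClasses is a data frame and featureLabels is a list of feature names.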

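import numpy as np
from operator import itemgetter
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier

def Binary_classifier(method, featureLabels, featuresClasses):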

    membershipIds = list(set(featuresClasses['membershipId']))
    n_membershipIds = len(membershipIds)

    index_rand = np.random.permutation(n_membershipIds)
    test_size = int(0.3 * n_membershipIds)

    membershipIds_test = list(itemgetter(*index_rand[:test_size])(membershipIds))
    membershipIds_train = list(itemgetter(*index_rand[test_size:])(membershipIds))

    data_test = featuresClasses[featuresClasses['membershipId'].isin(membershipIds_test)]
    data_train = featuresClasses[featuresClasses['membershipId'].isin(membershipIds_train)]

    data_test = data_test[data_test['standing'].isin([0, 1])]
    data_train = data_train[data_train['standing'].isin([0, 1])]

    X_test = data_test[featureLabels].values
    y_test = data_test['standing'].values.astype(int)

    X_train = data_train[featureLabels].values
    y_train = data_train['standing'].values.astype(int)

    # -------------------------- Run classifier
    print('Binary classification by %s' % method)

    if method == 'svm':
        classifier = svm.SVC(kernel='linear', probability=True)
        y_score = classifier.fit(X_train, y_train).decision_function(X_test)

    elif method == 'gbc':
        params = {'n_estimators': 200, 'max_depth': 3, 'min_samples_split': 1, 'learning_rate': 0.1, 'loss': 'deviance'}

        classifier = GradientBoostingClassifier(**params)
        y_score = classifier.fit(X_train, y_train).predict(X_test)

Viv*_*mar 5

According to the GradientBoostingClassifier documentation:

min_samples_split : int, float, optional (default=2)

The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number.
    If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) 
               are the minimum number of samples for each split.
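To make the quoted rule concrete, here is a minimal sketch of how the int and float cases could be read. The helper name effective_min_samples_split is hypothetical and is not part of scikit-learn; it only mirrors the wording of the documentation and the error message, and the library's own validation may differ in detail.

```python
import math


def effective_min_samples_split(min_samples_split, n_samples):
    """Hypothetical helper: return the sample count a node needs before it may split.

    Mirrors the documented rule only; this is not scikit-learn's actual code.
    """
    if isinstance(min_samples_split, int):
        # An int is taken literally as a count, so it must be at least 2.
        if min_samples_split < 2:
            raise ValueError(
                "min_samples_split must be at least 2 or in (0, 1], got %r"
                % min_samples_split
            )
        return min_samples_split

    # A float is taken as a fraction of the training samples.
    if not 0.0 < min_samples_split <= 1.0:
        raise ValueError(
            "min_samples_split must be at least 2 or in (0, 1], got %r"
            % min_samples_split
        )
    return math.ceil(min_samples_split * n_samples)


if __name__ == "__main__":
    print(effective_min_samples_split(2, 100))    # int: used as a count
    print(effective_min_samples_split(0.1, 95))   # float: fraction of 95 samples
    print(effective_min_samples_split(1.0, 50))   # float 1.0: all 50 samples
```

Under this reading, 1 and 1.0 are handled by different branches, which is why the type of the literal matters.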

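Your code specifies 'min_samples_split': 1, which is not a valid value. As an int, the minimum is 2. If you meant 1 as a float (that is, 1.0 * n_samples, so a node must contain every training sample before it can be split), specify 'min_samples_split': 1.0. When written as 1, the value is treated as an int, hence the error.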
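The difference can be checked directly with a small experiment. The sketch below is illustrative only: the synthetic data from make_classification and the helper try_fit are assumptions made for the demonstration, not part of the original code, and it assumes a scikit-learn version that validates the parameter when fit is called.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the real training data (an assumption for this demo).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)


def try_fit(min_samples_split):
    """Hypothetical helper: return True if fitting succeeds with this value."""
    clf = GradientBoostingClassifier(
        n_estimators=10,
        max_depth=3,
        min_samples_split=min_samples_split,
        random_state=0,
    )
    try:
        clf.fit(X, y)
    except ValueError:
        # The invalid-parameter error is raised as (a subclass of) ValueError.
        return False
    return True


if __name__ == "__main__":
    for value in (1, 2, 1.0):
        print(repr(value), "accepted" if try_fit(value) else "rejected")
```

In the question's params dictionary, the corresponding change would be to replace 'min_samples_split': 1 with either 2 (the default) or 1.0, depending on which behaviour is intended.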

The confusion arises because the error message shows (0, 1] rather than (0.0, 1.0]. This was also raised as an issue on scikit-learn's GitHub and has been addressed for the next release: