Value of k in KNeighborsClassifier

Question

use*_*396 4 python machine-learning knn python-3.x

I am trying to find the best value of K for KNeighborsClassifier.

This is my code for the iris dataset:

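import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_loop = np.arange(1, 30)
k_scores = []
for k in k_loop:
    knn = KNeighborsClassifier(n_neighbors=k)
    cross_val = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(cross_val.mean())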

I took the mean of cross_val_score in each loop iteration and plotted it.

import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
plt.plot(k_loop, k_scores)
plt.show()

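This is the result.

You can see that the accuracy is higher when k is between 14 and 20.

1) How do I choose the best value of k?

2) Are there other ways to compute and find the best value of K?

3) Any other suggestions for improvement would also be appreciated. I am new to ML.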

Answer 1

Yah*_*hya 5

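Let us first define what K is.

K is the number of voters the algorithm consults to decide which class a given data point belongs to.

In other words, it uses K to draw the boundaries of each class. These boundaries separate each class from the others.

Accordingly, the boundaries become smoother as the value of K increases.

So logically, if we increase K without bound (in practice, up to the number of training points), every point ends up assigned to a single class, whichever one holds the overall majority! This, however, leads to what is known as high bias (i.e. underfitting).

In contrast, if we set K to just 1, the error on the training samples will always be zero, because the closest point to any training data point is the point itself. However, we end up overfitting the boundaries (i.e. high variance), so the model fails to generalize to new, unseen data!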
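The two extremes described above can be checked directly. This is only a small sketch on the iris data from the question; the variable names are mine, and since iris has three equally sized classes, the "overall majority" at K = all points is actually a tie that the classifier resolves to one class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# K = 1: each training point is its own nearest neighbor,
# so scoring on the training data itself looks perfect.
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
train_acc_1 = knn_1.score(X, y)
cv_acc_1 = cross_val_score(knn_1, X, y, cv=10, scoring="accuracy").mean()

# K = number of training points: every query sees the same set of voters,
# so every query receives the same predicted class.
knn_all = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)
pred_all = knn_all.predict(X)

print("K=1   training accuracy:", train_acc_1)
print("K=1   10-fold CV accuracy:", round(cv_acc_1, 3))
print("K=all distinct predicted classes:", np.unique(pred_all))
```

The training accuracy at K=1 says nothing about generalization, which is why the cross-validated score is the number to compare across values of K.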

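Unfortunately, there is no rule of thumb. The choice of K is driven to some extent by the end application and the dataset.

Suggested solution

Use GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator. So we use it to try to find the best value of K.

For me, when I want to set an upper bound on K, I do not go beyond the number of elements in the largest class, and it has not let me down so far (see the example later to see what I am talking about).

Example: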

import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# get the max class with respect to the number of elements
max_class = np.max(np.bincount(y))

# you can add other parameters after doing your homework research
# for example, you can add 'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute']
grid_param = {'n_neighbors': range(1, max_class)}

model = KNeighborsClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2)
clf = GridSearchCV(model, grid_param, cv=cv, scoring='accuracy')
clf.fit(X, y)

print("Best Estimator: \n{}\n".format(clf.best_estimator_))
print("Best Parameters: \n{}\n".format(clf.best_params_))
print("Best Score: \n{}\n".format(clf.best_score_))

Result

Best Estimator:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=17, p=2,
           weights='uniform')

Best Parameters:
{'n_neighbors': 17}

Best Score:
0.98
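A single "best" K can hide the fact that several neighboring values score almost the same, as the plateau in the question suggests. A rough sketch of how one might look at that, using the cv_results_ attribute of a fitted GridSearchCV: the range 1 to 29 mirrors the question's k_loop, and random_state=0 is my own addition so the folds are reproducible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# random_state is an assumption added here so the folds are reproducible.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 30)},
                   cv=cv, scoring="accuracy")
clf.fit(X, y)

# One entry per candidate K: mean and standard deviation over all test folds.
ks = np.array([p["n_neighbors"] for p in clf.cv_results_["params"]])
mean = clf.cv_results_["mean_test_score"]
std = clf.cv_results_["std_test_score"]

# Show the five best-scoring values of K with their fold-to-fold spread.
for i in np.argsort(-mean)[:5]:
    print(f"k={ks[i]:2d}  mean accuracy={mean[i]:.3f}  std={std[i]:.3f}")
```

If the top few means differ by much less than their standard deviations, the data do not really distinguish between those values of K, and any of them is a defensible choice.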

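Update regarding RepeatedStratifiedKFold

Simply put, it is KFold repeated n_repeats times. Why? Because averaging over several repetitions with different random splits reduces the variance of the score and gives you a better statistical estimate.

In addition, it is Stratified: it seeks to ensure that each class is represented in each test fold in roughly the same proportion as in the full dataset (i.e. each fold is representative of all strata of the data).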
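To make the two properties concrete, here is a minimal sketch that counts the splits and the classes in each test fold for iris. The random_state value is my own addition for reproducibility, not part of the example above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = load_iris(return_X_y=True)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)

# 10 folds repeated 2 times gives 20 train/test splits in total.
n_total = cv.get_n_splits(X, y)
print("total splits:", n_total)

# Count how many samples of each class land in every test fold.
fold_counts = [np.bincount(y[test_idx]) for _, test_idx in cv.split(X, y)]
print("class counts in first test fold:", fold_counts[0])
```

With 50 samples per class and 10 folds, every test fold holds 5 samples of each class, so each score is computed on the same class mix as the full dataset.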
