Faster kNN algorithm in Python

Eoi*_*igh · 4 · tags: python, machine-learning, knn

I want to code my own kNN algorithm from scratch because I need to weight the features. The problem is that my program is still really slow despite removing the for loops and using built-in numpy functionality.

Can anyone suggest a way to speed this up? I don't use np.sqrt for the L2 distance because it's unnecessary and actually slows everything down quite a bit.

import numpy as np

class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights

    Returns: predictions of k-NN.
    """

    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):        
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.

        Returns a list of predictions for k-NN classifier
        """

        # np.fromiter is evaluated purely for its side effect:
        # __helper appends one prediction per query case.
        np.fromiter((self.__helper(qc) for qc in testing_data), float)
        return self.predictions


    def __helper(self, qc):
        # Distance from the query case to every training example,
        # paired with each example's row index for sorting.
        neighbours = np.fromiter((self.__weighted_euclidean(qc, x) for x in self.X_train), float)
        neighbours = np.array([neighbours]).T
        indexes = np.array([range(len(self.X_train))]).T
        neighbours = np.append(indexes, neighbours, axis=1)

        # Sort by second column - distances
        neighbours = neighbours[neighbours[:,1].argsort()]  
        k_cases = neighbours[ :self.k]
        indexes = [x[0] for x in k_cases]

        y_answers = [self.y_train[int(x)] for x in indexes]
        answer = max(set(y_answers), key=y_answers.count)  # get most common value
        self.predictions.append(answer)


    def __weighted_euclidean(self, qc, other):
        """
        Custom weighted euclidean distance

        Returns: a floating point number
        """

        return np.sum( ((qc - other)**2) * self.weights )
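For comparison, here is a minimal sketch (the names mirror the class above; the random data is purely illustrative) of how the per-query distances could be computed in one NumPy broadcast instead of a generator expression:

import numpy as np

def weighted_distances(qc, X_train, weights):
    """Squared weighted Euclidean distance from one query case to
    every training row, computed in a single vectorized operation."""
    diff = X_train - qc                        # shape: (n_samples, n_features)
    return np.sum(diff * diff * weights, axis=1)

# Illustrative data
rng = np.random.default_rng(0)
X_train = rng.random((1000, 10))
weights = rng.random(10)
qc = rng.random(10)

distances = weighted_distances(qc, X_train, weights)
k = 5
k_idx = np.argpartition(distances, k)[:k]      # k nearest indices, unordered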

Ami*_*imi · 7

You can take a look at this great article introducing faiss:
"Make kNN 300 times faster than Scikit-learn's in 20 lines!"
It is developed in C++ and can run on the GPU.

import numpy as np
import faiss


class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y

    def predict(self, X):
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self.y[indices]
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions
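A minimal usage sketch with made-up data (the shapes, dtypes, and label range are assumptions; faiss must be installed, e.g. via the faiss-cpu package):

import numpy as np

# illustrative data: 1000 training points, 20 features, 3 integer classes
rng = np.random.default_rng(0)
X_train = rng.random((1000, 20)).astype(np.float32)
y_train = rng.integers(0, 3, size=1000)      # bincount voting needs int labels
X_test = rng.random((50, 20)).astype(np.float32)

knn = FaissKNeighbors(k=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)                 # one label per test row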


jak*_*vdp · 5

Scikit-learn uses a KD Tree or Ball Tree to compute nearest neighbors in O[N log(N)] time. Your algorithm is a direct approach that requires O[N^2] time, and it also nests for loops inside Python generator expressions, which adds significant computational overhead compared to optimized code.

If you want to compute weighted k-neighbors classification using a fast O[N log(N)] implementation, you can use sklearn.neighbors.KNeighborsClassifier with the weighted minkowski metric, setting p=2 (for Euclidean distance) and setting w to your desired weights. For example:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
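Note: recent scikit-learn releases removed the 'wminkowski' metric (mirroring its removal from SciPy). If the snippet above errors on a newer version, the same weighting should be achievable by passing w to the standard minkowski metric; a hedged sketch:

from sklearn.neighbors import KNeighborsClassifier

# newer scikit-learn: pass the weight vector to 'minkowski' directly
model = KNeighborsClassifier(metric='minkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)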

  • Replying later with more information: kd_tree and ball_tree are indeed guaranteed to find the exact nearest neighbors. "Brute force" exists for two reasons: (1) brute force is faster for small datasets, and (2) it is a simpler algorithm, which makes it useful for testing. You can confirm that the algorithms give the same results in the [sklearn unit tests](https://github.com/scikit-learn/scikit-learn/blob/88be2abb7b7f7450dc569e0065e672a0676e4130/sklearn/neighbors/tests/test_neighbors.py#L72-L96), or check directly, as in the sketch below. (3 upvotes)
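That exactness is easy to verify yourself by comparing the brute-force and tree-based back ends on the same data; a small sketch with illustrative random data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((500, 8))

brute = NearestNeighbors(n_neighbors=5, algorithm='brute').fit(X)
tree = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)

_, idx_brute = brute.kneighbors(X)
_, idx_tree = tree.kneighbors(X)
print(np.array_equal(idx_brute, idx_tree))   # True: both find exact neighbours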