I want to write my own kNN algorithm from scratch because I need to weight the features. The problem is that my program is still very slow, even though I removed the for loops and used built-in numpy functionality.
Can anyone suggest a way to speed this up? I don't take np.sqrt of the L2 distance because it is unnecessary and actually slows everything down considerably.
import numpy as np

class GlobalWeightedKNN:
    """
    A k-NN classifier with feature weights.
    Returns: predictions of k-NN.
    """
    def __init__(self):
        self.X_train = None
        self.y_train = None
        self.k = None
        self.weights = None
        self.predictions = list()

    def fit(self, X_train, y_train, k, weights):
        self.X_train = X_train
        self.y_train = y_train
        self.k = k
        self.weights = weights

    def predict(self, testing_data):
        """
        Takes a 2d array of query cases.
        Returns a list of predictions for the k-NN classifier.
        """
        self.predictions = list()  # reset so repeated calls don't accumulate
        np.fromiter((self.__helper(qc) for qc in testing_data), float)
        return self.predictions

    def __helper(self, qc):
        neighbours = np.fromiter((self.__weighted_euclidean(qc, x) for x in self.X_train), float)
        neighbours = np.array([neighbours]).T
        indexes = np.array([range(len(self.X_train))]).T
        neighbours = np.append(indexes, neighbours, axis=1)
        # Sort by second column - distances
        neighbours = neighbours[neighbours[:, 1].argsort()]
        k_cases = neighbours[:self.k]
        indexes = [x[0] for x in k_cases]
        y_answers = [self.y_train[int(x)] for x in indexes]
        answer = max(set(y_answers), key=y_answers.count)  # get most common value
        self.predictions.append(answer)
        return answer

    def __weighted_euclidean(self, qc, other):
        """
        Custom weighted euclidean distance
        returns: floating point number
        """
        return np.sum(((qc - other)**2) * self.weights)
You can take a look at this great article introducing faiss:
Make kNN 300 times faster than Scikit-learn's in 20 lines!
It runs on the GPU and is developed in C++.
import numpy as np
import faiss


class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y

    def predict(self, X):
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self.y[indices]
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions
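Note that `IndexFlatL2` computes plain (unweighted) L2 distances. If you need to keep your feature weights while using faiss, one option is to pre-scale the columns by the square root of the weights before indexing, since the weighted squared distance equals the plain squared distance between the rescaled points. A minimal sketch (the helper name is mine, not from the article):

```python
import numpy as np

def apply_feature_weights(X, weights):
    # sum(w * (a - b)**2) == ||sqrt(w)*a - sqrt(w)*b||**2, so scaling
    # columns by sqrt(w) makes an unweighted L2 index behave as weighted.
    # faiss expects float32 input.
    return (X * np.sqrt(weights)).astype(np.float32)

# Usage with the FaissKNeighbors class above (assuming faiss is installed):
# knn = FaissKNeighbors(k=5)
# knn.fit(apply_feature_weights(X_train, weights), y_train)
# preds = knn.predict(apply_feature_weights(X_test, weights))
```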
Scikit-learn uses a KD tree or ball tree to compute the nearest neighbours in O[N log(N)] time. Your algorithm is a direct approach that requires O[N^2] time, and it also uses nested for loops inside Python generator expressions, which add significant computational overhead compared to optimized code.
If you want a fast O[N log(N)] implementation of weighted k-neighbours classification, you can use sklearn.neighbors.KNeighborsClassifier with the weighted Minkowski metric, setting p=2 (for Euclidean distance) and setting w to your desired weights. For example:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(metric='wminkowski', p=2,
                             metric_params=dict(w=weights))
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)