Osi*_*ris · 6 · Tags: nearest-neighbor, euclidean-distance, apache-spark, pyspark, spark-dataframe
Given 1 billion records, each containing the following information:
ID x1 x2 x3 ... x100
1 0.1 0.12 1.3 ... -2.00
2 -1 1.2 2 ... 3
...
For each ID above, I want to find the top-10 closest IDs, based on the Euclidean distance of their vectors (x1, x2, ..., x100).
What is the best way to compute this?
A brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of a k-nearest-neighbors algorithm, such as the one provided by scikit-learn, then broadcast the resulting arrays of indices and distances and go from there.
The steps in this case would be:
1 - Vectorize the features as suggested by Bryce, and have your vectorization method return a list (or numpy array) of floats with as many elements as there are features.
2 - Fit scikit-learn's NearestNeighbors to your data:
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
3 - Run the trained algorithm on your vectorized data (training and query data are the same in your case):
distances, indices = nbrs.kneighbors(vectorized_data)
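The three steps can be sketched end to end on toy data (the sizes, the random features, and the `n_neighbors=11` twist are my own illustrative choices, not your billion-row dataset):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the vectorized records: 1000 rows x 5 features
# (the real data would be ~1e9 rows x 100 features).
rng = np.random.RandomState(42)
vectorized_data = rng.rand(1000, 5)

# Ask for 11 neighbors: when querying with the training data itself,
# each point's nearest neighbor is the point itself at distance 0.
nbrs = NearestNeighbors(n_neighbors=11, algorithm='auto').fit(vectorized_data)
distances, indices = nbrs.kneighbors(vectorized_data)

# Drop the self-match in column 0 to keep the top-10 true neighbors.
top10_ids = indices[:, 1:]
top10_dist = distances[:, 1:]
```

The row position in `vectorized_data` serves as the ID here; with real records you would keep a separate ID column and map the row indices back through it.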
Steps 2 and 3 will run on a single node (your PySpark driver) and are not parallelizable in this case, so that node needs enough memory to hold the whole dataset. In my case, with 1.5 million records and 4 features, it took a second or two.
Until we get a good NN implementation for Spark, I guess we will have to stick with these workarounds. If you would rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn
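One way to keep the query side distributed is to fit the model on the driver, broadcast the fitted estimator, and call `kneighbors` inside `mapPartitions`. A minimal sketch of the per-partition logic, without a real SparkContext (`query_partition` and all the toy sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fit once on the driver against the reference vectors (toy sizes).
rng = np.random.RandomState(0)
reference = rng.rand(200, 4)
nbrs = NearestNeighbors(n_neighbors=10).fit(reference)

# In Spark you would broadcast the fitted model:
#   bc_nbrs = sc.broadcast(nbrs)
#   results = rdd.mapPartitions(query_partition).collect()
# Here we call the partition function directly on one simulated
# partition of (id, vector) rows.
def query_partition(rows):
    rows = list(rows)
    ids = [r[0] for r in rows]
    vecs = np.array([r[1] for r in rows])
    # Inside Spark this would be bc_nbrs.value.kneighbors(vecs)
    dist, idx = nbrs.kneighbors(vecs)
    return list(zip(ids, idx.tolist(), dist.tolist()))

partition = [(i, rng.rand(4)) for i in range(5)]
results = query_partition(iter(partition))
```

This only helps with the query step; the fit in step 2 still happens on one node, so the driver memory bound from the answer above still applies.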