How to find the nearest neighbors of 1 billion records with Spark?

Osi*_*ris 6 nearest-neighbor euclidean-distance apache-spark pyspark spark-dataframe

Given 1 billion records, each containing the following information:

    ID  x1  x2  x3  ... x100
    1   0.1  0.12  1.3  ... -2.00
    2   -1   1.2    2   ... 3
    ...

For each ID above, I want to find the top 10 closest IDs, based on the Euclidean distance of their vectors (x1, x2, ..., x100).

What is the best way to compute this?

xen*_*yon 7

As it happens, I have a solution for this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/

The gist of it is:

  • use sklearn's k-NN fit() method centrally
  • but then use sklearn's k-NN kneighbors() method in a distributed way
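
The two bullets above can be sketched as follows. This is a minimal illustration on toy data: the model is fit once on the driver, and the kneighbors() queries are run per chunk (in real Spark you would sc.broadcast() the fitted model and call kneighbors() inside rdd.mapPartitions(); plain Python chunks stand in for partitions here so the sketch is self-contained):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the 1B x 100 dataset.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 100))

# Step 1: fit() centrally, on the driver.
nbrs = NearestNeighbors(n_neighbors=10).fit(data)

# Step 2: kneighbors() distributed -- each partition queries the broadcast
# model. Simulated here with np.array_split instead of mapPartitions.
def query_partition(chunk, model=nbrs):
    return model.kneighbors(chunk)

results = [query_partition(chunk) for chunk in np.array_split(data, 4)]
indices = np.vstack([idx for _, idx in results])
print(indices.shape)  # (1000, 10)
```

Since the query set equals the training set, each point's nearest neighbor is itself (distance 0), so the first index column is just 0..n-1.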


arc*_*nic 5

A brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of the k-Nearest Neighbors algorithm, such as the one provided by scikit-learn, then broadcast the resulting arrays of indices and distances and go further from there.

The steps in this case would be:

1- vectorize the features as suggested by Bryce, and have your vectorizing method return a list (or numpy array) of floats with as many elements as your features
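
Step 1 might look like the following on the driver side; the record layout and field names here are hypothetical, standing in for whatever your rows actually look like:

```python
import numpy as np

# Hypothetical records in the question's layout: an ID plus 100 features.
records = [
    {"ID": 1, "features": [0.1, 0.12, 1.3] + [0.0] * 97},
    {"ID": 2, "features": [-1.0, 1.2, 2.0] + [0.0] * 97},
]

def vectorize(record):
    # Return a float numpy array with one entry per feature.
    return np.asarray(record["features"], dtype=np.float64)

vectorized_data = np.vstack([vectorize(r) for r in records])
print(vectorized_data.shape)  # (2, 100)
```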

2- fit your scikit-learn nn to your data:

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)

3- run the trained algorithm on your vectorized data (training and query data are the same in your case)

distances, indices = nbrs.kneighbors(qpa)  # qpa: your query points; same as vectorized_data here

Steps 2 and 3 will run on your pyspark node and are not parallelizable in this case. You will need to have enough memory on this node. In my case, with 1.5 million records and 4 features, it took a second or two.
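
For a rough sense of what "enough memory" means, here is a back-of-envelope estimate, assuming the question's 100 features stored as float64 and (hypothetically) the answer's 1.5 million records:

```python
# Rough driver-memory check before collecting vectorized data:
# n records x 100 features x 8 bytes per float64.
n = 1_500_000
bytes_needed = n * 100 * 8
print(bytes_needed / 1e9)  # ~1.2 GB just for the raw feature matrix
```

At the question's actual scale of 1e9 records the same arithmetic gives ~800 GB, which is why this single-node approach does not stretch that far.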

Until we get a good implementation of NN for Spark, I guess we'll have to stick to these workarounds. If you'd rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn

  • Actually, step 3 in this answer *is* parallelizable: sklearn's k-NN kneighbors() method can be distributed with Spark! I posted how here: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/ (4)