给定两个二维点列表，如何为第一个列表中的每个点找到第二个列表中最接近的点？

Question

给定两个二维点列表，如何为第一个列表中的每个点找到第二个列表中最接近的点？

我有两个随机排序的 2d 点的大型 numpy 数组，假设它们是 A 和 B。我需要做的是找到两个数组之间“匹配”的数量，其中匹配是 A 中的一个点（称之为A'）在某个给定半径 R 内，并且 B 中的一个点（称为 B'）。这意味着 A 中的每个点都必须与 B 中的 1 个点匹配或没有点匹配。返回两个数组之间匹配的列表索引也很好，但这不是必需的。由于在这个半径R内可以有很多点，所以似乎最好找到B中距离A'最近的点，然后检查它是否在半径R内。这可以简单地用距离公式来测试dx^2 + dy^2。显然，有一个遍历两个数组的强力 O(n^2) 解决方案，但我需要更快的东西，希望 O(n log n) 。

我所看到的是，Voronoi 图可以用于解决这样的问题，但是我不确定这是如何实现的。我不熟悉 Voronoi 图，所以我用生成它scipy.spatial.Voronoi。是否有使用这些图来解决此问题的快速算法，或者还有其他算法吗？

Answer 1

Air*_*uid 10

我认为有几种选择。我进行了一个小型比较测试来探索一些。前几个仅找出有多少个点相互在彼此的半径内，以确保我在问题的主要部分上获得一致的结果。它没有回答关于寻找最接近的问题的邮件，我认为这只是对其中一些进行更多的工作——最后一个选项是这样做的，请参见帖子底部。问题的驱动因素是进行所有比较，我认为您可以通过某种排序（这里的最后一个概念）来限制比较。

朴素的Python

使用蛮力点对点比较。显然 O(n^2)。

Scipy的`cdist`模块

对于“小”数据来说效果很好且速度最快。对于大数据，由于内存中矩阵输出的大小，这开始爆炸。对于 1M x 1M 应用程序可能不可行。

Scipy的`KDTree`模块

来自其他解决方案。快，但不如“切片”（如下）那么快cdist。也许有一种不同的方法来使用 KDTree 来完成这项任务......我对此不是很有经验。这种方法（如下）似乎合乎逻辑。

划分比较数组

这非常有效，因为您并不对所有距离感兴趣，您只想要半径内的距离。因此，通过对目标数组进行排序并仅在其周围的矩形窗口内查找“竞争者”，您可以使用本机 python 获得非常快的性能，并且不会出现“内存爆炸”。可能仍然有一点“留在桌子上”，可以通过嵌入cdist到这个实现中或（吞下）尝试多线程来增强。

其他想法...

这是一个紧密的“数学”循环，因此在 cython 中尝试一些东西或拆分一个数组和多线程将是新颖的。对结果进行腌制，这样您就不必经常运行这似乎是谨慎的做法。

我认为您可以很容易地使用数组中的索引来扩充元组，以获得匹配列表。

我的旧 iMac 通过切片在 90 秒内完成 100K x 100K，因此这对于 1M x 1M 来说并不是一个好兆头

比较：

# distance checker

from random import uniform
import time
import numpy as np
from scipy.spatial import distance, KDTree
from bisect import bisect
from operator import itemgetter
import sys
from matplotlib import pyplot as plt
sizes = [100, 500, 1000, 2000, 5000, 10000, 20000]
#sizes = [20_000, 30_000, 40_000, 50_000, 60_000]   # for the playoffs.  :)
naive_times = []
cdist_times = []
kdtree_times = []
sectioned_times = []
delta = 0.1

for size in sizes:
    print(f'\n *** running test with vectors of size {size} ***')
    r = 20  # radius to match
    r_squared = r**2

    A = [(uniform(-1000,1000), uniform(-1000,1000)) for t in range(size)]
    B = [(uniform(-1000,1000), uniform(-1000,1000)) for t in range(size)]

    # naive python
    print('naive python')
    tic = time.time()
    matches = [(p1, p2) for p1 in A
                        for p2 in B
                        if (p1[0] - p2[0])**2 + (p1[1] - p2[1])**2 <= r_squared]

    toc = time.time()
    print(f'found: {len(matches)}')
    naive_times.append(toc-tic)
    print(toc-tic)
    print()

    # using cdist module
    print('cdist')
    tic = time.time()
    dist_matrix = distance.cdist(A, B, 'euclidean')
    result = np.count_nonzero(dist_matrix<=r)
    toc = time.time()
    print(f'found: {result}')
    cdist_times.append(toc-tic)
    print(toc-tic)
    print()

    # KDTree
    print('KDTree')
    tic = time.time()
    my_tree = KDTree(A)
    results = my_tree.query_ball_point(B, r=r)
    # for count, r in enumerate(results):
    #   for t in r:
    #       print(count, A[t])

    result = sum(len(lis) for lis in results)
    toc = time.time()
    print(f'found: {result}')
    kdtree_times.append(toc-tic)
    print(toc-tic)
    print()

    # python with sort and sectioning
    print('with sort and sectioning')
    result = 0
    tic = time.time()
    B.sort()
    for point in A:
        # gather the neighborhood in x-dimension within x-r <= x <= x+r+1
        # if this has any merit, we could "do it again" for y-coord....
        contenders = B[bisect(B,(point[0]-r-delta, 0)) : bisect(B,(point[0]+r+delta, 0))]
        # further chop down to the y-neighborhood
        # flip the coordinate to support bisection by y-value
        contenders = list(map(lambda p: (p[1], p[0]), contenders))
        contenders.sort()
        contenders = contenders[bisect(contenders,(point[1]-r-delta, 0)) : 
                                bisect(contenders,(point[1]+r+delta, 0))]
        # note (x, y) in contenders is still inverted, so need to index properly
        matches = [(point, p2) for p2 in contenders if (point[0] - p2[1])**2 + (point[1] - p2[0])**2 <= r_squared]
        result += len(matches)
    toc = time.time()
    print(f'found: {result}')
    sectioned_times.append(toc-tic)
    print(toc-tic)
print('complete.')

plt.plot(sizes, naive_times, label = 'naive')
plt.plot(sizes, cdist_times, label = 'cdist')
plt.plot(sizes, kdtree_times, label = 'kdtree')
plt.plot(sizes, sectioned_times, label = 'sectioning')
plt.legend()
plt.show()

Run Code Online (Sandbox Code Playgroud)

尺寸和图之一的结果：

 *** running test with vectors of size 20000 ***
naive python
found: 124425
101.40657806396484

cdist
found: 124425
2.9293079376220703

KDTree
found: 124425
18.166933059692383

with sort and sectioning
found: 124425
2.3414530754089355
complete.

Run Code Online (Sandbox Code Playgroud)

注意：在第一个图中，cdist覆盖了sectioning. 季后赛显示在第二张图中。

“季后赛”

修改后的切片代码

此代码查找半径内点内的最小值。运行时相当于上面的分段代码。

print('with sort and sectioning, and min finding')
result = 0
pairings = {}  
tic = time.time()
B.sort()
def dist_squared(a, b): 
    # note (x, y) in point b will be inverted (below), so need to index properly
    return (a[0] - b[1])**2 + (a[1] - b[0])**2
for idx, point in enumerate(A):
    # gather the neighborhood in x-dimension within x-r <= x <= x+r+1
    # if this has any merit, we could "do it again" for y-coord....
    contenders = B[bisect(B,(point[0]-r-delta, 0)) : bisect(B,(point[0]+r+delta, 0))]
    # further chop down to the y-neighborhood
    # flip the coordinate to support bisection by y-value
    contenders = list(map(lambda p: (p[1], p[0]), contenders))
    contenders.sort()
    contenders = contenders[bisect(contenders,(point[1]-r-delta, 0)) : 
                            bisect(contenders,(point[1]+r+delta, 0))]
    matches = [(dist_squared(point, p2), point, p2) for p2 in contenders 
        if dist_squared(point, p2) <= r_squared]
    if matches:
        pairings[idx] = min(matches)[1]  # pair the closest point in B with the point in A
toc = time.time()
print(toc-tic)

Run Code Online (Sandbox Code Playgroud)

归档时间：	5 年，10 月前
查看次数：	4633 次
最近记录：	4 年前

给定两个二维点列表，如何为第一个列表中的每个点找到第二个列表中最接近的点？

朴素的Python

Scipy的cdist模块

Scipy的KDTree模块

划分比较数组

其他想法...

“季后赛”

修改后的切片代码

Scipy的`cdist`模块

Scipy的`KDTree`模块