Search a numpy array ((x, y, z)...) for the z matching the nearest x, y

Question

tro*_*rau 5 python interpolation numpy scipy

I have a very large array of elevation data in a format like this:

triplets = ((x0, y0, z0), 
            (x1, y1, z1), 
            ... ,
            (xn, yn, zn))

where x, y, z are all floats in metres. You can create suitable test data matching this format with:

import numpy as np

x = np.arange(20, 40, dtype=np.float64)
y = np.arange(30, 50, dtype=np.float64)
z = np.random.random(20) * 25.0
# stack the 1-D arrays as columns so that each row is one (x, y, z) triplet
triplets = np.column_stack((x, y, z))

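I want to be able to efficiently find the corresponding z value for a given (x, y) pair. My research so far has led to more questions. Here is what I have:

Iterating over all the triplets: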

query = (a, b) # where a, b are the x and y coordinates we're looking for
for i in triplets:
  if i[0] == query[0] and i[1] == query[1]:
    result = i[2]

Downsides: slow; a, b must exist exactly in the data, which is a problem when comparing floats.
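For reference, the loop above can be vectorised and made tolerant of floating-point noise. This is only a minimal sketch: the helper name lookup_exact and the tolerance atol are assumptions for illustration, not part of the original question. It is still a linear scan per query, so it addresses the float-comparison issue rather than the speed issue on very large arrays.

```python
import numpy as np

# Test data in the (x, y, z) row layout described in the question.
x = np.arange(20, 40, dtype=np.float64)
y = np.arange(30, 50, dtype=np.float64)
z = np.random.random(20) * 25.0
triplets = np.column_stack((x, y, z))


def lookup_exact(triplets, a, b, atol=1e-9):
    """Return the z values of all rows whose (x, y) equals (a, b) within atol.

    Hypothetical helper: atol is an assumed absolute tolerance in metres.
    """
    mask = (np.isclose(triplets[:, 0], a, rtol=0.0, atol=atol)
            & np.isclose(triplets[:, 1], b, rtol=0.0, atol=atol))
    return triplets[mask, 2]


matches = lookup_exact(triplets, 25.0, 35.0)   # (25, 35) is in the test data
missing = lookup_exact(triplets, 25.5, 35.0)   # no such point: empty result
```

An empty result signals that no stored point lies within the tolerance, which is exactly the case where a nearest-neighbour or interpolating approach is needed instead.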

Using scipy.spatial.cKDTree to find the nearest point:

from scipy.spatial import cKDTree

points = triplets[:,0:2] # drops the z column
tree = cKDTree(points)
idx = tree.query((a, b))[1] # this returns a tuple, we want the index
query = tree.data[idx]
result = triplets[idx, 2]

Downside: returns the nearest point rather than an interpolated value.

Using interp2d, as suggested in the comments:
```
f = interp2d(x, y, z)
result = f(a, b)
```
Downside: does not work on large datasets. I get "OverflowError: Too many data points to interpolate" when I run it on my real data. (My real data is about 11 million points.)

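So the question is: is there a straightforward way to do this that I am overlooking? Is there a way to reduce the downsides of the approaches above?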

Answer 1

ali*_*i_m 4

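If you want to interpolate the result, rather than just finding the z value of the nearest neighbour, I would consider doing something like the following:

1. Use a k-d tree to partition your data points according to their (x, y) coordinates
2. For a given (xi, yi) point to interpolate, find its k nearest neighbours
3. Take the average of their z values, weighted according to their distance from (xi, yi)

The code might look something like this: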

import numpy as np
from scipy.spatial import cKDTree

# some fake (x, y, z) data
XY = np.random.rand(10000, 2) - 0.5
Z = np.exp(-((XY ** 2).sum(1) / 0.1) ** 2)

# construct a k-d tree from the (x, y) coordinates
tree = cKDTree(XY)

# a random point to query
xy = np.random.rand(2) - 0.5

# find the k nearest neighbours (say, k=3)
distances, indices = tree.query(xy, k=3)

# the z-values for the k nearest neighbours of xy
z_vals = Z[indices]

# take the average of these z-values, weighted by 1 / distance from xy
dw_avg = np.average(z_vals, weights=(1. / distances))

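It is worth experimenting with the value of k, the number of nearest neighbours to take the average of. This is essentially a crude form of kernel density estimation, where the value of k controls the degree of "smoothness" you impose on the underlying distribution of z-values. A larger k results in more smoothness.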
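One way to experiment with k is to compare the estimates against points whose true z is known. The sketch below assumes the same synthetic surface as the fake data in the answer; the helper true_z, the held-out query points, and the particular k values are illustrative assumptions. With real data you would hold out a subset of measured points instead.

```python
import numpy as np
from scipy.spatial import cKDTree


def true_z(xy):
    # The same synthetic surface used for the fake data in the answer.
    return np.exp(-((xy ** 2).sum(axis=-1) / 0.1) ** 2)


rng = np.random.default_rng(0)
XY = rng.random((10000, 2)) - 0.5
Z = true_z(XY)
tree = cKDTree(XY)

# Held-out query points where the true z is known (assumed test setup).
queries = rng.random((200, 2)) - 0.5
target = true_z(queries)

rmse = {}
for k in (1, 3, 10, 30):
    distances, indices = tree.query(queries, k=k)
    # For k=1 the results are 1-D; reshape so every case is (n_queries, k).
    distances = distances.reshape(len(queries), -1)
    indices = indices.reshape(len(queries), -1)
    # Clip tiny distances so the inverse-distance weights stay finite.
    weights = 1.0 / np.maximum(distances, 1e-12)
    estimate = (weights * Z[indices]).sum(axis=1) / weights.sum(axis=1)
    rmse[k] = float(np.sqrt(np.mean((estimate - target) ** 2)))

for k, err in rmse.items():
    print(f"k={k:2d}  RMSE={err:.5f}")
```

Which k works best depends on how dense and how noisy the data are, so the numbers from this synthetic surface should not be read as a recommendation.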

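Similarly, you might want to experiment with how you weight the contributions of points according to their distance from (xi, yi), depending on how you think similarity in z decreases with increasing x, y distance. For example, you might want to weight by (1 / distances ** 2) rather than (1 / distances).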
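The weighting can be exposed as a parameter so that different exponents are easy to compare. This is a hedged sketch rather than the answer's own code: the function name idw_point and the power and eps parameters are assumptions. It also handles the case where the query coincides with a data point, in which case 1 / distance would divide by zero.

```python
import numpy as np
from scipy.spatial import cKDTree


def idw_point(tree, z, xy, k=3, power=1.0, eps=1e-12):
    """Inverse-distance-weighted z estimate at a single (x, y) point.

    Hypothetical helper: weights are 1 / distance ** power.
    """
    distances, indices = tree.query(xy, k=k)
    distances = np.atleast_1d(distances)   # k=1 returns scalars
    indices = np.atleast_1d(indices)
    # Neighbours are sorted by distance, so index 0 is the closest one.
    # If the query sits on a data point, return that point's z directly.
    if distances[0] <= eps:
        return float(z[indices[0]])
    weights = 1.0 / distances ** power
    return float(np.average(z[indices], weights=weights))


rng = np.random.default_rng(0)
XY = rng.random((10000, 2)) - 0.5
Z = np.exp(-((XY ** 2).sum(1) / 0.1) ** 2)
tree = cKDTree(XY)

xy = np.array([0.1, -0.2])
z_p1 = idw_point(tree, Z, xy, power=1.0)   # weights 1 / d
z_p2 = idw_point(tree, Z, xy, power=2.0)   # weights 1 / d ** 2
z_hit = idw_point(tree, Z, XY[0])          # query exactly on a data point
```

A larger power makes the estimate lean more heavily on the closest neighbour, so it behaves more like a plain nearest-neighbour lookup.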

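In terms of performance, constructing and searching k-d trees are both very efficient. Bear in mind that you only need to build the tree once for your dataset, and, if necessary, you can query multiple points at a time by passing an (N, 2) array to tree.query().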
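A batched query might look like the sketch below. The array names and the number of query points are assumptions; the weighted average is written out with array operations so that it applies to every query row at once.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
XY = rng.random((10000, 2)) - 0.5
Z = np.exp(-((XY ** 2).sum(1) / 0.1) ** 2)

# Build the tree once and reuse it for every query.
tree = cKDTree(XY)

# An (N, 2) array of query points (N = 500 is an arbitrary choice).
queries = rng.random((500, 2)) - 0.5
distances, indices = tree.query(queries, k=3)   # both have shape (500, 3)

# Clip tiny distances so a query lying on a data point does not divide by zero.
weights = 1.0 / np.maximum(distances, 1e-12)

# Row-wise weighted average of the neighbours' z values.
z_est = (weights * Z[indices]).sum(axis=1) / weights.sum(axis=1)
```

Each row of z_est matches what np.average(Z[indices[i]], weights=weights[i]) would give for the i-th query point.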

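Tools for approximate nearest-neighbour search, such as FLANN, might be quicker, but these are usually more useful when the dimensionality of your data is very high.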
