Find the nearest point in another DataFrame (with a lot of data)

Question


Arn*_*d H 5 python optimization nearest-neighbor pandas geopandas

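The problem is simple. I have two DataFrames:

One with 90,000 apartments and their latitude/longitude
One with 3,000 pharmacies and their latitude/longitude

I want to create a new variable for every apartment: "distance to the nearest pharmacy".

I tried two methods for this, and both take far too long:

First method: I built a matrix with the apartments as rows and the pharmacies as columns, holding the distance between each pair at the intersection. I then take the minimum of each row to get a column vector of 90,000 values.

I simply used a double for loop with NumPy: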

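import numpy as np
# Brute force: compare every apartment with every pharmacy (m * n iterations)
m, n = len(result['latitude']), len(pharma['lat'])
M = np.ones((m, n))
for i in range(m):
    for j in range(n):
        # Only compare points that lie in the same department
        if result['Code departement'][i] == pharma['departement'][j]:
            # Squared difference in degrees, used as an approximate distance
            M[i, j] = ((pharma['lat'][j] - result['latitude'][i]) ** 2
                       + (pharma['lng'][j] - result['longitude'][i]) ** 2)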


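PS: I know this formula is wrong for latitude/longitude, but the apartments are all in the same region, so it is a good approximation.

Second method: I used the solution from this thread (the same problem, but with less data): https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe

I used geopandas and shapely's nearest_points: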

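from shapely.ops import nearest_points

# Merge all pharmacy points into a single multi-point geometry
pts3 = pharma.geometry.unary_union


def near(point, pts=pts3):
    # Boolean mask selecting the pharmacy geometry nearest to `point`
    nearest = pharma.geometry == nearest_points(point, pts)[1]
    # get_values() was removed in pandas 1.0; .values is the replacement
    return pharma[nearest].geometry.values[0]

# Called once per apartment row, which is what makes this approach slow
appart['Nearest'] = appart.apply(lambda row: near(row.geometry), axis=1)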


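As I said, both methods take too long: after running for an hour, my PC/notebook crashed and the computation failed.

My final question: is there an optimized way to do this faster? Is it even possible? If it is already optimal, I will buy another PC, but what specs should I look for in a machine that can do this kind of computation quickly?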

Answer 1

mgc*_*mgc 10

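I think a ball tree is an appropriate structure for this task.

You can use the scikit-learn implementation; see the code below for an example adapted to your case: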

import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from sklearn.neighbors import BallTree

## Create the two GeoDataFrame to replicate your dataset
appart = gpd.GeoDataFrame([{
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(100000), np.random.rand(100000))
])

pharma = gpd.GeoDataFrame([{
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(3000), np.random.rand(3000))
])

# Create a BallTree 
tree = BallTree(pharma[['x', 'y']].values, leaf_size=2)

# Query the BallTree on each feature from 'appart' to find the distance
# to the nearest 'pharma' and its id
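distances, indices = tree.query(
    appart[['x', 'y']].values,  # The input array for the query
    k=1,  # The number of nearest neighbors
)
# query() returns arrays of shape (n_points, k); flatten them for k=1
appart['distance_nearest'] = distances.ravel()
appart['id_nearest'] = indices.ravel()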


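With this approach you can solve your problem very quickly: on my computer, the example above took less than a second to find the index of the nearest point among 3,000 points for an input dataset of 100,000 points.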
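The example above uses random planar coordinates and the default Euclidean metric. For real latitude/longitude data, one option is scikit-learn's haversine metric, which expects [latitude, longitude] pairs in radians and returns distances in radians on the unit sphere. The following is only a minimal sketch under those assumptions; the sample coordinates, the variable names, and the Earth radius constant are illustrative and do not come from the question.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

rng = np.random.default_rng(0)
# Hypothetical stand-in data: (latitude, longitude) pairs in degrees
apartments_deg = np.column_stack([
    rng.uniform(48.8, 48.9, 1000),
    rng.uniform(2.3, 2.4, 1000),
])
pharmacies_deg = np.column_stack([
    rng.uniform(48.8, 48.9, 50),
    rng.uniform(2.3, 2.4, 50),
])

# The haversine metric expects [latitude, longitude] in radians
tree = BallTree(np.radians(pharmacies_deg), metric="haversine")
dist_rad, idx = tree.query(np.radians(apartments_deg), k=1)

# Distances come back in radians on the unit sphere; scale to kilometres
nearest_km = dist_rad.ravel() * EARTH_RADIUS_KM
nearest_idx = idx.ravel()
```

With real data, the two stand-in arrays would be replaced by the latitude/longitude columns of the apartment and pharmacy DataFrames, in that column order.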

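By default, the query method of BallTree returns both the distance to the nearest neighbor and its id. If you want, you can disable returning the distance by setting the return_distance parameter to False. If you only care about the distance, you can keep just that value: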

# query() returns (distances, indices); keep the distances and flatten to 1D
appart['distance_nearest'] = tree.query(appart[['x', 'y']].values, k=1)[0].ravel()

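For the opposite case, where only the id of the nearest pharmacy is needed, the return_distance=False option mentioned above can be sketched as follows. The random coordinate arrays and variable names here are hypothetical placeholders, not the asker's data.

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
pharma_xy = rng.random((3000, 2))  # hypothetical pharmacy coordinates
appart_xy = rng.random((1000, 2))  # hypothetical apartment coordinates

tree = BallTree(pharma_xy)

# With return_distance=False, query returns only the index array, shape (n, k)
idx = tree.query(appart_xy, k=1, return_distance=False)
id_nearest = idx.ravel()

# The indices are row positions in the array used to build the tree,
# so they can be used to look up the matching pharmacy coordinates
nearest_xy = pharma_xy[id_nearest]
```

Because the ids are positional, they map back to DataFrame rows with .iloc rather than with index labels.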
