Finding the nearest point to each point in a pandas DataFrame

Shu*_*m R 4 python numpy dataframe pandas

I have a DataFrame:

  routeId  latitude_value  longitude_value
  r1       28.210216        22.813209
  r2       28.216103        22.496735
  r3       28.161786        22.842318
  r4       28.093110        22.807081
  r5       28.220370        22.503500
  r6       28.220370        22.503500
  r7       28.220370        22.503500

From this I want to generate a DataFrame df2 like this:

routeId    nearest
  r1         r3         (for example)
  r2       ...    similarly for all the routes.

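The logic I am trying to implement is:

For each route, compute the Euclidean distance to every other route, iterating over routeId, and keep the closest one.

There is a function for computing the Euclidean distance: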

dist = math.hypot(x2 - x1, y2 - y1)

But I am confused about how to build a function that I pass the DataFrame to, or whether to use .apply():

def  get_nearest_route():
    .....
    return df2

Flo*_*oor 7

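We can build the pairwise distance matrix with scipy.spatial.distance.cdist (or with nested for loops), then replace each row's minimum with the route name to find the closest route, i.e.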

import numpy as np
import pandas as pd
import scipy.spatial.distance

coords = df[['latitude_value', 'longitude_value']]
mat = scipy.spatial.distance.cdist(coords, coords, metric='euclidean')

# If you don't want SciPy, you can build the same matrix in plain Python:
# import math
# mat = []
# for i, j in zip(df['latitude_value'], df['longitude_value']):
#     k = []
#     for l, m in zip(df['latitude_value'], df['longitude_value']):
#         k.append(math.hypot(i - l, j - m))
#     mat.append(k)
# mat = np.array(mat)

new_df = pd.DataFrame(mat, index=df['routeId'], columns=df['routeId'])

Output of new_df:

routeId        r1        r2        r3        r4        r5        r6        r7
routeId                                                                      
r1       0.000000  0.316529  0.056505  0.117266  0.309875  0.309875  0.309875
r2       0.316529  0.000000  0.349826  0.333829  0.007998  0.007998  0.007998
r3       0.056505  0.349826  0.000000  0.077188  0.343845  0.343845  0.343845
r4       0.117266  0.333829  0.077188  0.000000  0.329176  0.329176  0.329176
r5       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000
r6       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000
r7       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000    

# Replace each row's minimum non-zero distance with the column name and every other cell with False.
# new_df[new_df != 0].min() is the per-column minimum ignoring zeros; the matrix is symmetric,
# so comparing along axis=0 gives a mask of each row's smallest non-zero distance.
closest = np.where(new_df.eq(new_df[new_df != 0].min(), axis=0), new_df.columns, False)

# Drop the False entries and collect the remaining column names into a list.
df['close'] = [i[i.astype(bool)].tolist() for i in closest]


 routeId  latitude_value  longitude_value         close
0      r1       28.210216        22.813209          [r3]
1      r2       28.216103        22.496735  [r5, r6, r7]
2      r3       28.161786        22.842318          [r1]
3      r4       28.093110        22.807081          [r3]
4      r5       28.220370        22.503500          [r2]
5      r6       28.220370        22.503500          [r2]
6      r7       28.220370        22.503500          [r2] 

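If you don't want to ignore zeros (so that routes with identical coordinates count as each other's nearest), then: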

# Copy the distances into a NumPy array so that new_df itself is not modified
arr = new_df.to_numpy(copy=True)
# A point must not be its own nearest neighbour, so set the diagonal to NaN
arr[np.diag_indices_from(arr)] = np.nan

# Replace each row's non-NaN minimum with the column name and everything else with False
new_close = np.where(arr == np.nanmin(arr, axis=1)[:, None], new_df.columns, False)

# Collect the column names, ignoring the False entries
df['close'] = [i[i.astype(bool)].tolist() for i in new_close]

   routeId  latitude_value  longitude_value         close
0      r1       28.210216        22.813209          [r3]
1      r2       28.216103        22.496735  [r5, r6, r7]
2      r3       28.161786        22.842318          [r1]
3      r4       28.093110        22.807081          [r3]
4      r5       28.220370        22.503500      [r6, r7]
5      r6       28.220370        22.503500      [r5, r7]
6      r7       28.220370        22.503500      [r5, r6]
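To tie this back to the get_nearest_route() function asked for in the question, the steps above could be wrapped up roughly as follows. This is only a sketch: the `nearest` column name comes from the question's desired df2, np.fill_diagonal with np.inf is swapped in for the NaN-on-the-diagonal step, and the sample data is retyped from the question.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist


def get_nearest_route(df):
    """Return one row per route with the id(s) of its nearest other route(s)."""
    coords = df[["latitude_value", "longitude_value"]].to_numpy(dtype=float)
    ids = df["routeId"].to_numpy()

    # Full pairwise Euclidean distance matrix, shape (n, n).
    dist = cdist(coords, coords, metric="euclidean")
    # A route must not be reported as its own nearest neighbour.
    np.fill_diagonal(dist, np.inf)

    # Row-wise minimum, then every column that attains it (ties are kept).
    row_min = dist.min(axis=1, keepdims=True)
    nearest = [ids[mask].tolist() for mask in (dist == row_min)]
    return pd.DataFrame({"routeId": ids, "nearest": nearest})


# Sample data copied from the question.
df = pd.DataFrame({
    "routeId": ["r1", "r2", "r3", "r4", "r5", "r6", "r7"],
    "latitude_value": [28.210216, 28.216103, 28.161786, 28.093110,
                       28.220370, 28.220370, 28.220370],
    "longitude_value": [22.813209, 22.496735, 22.842318, 22.807081,
                        22.503500, 22.503500, 22.503500],
})

df2 = get_nearest_route(df)
print(df2)
```

The function assumes at least two rows. Because it builds the full n x n matrix, memory grows quadratically with the number of routes.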


jo9*_*o9k 6

I suggest using the pdist function from scipy.spatial.distance.

matrix = scipy.spatial.distance.pdist(df[['latitude_value', 'longitude_value']], metric='euclidean')

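This computes all pairwise distances and returns them as a condensed distance matrix: a 1-D array of length n(n-1)/2 for n points, rather than an n x n matrix.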

You can then use squareform to get the square pairwise distance matrix:

matrix = scipy.spatial.distance.squareform(matrix)

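Then, for each row matrix[i], find the index j of the smallest value other than the diagonal entry matrix[i][i] (which is always 0). matrix[i][j] is then the shortest distance, so the nearest point to the i-th point is the j-th point.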
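A minimal sketch of that last step, assuming the diagonal is masked with np.inf before taking the row-wise argmin (the masking step and the `nearest` column name are my additions, not part of the answer above). Note that argmin returns only the first minimum in each row, so ties such as r5/r6/r7 are not all reported.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Sample data copied from the question.
df = pd.DataFrame({
    "routeId": ["r1", "r2", "r3", "r4", "r5", "r6", "r7"],
    "latitude_value": [28.210216, 28.216103, 28.161786, 28.093110,
                       28.220370, 28.220370, 28.220370],
    "longitude_value": [22.813209, 22.496735, 22.842318, 22.807081,
                        22.503500, 22.503500, 22.503500],
})

coords = df[["latitude_value", "longitude_value"]].to_numpy(dtype=float)

# Condensed form: one entry per unordered pair, length n * (n - 1) / 2.
condensed = pdist(coords, metric="euclidean")
# Square form: shape (n, n) with zeros on the diagonal.
matrix = squareform(condensed)

# Exclude each point from its own search.
np.fill_diagonal(matrix, np.inf)
# Position of the first minimum in every row.
nearest_pos = matrix.argmin(axis=1)

ids = df["routeId"].to_numpy()
df2 = pd.DataFrame({"routeId": ids, "nearest": ids[nearest_pos]})
print(df2)
```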