Merging dataframes based on coordinates

use*_*224 5 python python-3.x pandas

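I have two dataframes, both of which contain longitude and latitude columns. I want to merge them on those columns. First I tried a plain merge, which produced an empty result. On investigation, I found that the two dataframes do not share any identical longitude and latitude values. I then tried another function, merge_asof, with direction set to nearest, meaning rows should be merged on the closest longitude and latitude values. My code looks like this: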

    pd.merge_asof(df1, df2, left_on=['x', 'y'], right_on=['Long', 'Lat'], direction='nearest')

When I run the code above, it raises an error.

When I searched Google for the message, I found (https://github.com/pandas-dev/pandas/issues/20369) that merge_asof cannot use a combination of columns as the key. How can I solve this?

Pie*_*e D 2

This is an old question, but since it was edited this morning, hopefully this answer will still be useful to someone.

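What comes to mind is a nearest-neighbor search that matches rows that are closest to each other and within a certain tolerance. Of course, distances in latitude and longitude are a bit tricky: one degree of longitude spans a much greater distance at the equator than near the poles. Also, crossing from longitude +180 to -180 can be a very small actual Euclidean distance despite the large difference in longitude values.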
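To make the two pitfalls above concrete, here is a minimal sketch using only the standard library. It assumes a spherical Earth of radius 6371 km and uses the haversine formula; the helper name great_circle_km is made up for this illustration and is not part of the approach that follows.

```python
import math

R_EARTH_KM = 6371.0  # mean Earth radius, ignoring flattening (assumption)


def great_circle_km(lat1, lon1, lat2, lon2, radius=R_EARTH_KM):
    """Haversine great-circle distance in km; inputs are in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


# One degree of longitude at the equator vs. at 80 degrees north
one_deg_equator = great_circle_km(0, 0, 0, 1)
one_deg_lat80 = great_circle_km(80, 0, 80, 1)

# Longitudes differ by 359.8 degrees, yet the points are neighbors
across_antimeridian = great_circle_km(0, 179.9, 0, -179.9)

print(one_deg_equator, one_deg_lat80, across_antimeridian)
```

Under these assumptions, the same one-degree step in longitude is roughly 111 km at the equator but only about 19 km at 80 degrees north, and the two points on either side of the antimeridian are only about 22 km apart. Comparing raw degree values would get both cases wrong.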

For these reasons, we can do the following:

  1. Compute 3D coordinates for each lat, lon position.
  2. Use scipy.spatial.KDTree to find neighbors efficiently.

But first, let's write a function that generates a minimal reproducible example:

    import numpy as np
    import pandas as pd


    def gen(n, dlatlon=(1, 1)):
        # latitude goes from -90 to 90 degrees, longitude from -180 to 180
        # here we don't care about exceeding those limits -- latlon2xyz
        # (defined below) will take care of it
        upper = np.array([90, 180])
        latlon0 = np.random.uniform(-1, 1, (n, 2)) * upper
        latlon1 = latlon0 + np.random.normal(size=(n, 2)) * dlatlon
        i = np.arange(n)
        df0 = pd.DataFrame(latlon0, columns=['lat', 'lon']).assign(a=i)
        df1 = pd.DataFrame(latlon1, columns=['lat', 'lon']).assign(b=i)
        return df0, df1

Example:

    np.random.seed(0)  # reproducible example
    df0, df1 = gen(6)

    >>> df0
             lat         lon  a
    0   8.786431   77.468172  0
    1  18.497408   16.157946  1
    2 -13.742136   52.521881  2
    3 -11.234302  141.038280  3
    4  83.459297  -41.961053  4
    5  52.510507   10.402171  5

    >>> df1
             lat         lon  b
    0   9.547468   77.589847  0
    1  18.941271   16.491620  1
    2 -12.248057   52.316722  2
    3 -10.921234  140.184185  3
    4  80.906307  -41.307435  4
    5  53.374943    9.660006  5

Now, let's write a conversion to 3D (similar to another answer of mine):

    R_earth = 6371  # in km, ignoring flattening


    def latlon2xyz(latlon, R=R_earth):
        latlon = latlon.to_numpy() if isinstance(latlon, pd.DataFrame) else latlon
        lat, lon = np.deg2rad(latlon).T

        # conversion (latitude, longitude, altitude) to (x, y, z)
        # see https://stackoverflow.com/a/10788250/758174
        # we use alt and f = 0 -> F = 1, S = C
        # (with f = 0, C is identically 1; it is kept to mirror the general formula)
        coslat = np.cos(lat)
        sinlat = np.sin(lat)
        C      = 1 / np.sqrt(coslat**2 + sinlat**2)

        x = C * coslat * np.cos(lon)
        y = C * coslat * np.sin(lon)
        z = C * sinlat

        return R * np.c_[x, y, z]

    def append3d(df):
        return df.assign(**dict(zip('xyz', latlon2xyz(df[['lat', 'lon']]).T)))

On our sample data above:

    >>> append3d(df0)
             lat         lon  a            x            y            z
    0   8.786431   77.468172  0  1366.168859  6146.229829   973.181661
    1  18.497408   16.157946  1  5803.197072  1681.366619  2021.274608
    2 -13.742136   52.521881  2  3765.522822  4911.207155 -1513.447441
    3 -11.234302  141.038280  3 -4858.951861  3929.329400 -1241.208394
    4  83.459297  -41.961053  4   539.640845  -485.230993  6329.532340
    5  52.510507   10.402171  5  3813.764098   700.106078  5055.165267

    >>> append3d(df1)
             lat         lon  b            x            y            z
    0   9.547468   77.589847  0  1350.216201  6135.950784  1056.723796
    1  18.941271   16.491620  1  5778.118840  1710.637588  2068.019033
    2 -12.248057   52.316722  2  3805.920396  4927.257031 -1351.572821
    3 -10.921234  140.184185  3 -4804.978175  4005.604427 -1207.045530
    4  80.906307  -41.307435  4   756.386071  -664.675318  6290.924243
    5  53.374943    9.660006  5  3746.893184   637.776657  5113.088440

Then we can use KDTree to find the nearest neighbors, restricting them to a given maximum Euclidean distance r that corresponds to a given deviation in latitude and longitude (at the equator):

    from scipy.spatial import KDTree

    def latlon_merge(df0, df1, r=100):
        # r: maximum distance, in km, between two points for them
        # to be considered close neighbors
        kd = KDTree(latlon2xyz(df0[['lat', 'lon']]))
        dist, idx = kd.query(latlon2xyz(df1[['lat', 'lon']]), distance_upper_bound=r)
        # rows of df1 with no neighbor within r get idx == len(df0), which
        # matches no row of df0, so the outer merge keeps them unmatched
        key = '_ix_'
        assert key not in df0.columns
        assert key not in df1.columns
        df = pd.merge(
            df0.assign(**{key: np.arange(df0.shape[0])}),
            df1.assign(**{key: idx}),
            how='outer', on=key)
        return df.drop(key, axis=1)
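The outer merge in latlon_merge relies on how KDTree.query reports points that have no neighbor within distance_upper_bound. The sketch below illustrates that behavior on made-up 2D toy points (they are not from the example data): a missing neighbor comes back with an infinite distance and an index equal to the number of points in the tree.

```python
import numpy as np
from scipy.spatial import KDTree

# Toy tree with two points (hypothetical data, for illustration only)
tree = KDTree(np.array([[0.0, 0.0], [10.0, 0.0]]))

# First query point is 0.1 away from tree point 0;
# second query point has no tree point within distance 1.0
queries = np.array([[0.1, 0.0], [5.0, 5.0]])
dist, idx = tree.query(queries, distance_upper_bound=1.0)

print(dist)  # second entry is inf
print(idx)   # second entry equals tree.n, i.e. an index past the last point
```

Because that out-of-range index never equals any row position of df0, such rows of df1 simply find no partner in the merge.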

Note that for two points at the equator that differ by 2 degrees in both latitude and longitude, r is about 314.5 km:

    r = np.linalg.norm(np.subtract(*latlon2xyz(np.array(((-1, -1), (1, 1))))))

    >>> r
    314.4668312040188
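Since r is a straight-line (chord) distance through the Earth rather than a distance along the surface, it can help to convert between the two when choosing a tolerance. The following is a small sketch assuming the same spherical Earth of radius 6371 km; the helper names chord_from_arc and arc_from_chord are invented for this illustration.

```python
import math

R_EARTH_KM = 6371.0  # spherical Earth radius in km (assumption)


def chord_from_arc(arc_km, radius=R_EARTH_KM):
    """Straight-line (chord) distance for a given great-circle distance."""
    return 2 * radius * math.sin(arc_km / (2 * radius))


def arc_from_chord(chord_km, radius=R_EARTH_KM):
    """Great-circle distance for a given straight-line (chord) distance."""
    return 2 * radius * math.asin(chord_km / (2 * radius))


# Value of r to use for a desired surface tolerance
r_for_100km = chord_from_arc(100.0)
r_for_5000km = chord_from_arc(5000.0)

print(r_for_100km, r_for_5000km)
```

For tolerances of a few hundred kilometers the two distances are nearly identical, so passing the surface distance directly as r is a reasonable approximation. The difference only becomes noticeable at continental scales.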

On our example data:

    >>> latlon_merge(df0, df1, r)
           lat_x       lon_x  a      lat_y       lon_y  b
    0   8.786431   77.468172  0   9.547468   77.589847  0
    1  18.497408   16.157946  1  18.941271   16.491620  1
    2 -13.742136   52.521881  2 -12.248057   52.316722  2
    3 -11.234302  141.038280  3 -10.921234  140.184185  3
    4  83.459297  -41.961053  4  80.906307  -41.307435  4
    5  52.510507   10.402171  5  53.374943    9.660006  5

In this case, all the right rows were merged together. But what if we set the tolerance r small enough that some of the rows above don't match?

    >>> latlon_merge(df0, df1, r=150)
           lat_x       lon_x    a      lat_y       lon_y    b
    0   8.786431   77.468172  0.0   9.547468   77.589847  0.0
    1  18.497408   16.157946  1.0  18.941271   16.491620  1.0
    2 -13.742136   52.521881  2.0        NaN         NaN  NaN
    3 -11.234302  141.038280  3.0 -10.921234  140.184185  3.0
    4  83.459297  -41.961053  4.0        NaN         NaN  NaN
    5  52.510507   10.402171  5.0  53.374943    9.660006  5.0
    6        NaN         NaN  NaN -12.248057   52.316722  2.0
    7        NaN         NaN  NaN  80.906307  -41.307435  4.0
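If only the matched pairs are wanted, the unmatched rows can be filtered out afterwards. The sketch below works on a small hand-written frame shaped like the output above (the values are rounded and purely illustrative), and it assumes that the columns a and b contain no missing values in the original inputs, so that a NaN there can only mean "no partner found".

```python
import numpy as np
import pandas as pd

# Hand-written stand-in for an outer-merged result (illustrative values)
merged = pd.DataFrame({
    "lat_x": [8.79, -13.74, np.nan],
    "lon_x": [77.47, 52.52, np.nan],
    "a": [0.0, 2.0, np.nan],
    "lat_y": [9.55, np.nan, -12.25],
    "lon_y": [77.59, np.nan, 52.32],
    "b": [0.0, np.nan, 2.0],
})

# Rows where both sides are present
matched = merged.dropna(subset=["a", "b"])

# Rows that exist only on one side
left_only = merged[merged["b"].isna()]
right_only = merged[merged["a"].isna()]

print(matched)
```

An alternative with the same effect would be to change the merge inside latlon_merge from an outer to an inner join, at the cost of no longer seeing which rows failed to match.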

Speed

Let's see how efficient this is at scale.

    df0, df1 = gen(100_000)
    %timeit latlon_merge(df0, df1)
    # 161 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)