use*_*224 5 python python-3.x pandas
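I have two DataFrames, and both contain longitude and latitude columns. I want to merge them on those columns. I first tried a plain merge, which produced an empty result: on inspection, the two DataFrames turned out to have no identical longitude and latitude values. I then tried another function, merge_asof, with direction set to nearest, expecting it to merge on the closest longitude and latitude values. My code looks like this: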
    pd.merge_asof(df1, df2, left_on=['x', 'y'], right_on=['Long', 'Lat'], direction='nearest')
Running the code above raises an error. When I searched for the message on Google, I found (https://github.com/pandas-dev/pandas/issues/20369) that merge_asof cannot use a combination of columns as the key. How can I work around this?
This is an old question, but since it was edited this morning, hopefully this answer is still useful to someone.
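What comes to mind is a nearest-neighbor search that matches each row with the closest row on the other side, within a given tolerance. Distance in latitude and longitude is a bit tricky, though. One degree of longitude spans a much larger distance at the equator than it does near the poles. In addition, two points on either side of the +180/-180 longitude line can be very close in actual Euclidean distance even though their longitude values differ greatly.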
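To make the two effects described above concrete, here is a minimal sketch that measures straight-line (chord) distances on a spherical Earth. The helper names `to_xyz` and `chord_km` and the constant `R_EARTH_KM` are illustrative assumptions, not part of the answer's code, and the sphere itself is a simplification.

```python
import numpy as np

R_EARTH_KM = 6371.0  # assumed mean Earth radius; treats the Earth as a perfect sphere


def to_xyz(lat_deg, lon_deg, radius=R_EARTH_KM):
    """Convert latitude/longitude in degrees to 3D Cartesian coordinates in km."""
    lat, lon = np.deg2rad(lat_deg), np.deg2rad(lon_deg)
    return radius * np.array([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])


def chord_km(p, q):
    """Straight-line (Euclidean) distance in km between two (lat, lon) points."""
    return float(np.linalg.norm(to_xyz(*p) - to_xyz(*q)))


# One degree of longitude covers less ground as we move away from the equator.
at_equator = chord_km((0, 10), (0, 11))
at_80_north = chord_km((80, 10), (80, 11))

# Longitudes +179.5 and -179.5 differ by 359 numerically,
# yet the two points are only one degree apart on the globe.
across_antimeridian = chord_km((0, 179.5), (0, -179.5))

print(at_equator, at_80_north, across_antimeridian)
```

Under these assumptions, one degree of longitude is roughly 111 km at the equator but less than a fifth of that at 80 degrees north, and the pair straddling the +180/-180 line is exactly as close as any other pair one degree apart on the equator.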
For these reasons, we can do the following:
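1. Compute the 3D coordinates of the lat, lon positions.
2. Use scipy.spatial.KDTree to find neighbors efficiently.

But first, let's write a function that generates a minimal reproducible example: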
    import numpy as np
    import pandas as pd


    def gen(n, dlatlon=(1, 1)):
        # Latitude goes from -90 to 90 degrees, longitude from -180 to 180.
        # Here we don't care about exceeding those limits: latlon2xyz will
        # take care of it.
        upper = np.array([90, 180])
        latlon0 = np.random.uniform(-1, 1, (n, 2)) * upper
        # df1 holds the same points as df0, perturbed by Gaussian noise
        latlon1 = latlon0 + np.random.normal(size=(n, 2)) * dlatlon
        i = np.arange(n)
        df0 = pd.DataFrame(latlon0, columns=['lat', 'lon']).assign(a=i)
        df1 = pd.DataFrame(latlon1, columns=['lat', 'lon']).assign(b=i)
        return df0, df1

Example:
    >>> np.random.seed(0)  # reproducible example
    >>> df0, df1 = gen(6)

    >>> df0
             lat         lon  a
    0   8.786431   77.468172  0
    1  18.497408   16.157946  1
    2 -13.742136   52.521881  2
    3 -11.234302  141.038280  3
    4  83.459297  -41.961053  4
    5  52.510507   10.402171  5

    >>> df1
             lat         lon  b
    0   9.547468   77.589847  0
    1  18.941271   16.491620  1
    2 -12.248057   52.316722  2
    3 -10.921234  140.184185  3
    4  80.906307  -41.307435  4
    5  53.374943    9.660006  5

Now, let's write a conversion to 3D (similar to another answer of mine):
    R_earth = 6371  # in km, ignoring flattening


    def latlon2xyz(latlon, R=R_earth):
        # Accept either a DataFrame with (lat, lon) columns or an (n, 2) array
        latlon = latlon.to_numpy() if isinstance(latlon, pd.DataFrame) else latlon
        lat, lon = np.deg2rad(latlon).T

        # conversion (latitude, longitude, altitude) to (x, y, z)
        # see https://stackoverflow.com/a/10788250/758174
        # we use alt and f = 0 -> F = 1, S = C
        coslat = np.cos(lat)
        sinlat = np.sin(lat)
        C = 1 / np.sqrt(coslat**2 + sinlat**2)

        x = C * coslat * np.cos(lon)
        y = C * coslat * np.sin(lon)
        z = C * sinlat

        return R * np.c_[x, y, z]


    def append3d(df):
        # Add x, y, z columns (in km) computed from the lat and lon columns
        return df.assign(**dict(zip('xyz', latlon2xyz(df[['lat', 'lon']]).T)))

On our example data above:
    >>> append3d(df0)
             lat         lon  a            x            y            z
    0   8.786431   77.468172  0  1366.168859  6146.229829   973.181661
    1  18.497408   16.157946  1  5803.197072  1681.366619  2021.274608
    2 -13.742136   52.521881  2  3765.522822  4911.207155 -1513.447441
    3 -11.234302  141.038280  3 -4858.951861  3929.329400 -1241.208394
    4  83.459297  -41.961053  4   539.640845  -485.230993  6329.532340
    5  52.510507   10.402171  5  3813.764098   700.106078  5055.165267

    >>> append3d(df1)
             lat         lon  b            x            y            z
    0   9.547468   77.589847  0  1350.216201  6135.950784  1056.723796
    1  18.941271   16.491620  1  5778.118840  1710.637588  2068.019033
    2 -12.248057   52.316722  2  3805.920396  4927.257031 -1351.572821
    3 -10.921234  140.184185  3 -4804.978175  4005.604427 -1207.045530
    4  80.906307  -41.307435  4   756.386071  -664.675318  6290.924243
    5  53.374943    9.660006  5  3746.893184   637.776657  5113.088440

Then we can use KDTree to find the nearest neighbors, restricting them to within a given maximum Euclidean distance r, which corresponds to a given deviation in latitude and longitude (at the equator):
    from scipy.spatial import KDTree


    def latlon_merge(df0, df1, r=100):
        # r: maximum distance, in km, between two points for them
        # to be considered close neighbors
        kd = KDTree(latlon2xyz(df0[['lat', 'lon']]))
        # For each row of df1, idx is the position of its nearest row in df0;
        # rows with no neighbor within r get idx == len(df0), which matches nothing
        dist, idx = kd.query(latlon2xyz(df1[['lat', 'lon']]), distance_upper_bound=r)
        key = '_ix_'
        assert key not in df0.columns
        assert key not in df1.columns
        df = pd.merge(
            df0.assign(**{key: np.arange(df0.shape[0])}),
            df1.assign(**{key: idx}),
            how='outer', on=key)
        return df.drop(key, axis=1)

Note that, at the equator, a difference of 2 degrees in both latitude and longitude corresponds to an r of about 314.5 km:
    >>> r = np.linalg.norm(np.subtract(*latlon2xyz(np.array(((-1, -1), (1, 1))))))
    >>> r
    314.4668312040188

On our example data:
    >>> latlon_merge(df0, df1, r)
           lat_x       lon_x  a      lat_y       lon_y  b
    0   8.786431   77.468172  0   9.547468   77.589847  0
    1  18.497408   16.157946  1  18.941271   16.491620  1
    2 -13.742136   52.521881  2 -12.248057   52.316722  2
    3 -11.234302  141.038280  3 -10.921234  140.184185  3
    4  83.459297  -41.961053  4  80.906307  -41.307435  4
    5  52.510507   10.402171  5  53.374943    9.660006  5

In this case, all the correct rows were merged together. But what if we set the tolerance r small enough that some of the rows above no longer match?
    >>> latlon_merge(df0, df1, r=150)
           lat_x       lon_x    a      lat_y       lon_y    b
    0   8.786431   77.468172  0.0   9.547468   77.589847  0.0
    1  18.497408   16.157946  1.0  18.941271   16.491620  1.0
    2 -13.742136   52.521881  2.0        NaN         NaN  NaN
    3 -11.234302  141.038280  3.0 -10.921234  140.184185  3.0
    4  83.459297  -41.961053  4.0        NaN         NaN  NaN
    5  52.510507   10.402171  5.0  53.374943    9.660006  5.0
    6        NaN         NaN  NaN -12.248057   52.316722  2.0
    7        NaN         NaN  NaN  80.906307  -41.307435  4.0

Let's see how efficient this is at scale.
    df0, df1 = gen(100_000)
    %timeit latlon_merge(df0, df1)
    # 161 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
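One detail worth keeping in mind when choosing r: latlon_merge compares straight-line (chord) distances through the Earth, not distances along the surface. For small tolerances the two are nearly identical, but if the tolerance you have in mind is a surface distance, a conversion along these lines may help. This is only a sketch on a spherical Earth; `surface_to_chord`, `chord_to_surface` and `R_EARTH_KM` are hypothetical names introduced here, not part of the answer's code.

```python
import numpy as np

R_EARTH_KM = 6371.0  # assumed to equal R_earth above (spherical Earth, in km)


def surface_to_chord(d_km, radius=R_EARTH_KM):
    """Chord length (km) subtending a great-circle arc of length d_km."""
    return 2.0 * radius * np.sin(d_km / (2.0 * radius))


def chord_to_surface(r_km, radius=R_EARTH_KM):
    """Great-circle arc length (km) corresponding to a chord of length r_km."""
    return 2.0 * radius * np.arcsin(r_km / (2.0 * radius))


# A tolerance of 100 km along the surface is a slightly shorter chord.
r_100 = surface_to_chord(100.0)

# Antipodal points: half the circumference along the surface,
# one full diameter in a straight line.
r_antipodal = surface_to_chord(np.pi * R_EARTH_KM)
```

The result of `surface_to_chord` could then be passed as the r argument of latlon_merge, for example `latlon_merge(df0, df1, r=surface_to_chord(100.0))`.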