向量化以计算许多距离

mac*_*h81 5 python numpy vectorization pandas

我是numpy / pandas和矢量化计算的新手。我正在执行数据任务,其中有两个数据集。数据集1包含一个经度和纬度的位置列表以及一个变量A。数据集2还包含一个经度和纬度的位置列表。对于数据集1中的每个位置,我想计算其到数据集2中所有位置的距离,但我只想获得数据集2中小于变量A的值的位置计数。数据集非常大,因此我需要使用向量化运算来加快计算速度。

例如,我的数据集1可能如下所示:

id lon    lat   varA
1  20.11 19.88  100
2  20.87 18.65  90
3  18.99 20.75  120
Run Code Online (Sandbox Code Playgroud)

我的数据集2可能如下所示:

placeid lon lat 
a       18.75 20.77
b       19.77 22.56
c       20.86 23.76
d       17.55 20.74 
Run Code Online (Sandbox Code Playgroud)

然后对于数据集1中的id == 1,我想计算其到数据集2中所有四个点(a,c,c,d)的距离,并且我想知道有多少距离小于对应的距离varA的值。例如,计算出的四个距离为90、70、120、110,而varA为100。则该值应为2。

我已经有一个向量化函数来计算两对坐标之间的距离。假设函数(haversine(x,y))正确实现,我有以下代码。

dataset2['count'] = dataset1.apply(lambda x: 
haversine(x['lon'],x['lat'],dataset2['lon'], dataset2['lat']).shape[0], axis 
= 1)
Run Code Online (Sandbox Code Playgroud)

但是,这给出的是行的总数,但不能满足我的要求。

谁能指出我如何使代码正常工作?

Alz*_*Alz 3

If you can project the coordinates to a local projection (e.g. UTM), which is pretty straight forward with pyproj and generally more favorable than lon/lat for measurement, then there is a much much MUCH faster way using scipy.spatial. Neither of df['something'] = df.apply(...) and np.vectorize() are not truly vectorized, under the hood, they use looping.

ds1
    id  lon lat varA
0   1   20.11   19.88   100
1   2   20.87   18.65   90
2   3   18.99   20.75   120

ds2
    placeid lon lat
0   a   18.75   20.77
1   b   19.77   22.56
2   c   20.86   23.76
3   d   17.55   20.74


from scipy.spatial import distance

# gey coordinates of each set of points as numpy array
coords_a = ds1.values[:,(1,2)]
coords_b = ds2.values[:, (1,2)]
coords_a
#out: array([[ 20.11,  19.88],
#       [ 20.87,  18.65],
#       [ 18.99,  20.75]])

distances = distance.cdist(coords_a, coords_b)
#out: array([[ 1.62533074,  2.70148108,  3.95182236,  2.70059253],
#       [ 2.99813275,  4.06178532,  5.11000978,  3.92307278],
#       [ 0.24083189,  1.97091349,  3.54358575,  1.44003472]])
Run Code Online (Sandbox Code Playgroud)

distances is in fact distance between every pair of points. coords_a.shape is (3, 2) and coords_b.shape is (4, 2), so the result is (3,4). The default metric for np.distance is eculidean, but there are other metrics as well. For the sake of this example, let's assume vara is:

vara = np.array([2,4.5,2])
Run Code Online (Sandbox Code Playgroud)

(instead of 100 90 120). We need to identify which value in distances in row one is smaller than 2, in row two smaller that 4.5,..., one way to solve this problem is subtracting each value in vara from corresponding row (note that we must resize vara):

vara.resize(3,1)
res = res - vara
#out: array([[-0.37466926,  0.70148108,  1.95182236,  0.70059253],
#       [-1.50186725, -0.43821468,  0.61000978, -0.57692722],
#       [-1.75916811, -0.02908651,  1.54358575, -0.55996528]])
Run Code Online (Sandbox Code Playgroud)

then setting positive values to zero and making negative values positive will give us the final array:

res[res>0] = 0
res = np.absolute(res)
#out: array([[ 0.37466926,  0.        ,  0.        ,  0.        ],
#            [ 1.50186725,  0.43821468,  0.        ,  0.57692722],
#            [ 1.75916811,  0.02908651,  0.        ,  0.55996528]])
Run Code Online (Sandbox Code Playgroud)

Now, to sum over each row:

sum_ = res.sum(axis=1)
#out:  array([ 0.37466926,  2.51700915,  2.34821989])
Run Code Online (Sandbox Code Playgroud)

and to count the items in each row:

count = np.count_nonzero(res, axis=1)
#out: array([1, 3, 3])
Run Code Online (Sandbox Code Playgroud)

This is a fully vectorized (custom) solution which you can tweak to your liking and should accommodate any level of complexity. yet another solution is cKDTree. the code is from documentation. it should be fairly easy to adopt it to your problem, but in case you need assistance don't hesitate to ask.

x, y = np.mgrid[0:4, 0:4]
points = zip(x.ravel(), y.ravel())
tree = spatial.cKDTree(points)
tree.query_ball_point([2, 0], 1)
[4, 8, 9, 12]
Run Code Online (Sandbox Code Playgroud)

query_ball_point() finds all points within distance r of point(s) x, and it is amazingly fast.

one final note: don't use these algorithms with lon/lat input, particularly if your area of interest is far from equator, because the error can get huge.

UPDATE:

To project your coordinates, you need to convert from WGS84 (lon/lat) to appropriate UTM. To find out which utm zone you should project to use epsg.io.

lon = -122.67598
lat = 45.52168
WGS84 = "+init=EPSG:4326"
EPSG3740 = "+init=EPSG:3740"
Proj_to_EPSG3740 = pyproj.Proj(EPSG3740)

Proj_to_EPSG3740(lon,lat)
# out: (525304.9265963673, 5040956.147893889)
Run Code Online (Sandbox Code Playgroud)

You can do df.apply() and use Proj_to_... to project df.