asi*_*ban 5 python parallel-processing numpy dataframe pandas
我的数据框"A"看起来像这样:
type latw lngs late lngn
0 1000 45.457966 9.174864 45.458030 9.174907
1 1000 45.457966 9.174864 45.458030 9.174907
2 1000 45.458030 9.174864 45.458094 9.174907
3 1000 45.458094 9.174864 45.458157 9.174907
4 1000 45.458157 9.174864 45.458221 9.174907
5 1000 45.458221 9.174864 45.458285 9.174907
6 1000 45.458285 9.174864 45.458349 9.174907
7 1000 45.458349 9.174864 45.458413 9.174907
8 1000 45.458413 9.174864 45.458477 9.174907
9 1000 45.458477 9.174864 45.458540 9.174907
10 1000 45.458540 9.174864 45.458604 9.174907
11 1000 45.458604 9.174864 45.458668 9.174907
12 1000 45.458668 9.174864 45.458732 9.174907
13 1000 45.458732 9.174864 45.458796 9.174907
14 1000 45.458796 9.174864 45.458860 9.174907
15 1000 45.458860 9.174864 45.458923 9.174907
16 1000 45.458923 9.174864 45.458987 9.174907
17 1000 45.458987 9.174864 45.459051 9.174907
18 1000 45.459051 9.174864 45.459115 9.174907
19 1000 45.459115 9.174864 45.459179 9.174907
20 1000 45.459179 9.174864 45.459243 9.174907
21 1000 45.459243 9.174864 45.459306 9.174907
22 1000 45.459306 9.174864 45.459370 9.174907
23 1000 45.459370 9.174864 45.459434 9.174907
24 1000 45.459434 9.174864 45.459498 9.174907
25 1000 45.459498 9.174864 45.459562 9.174907
26 1000 45.459562 9.174864 45.459626 9.174907
27 1000 45.459626 9.174864 45.459689 9.174907
28 1000 45.459689 9.174864 45.459753 9.174907
29 1000 45.459753 9.174864 45.459817 9.174907
... ... ... ... ... ...
970 1000 45.460583 9.175545 45.460647 9.175587
971 1000 45.460647 9.175545 45.460711 9.175587
972 1000 45.460711 9.175545 45.460775 9.175587
973 1000 45.460775 9.175545 45.460838 9.175587
974 1000 45.460838 9.175545 45.460902 9.175587
975 1000 45.460902 9.175545 45.460966 9.175587
976 1000 45.460966 9.175545 45.461030 9.175587
977 1000 45.461030 9.175545 45.461094 9.175587
978 1000 45.461094 9.175545 45.461157 9.175587
979 1000 45.461157 9.175545 45.461221 9.175587
980 1000 45.461221 9.175545 45.461285 9.175587
981 1000 45.461285 9.175545 45.461349 9.175587
982 1000 45.461349 9.175545 45.461413 9.175587
983 1000 45.461413 9.175545 45.461477 9.175587
984 1000 45.461477 9.175545 45.461540 9.175587
985 1000 45.461540 9.175545 45.461604 9.175587
986 1000 45.461604 9.175545 45.461668 9.175587
987 1000 45.457966 9.175587 45.458030 9.175630
988 1000 45.458030 9.175587 45.458094 9.175630
989 1000 45.458094 9.175587 45.458157 9.175630
990 1000 45.458157 9.175587 45.458221 9.175630
991 1000 45.458221 9.175587 45.458285 9.175630
992 1000 45.458285 9.175587 45.458349 9.175630
993 1000 45.458349 9.175587 45.458413 9.175630
994 1000 45.458413 9.175587 45.458477 9.175630
995 1000 45.458477 9.175587 45.458540 9.175630
996 1000 45.458540 9.175587 45.458604 9.175630
997 1000 45.458604 9.175587 45.458668 9.175630
998 1000 45.458668 9.175587 45.458732 9.175630
999 1000 45.458732 9.175587 45.458796 9.175630
Run Code Online (Sandbox Code Playgroud)
它有22,000,000行×5列,有数据框"B",如下所示:
type Lat Lng
0 0 45.465739 9.180830
1 2 45.463950 9.187113
2 1 45.468015 9.180648
3 1 45.462209 9.187447
4 0 45.459578 9.184007
5 1 45.459822 9.187034
6 2 45.454988 9.180310
7 2 45.459818 9.189377
8 0 45.462200 9.187440
9 0 45.467160 9.180100
10 2 45.459407 9.183300
11 2 45.457699 9.187434
12 1 45.455319 9.186697
13 0 45.461138 9.191943
14 2 45.456397 9.189028
15 0 45.457062 9.185878
16 1 45.461980 9.187024
17 1 45.464319 9.183142
18 2 45.464227 9.187065
19 1 45.460886 9.185216
Run Code Online (Sandbox Code Playgroud)
它有2,000,000行×3列.我想用"B"替换数据框"A"的类型值,其中:
A[latw]<B[lat]<A[late] and A[lngs]<B[lng]<B[lngn]
Run Code Online (Sandbox Code Playgroud)
我想检查一个位于B的位置属于A中的哪一个矩形.
PS我正在寻找python中最快的方法,比如使用并行处理.
这个答案相当复杂。以下是高层次要点
np.searchsorted西纬度一起使用。对东纬度重复此操作,但参数为。这告诉我每个西/东纬度落在 中定义的频谱中的位置。 是索引值。因此,如果基于东部的结果大于基于西部的结果,则意味着西部和东部之间存在纬度。
ABside='right'Bnp.searchsortedBB.Lat中间有多少个。我扩展数组来跟踪所有可能的匹配。numpy结构化数组查找二维数组的交集
A和来自 的匹配序数位置B。
B,我选择第一行。使用A并由B@MaxU提供
p = process(A, B)
print(p)
[[3 0]
[5 1]
[9 2]]
Run Code Online (Sandbox Code Playgroud)
显示结果并进行比较。我会把它们并排放置
pd.concat([
A.iloc[p[:, 0]].reset_index(drop=True),
B.iloc[p[:, 1]].reset_index(drop=True)
], axis=1, keys=['A', 'B'])
Run Code Online (Sandbox Code Playgroud)
但是,为了交换type值,我在process函数中放置了一个标志以重新分配。赶紧跑process(A, B, reassign_type=True)
计时
函数
我定义了返回相同内容的函数,并使其成为同类比较
def pir(A, B):
return A.values[process(A, B)[:, 0], 0]
def maxu(A, B):
return B.apply(
lambda x: A.loc[
(A.latw < x.Lat) &
(x.Lat < A.late) &
(A.lngs < x.Lng) &
(x.Lng < A.lngn), 'type'
].head(1),
axis=1
).values.diagonal()
Run Code Online (Sandbox Code Playgroud)
更彻底的测试
from timeit import timeit
from itertools import product
rng = np.arange(20, 51, 5)
idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A'])
functions = pd.Index(['pir', 'maxu'], name='method')
results = pd.DataFrame(index=idx, columns=functions)
for bi, ai in idx:
n = ai
ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]])
A = pd.DataFrame(
np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]),
columns=['type', 'latw', 'lngs', 'late', 'lngn']
)
m = int(bi ** .5)
prng = np.arange(m) / m * 10 + .5
B = pd.DataFrame(
np.array(list(product(prng, prng))) + np.array([[40, 30]]),
columns=['Lat', "Lng"])
B.insert(0, 'type', np.random.randint(3, size=len(B)))
for j in functions:
results.set_value(
(bi, ai), j,
timeit(
'{}(A, B)'.format(j),
'from __main__ import {}, A, B'.format(j),
number=10
)
)
results.plot(rot=45)
Run Code Online (Sandbox Code Playgroud)
from collections import namedtuple
TwoSided = namedtuple('TwoSided', ['match', 'idx', 'found'])
def two_sided_search(left, right, a):
# sort `a` and save order for later
argsort = a.argsort()
asorted = a[argsort]
# where does `left` fall in `asorted`
s_left = asorted.searchsorted(left)
# where does `right` fall in `asorted`
s_right = asorted.searchsorted(right, side='right')
rng = np.arange(left.size)
# a match happens when where we found `left`, `right` was found later
match_idx = rng[s_left < s_right]
# ignore positions with no match
left_pos = s_left[match_idx]
right_pos = s_right[match_idx]
# distance from where left was found to where right was found
# this represents how many items in a are wrapped by left and right
diff_pos = right_pos - left_pos
# build an index on `match_idx` paired with all positions in
# `asorted`.
d2 = left_pos - np.append(0, diff_pos[:-1].cumsum())
p1 = np.arange(len(left_pos)).repeat(diff_pos)
p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos)
# returns
# `match_idx` which is an integer slice of either `left` or `right`
# for each element in `match_idx` there are one or more elements in
# `a` that were surrounded by `left` and `right`
# `p1` are the positions in `match_idx` that are paired with `argsort[p2]`
# for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]`
# this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]`
# and `left[match_idx[1]] < a[5] < right[match_idx[1]]`
# and `left[match_idx[2]] < a[12] < right[match_idx[2]]`
# and `left[match_idx[2]] < a[5] < right[match_idx[2]]`
return TwoSided(match_idx, p1, argsort[p2])
def intersectNd(a, b):
# taken from /sf/answers/582218241/
# returns the interesection of 2 2-D arrays
# the strategy is to convert to a structured array then
# use np.intersect1d then convert back to unstructured
ncols = a.shape[1]
dtype = dict(
names=['f{}'.format(i) for i in range(ncols)],
formats=ncols * [a.dtype]
)
return np.intersect1d(
a.view(dtype), b.view(dtype)
).view(a.dtype).reshape(-1, ncols)
def where_in(b1, b2):
# return a mask of lenght len(b2) where True
# when element of b2 is in b1
bsort = b1.argsort()
bsrtd = b1[bsort]
bsrch = bsrtd.searchsorted(b2) % bsrtd.size
return bsrtd[bsrch] == b2
def process(A, B, reassign_type=False):
lats = two_sided_search(
A.latw.values,
A.late.values,
B.Lat.values,
)
# subset A & B
# no point looking for longitude matches
# in rows without a latitude match
lngs = A.lngs.values[lats.match]
lngn = A.lngn.values[lats.match]
u, i = np.unique(lats.found, return_index=1)
lng = B.Lng.values[u]
lons = two_sided_search(
lngs, lngn, lng
)
# `lats.match` is where we found successful latitudes
# `lons.match` is where we found successful longitudes
# since `lons.match` was based on `lats.match` we can slice
# to get the original positions of `A` that satisfy both
# the latw < blat < late & lngs < blng < lngn
match = lats.match[lons.match]
# however, for every row in A that might be a match for B,
# there may be multilple B's
#
# We start addressing this by finding
# where in lats.idx do we have matches
# this gives me a boolean array of lats.match
# index values repeated for each row in B
# that satisfies our conditions
wi = where_in(lons.match, lats.idx)
# a (? x 2) array representing all combinations of
# rows in A that matched the
lon_pos = np.hstack([
lats.match[lons.match[lons.idx], None],
u[lons.found, None]
])
lat_pos = np.hstack([
lats.match[lats.idx[wi], None],
lats.found[wi, None]
])
x = intersectNd(lat_pos, lon_pos)
p, pi = np.unique(x[:, 0], return_index=1)
if reassign_type:
A.iloc[x[pi, 0], A.columns.get_loc('type')] = \
B.iloc[x[pi, 1], B.columns.get_loc('type')].values
else:
return x[pi]
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
278 次 |
| 最近记录: |