快速(矢量化)方法,以从第二DF中找到属于同等大小的矩形(由两个点给出)的一个DF中的点

asi*_*ban 5 python parallel-processing numpy dataframe pandas

我的数据框"A"看起来像这样:

type    latw    lngs    late    lngn
0   1000    45.457966   9.174864    45.458030   9.174907
1   1000    45.457966   9.174864    45.458030   9.174907
2   1000    45.458030   9.174864    45.458094   9.174907
3   1000    45.458094   9.174864    45.458157   9.174907
4   1000    45.458157   9.174864    45.458221   9.174907
5   1000    45.458221   9.174864    45.458285   9.174907
6   1000    45.458285   9.174864    45.458349   9.174907
7   1000    45.458349   9.174864    45.458413   9.174907
8   1000    45.458413   9.174864    45.458477   9.174907
9   1000    45.458477   9.174864    45.458540   9.174907
10  1000    45.458540   9.174864    45.458604   9.174907
11  1000    45.458604   9.174864    45.458668   9.174907
12  1000    45.458668   9.174864    45.458732   9.174907
13  1000    45.458732   9.174864    45.458796   9.174907
14  1000    45.458796   9.174864    45.458860   9.174907
15  1000    45.458860   9.174864    45.458923   9.174907
16  1000    45.458923   9.174864    45.458987   9.174907
17  1000    45.458987   9.174864    45.459051   9.174907
18  1000    45.459051   9.174864    45.459115   9.174907
19  1000    45.459115   9.174864    45.459179   9.174907
20  1000    45.459179   9.174864    45.459243   9.174907
21  1000    45.459243   9.174864    45.459306   9.174907
22  1000    45.459306   9.174864    45.459370   9.174907
23  1000    45.459370   9.174864    45.459434   9.174907
24  1000    45.459434   9.174864    45.459498   9.174907
25  1000    45.459498   9.174864    45.459562   9.174907
26  1000    45.459562   9.174864    45.459626   9.174907
27  1000    45.459626   9.174864    45.459689   9.174907
28  1000    45.459689   9.174864    45.459753   9.174907
29  1000    45.459753   9.174864    45.459817   9.174907
... ... ... ... ... ...
970 1000    45.460583   9.175545    45.460647   9.175587
971 1000    45.460647   9.175545    45.460711   9.175587
972 1000    45.460711   9.175545    45.460775   9.175587
973 1000    45.460775   9.175545    45.460838   9.175587
974 1000    45.460838   9.175545    45.460902   9.175587
975 1000    45.460902   9.175545    45.460966   9.175587
976 1000    45.460966   9.175545    45.461030   9.175587
977 1000    45.461030   9.175545    45.461094   9.175587
978 1000    45.461094   9.175545    45.461157   9.175587
979 1000    45.461157   9.175545    45.461221   9.175587
980 1000    45.461221   9.175545    45.461285   9.175587
981 1000    45.461285   9.175545    45.461349   9.175587
982 1000    45.461349   9.175545    45.461413   9.175587
983 1000    45.461413   9.175545    45.461477   9.175587
984 1000    45.461477   9.175545    45.461540   9.175587
985 1000    45.461540   9.175545    45.461604   9.175587
986 1000    45.461604   9.175545    45.461668   9.175587
987 1000    45.457966   9.175587    45.458030   9.175630
988 1000    45.458030   9.175587    45.458094   9.175630
989 1000    45.458094   9.175587    45.458157   9.175630
990 1000    45.458157   9.175587    45.458221   9.175630
991 1000    45.458221   9.175587    45.458285   9.175630
992 1000    45.458285   9.175587    45.458349   9.175630
993 1000    45.458349   9.175587    45.458413   9.175630
994 1000    45.458413   9.175587    45.458477   9.175630
995 1000    45.458477   9.175587    45.458540   9.175630
996 1000    45.458540   9.175587    45.458604   9.175630
997 1000    45.458604   9.175587    45.458668   9.175630
998 1000    45.458668   9.175587    45.458732   9.175630
999 1000    45.458732   9.175587    45.458796   9.175630
Run Code Online (Sandbox Code Playgroud)

它有22,000,000行×5列,有数据框"B",如下所示:

    type        Lat       Lng
0      0  45.465739  9.180830
1      2  45.463950  9.187113
2      1  45.468015  9.180648
3      1  45.462209  9.187447
4      0  45.459578  9.184007
5      1  45.459822  9.187034
6      2  45.454988  9.180310
7      2  45.459818  9.189377
8      0  45.462200  9.187440
9      0  45.467160  9.180100
10     2  45.459407  9.183300
11     2  45.457699  9.187434
12     1  45.455319  9.186697
13     0  45.461138  9.191943
14     2  45.456397  9.189028
15     0  45.457062  9.185878
16     1  45.461980  9.187024
17     1  45.464319  9.183142
18     2  45.464227  9.187065
19     1  45.460886  9.185216
Run Code Online (Sandbox Code Playgroud)

它有2,000,000行×3列.我想用"B"替换数据框"A"的类型值,其中:

A[latw]<B[lat]<A[late] and A[lngs]<B[lng]<B[lngn]
Run Code Online (Sandbox Code Playgroud)

我想检查一个位于B的位置属于A中的哪一个矩形.

PS我正在寻找python中最快的方法,比如使用并行处理.

piR*_*red 3

高级解释

这个答案相当复杂。以下是高层次要点

  • 与相对于 中排序纬度的np.searchsorted西纬度一起使用。对东纬度重复此操作,但参数为。这告诉我每个西/东纬度落在 中定义的频谱中的位置。 是索引值。因此,如果基于东部的结果大于基于西部的结果,则意味着西部和东部之间存在纬度。 ABside='right'Bnp.searchsortedB
    • 跟踪东部搜索量高于西部搜索量的区域。
    • 找出差异。我不仅关心向东搜索是否大于向西搜索,而且差异还告诉我B.Lat中间有多少个。我扩展数组来跟踪所有可能的匹配。
  • 对南经和北经重复此过程,但仅限于我们找到匹配纬度的子集。无需浪费精​​力寻找我们知道没有匹配项的地方。
    • 执行棘手的切片以展开位置
  • 使用numpy结构化数组查找二维数组的交集
  • 该最终交集具有来自 的序数位置A和来自 的匹配序数位置B
    • 记得我说过我们可以匹配 中的多行B,我选择第一行。

示范

使用A并由B@MaxU提供

p = process(A, B)
print(p)

[[3 0]
 [5 1]
 [9 2]]
Run Code Online (Sandbox Code Playgroud)

显示结果并进行比较。我会把它们并排放置

pd.concat([
        A.iloc[p[:, 0]].reset_index(drop=True),
        B.iloc[p[:, 1]].reset_index(drop=True)
    ], axis=1, keys=['A', 'B'])
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

但是,为了交换type值,我在process函数中放置了一个标志以重新分配。赶紧跑process(A, B, reassign_type=True)

计时
函数
我定义了返回相同内容的函数,并使其成为同类比较

def pir(A, B):
    return A.values[process(A, B)[:, 0], 0]

def maxu(A, B):
    return B.apply(
        lambda x: A.loc[
            (A.latw < x.Lat) &
            (x.Lat < A.late) &
            (A.lngs < x.Lng) &
            (x.Lng < A.lngn), 'type'
        ].head(1),
        axis=1
    ).values.diagonal()
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

更彻底的测试

from timeit import timeit
from itertools import product

rng = np.arange(20, 51, 5)
idx = pd.MultiIndex.from_arrays([rng, rng], names=['B', 'A'])
functions = pd.Index(['pir', 'maxu'], name='method')

results = pd.DataFrame(index=idx, columns=functions)

for bi, ai in idx:
    n = ai
    ws = np.random.rand(n, 2) * 10 + np.array([[40, 30]])
    A = pd.DataFrame(
        np.hstack([-np.ones((n, 1)), ws, ws + np.random.rand(n, 2) * 2]),
        columns=['type', 'latw', 'lngs', 'late', 'lngn']
    )
    m = int(bi ** .5)
    prng = np.arange(m) / m * 10 + .5
    B = pd.DataFrame(
        np.array(list(product(prng, prng))) + np.array([[40, 30]]),
        columns=['Lat', "Lng"])
    B.insert(0, 'type', np.random.randint(3, size=len(B)))
    for j in functions:
        results.set_value(
            (bi, ai), j,
            timeit(
                '{}(A, B)'.format(j),
                'from __main__ import {}, A, B'.format(j),
                number=10
            )
        )

results.plot(rot=45)
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述


代码

from collections import namedtuple
TwoSided = namedtuple('TwoSided', ['match', 'idx', 'found'])

def two_sided_search(left, right, a):
    # sort `a` and save order for later
    argsort = a.argsort()
    asorted = a[argsort]

    # where does `left` fall in `asorted`
    s_left = asorted.searchsorted(left)
    # where does `right` fall in `asorted`
    s_right = asorted.searchsorted(right, side='right')

    rng = np.arange(left.size)

    # a match happens when where we found `left`, `right` was found later
    match_idx = rng[s_left < s_right]

    # ignore positions with no match
    left_pos = s_left[match_idx]
    right_pos = s_right[match_idx]
    # distance from where left was found to where right was found
    # this represents how many items in a are wrapped by left and right
    diff_pos = right_pos - left_pos
    # build an index on `match_idx` paired with all positions in
    # `asorted`.
    d2 = left_pos - np.append(0, diff_pos[:-1].cumsum())
    p1 = np.arange(len(left_pos)).repeat(diff_pos)
    p2 = np.arange(diff_pos.sum()) + d2.repeat(diff_pos)

    # returns
    # `match_idx` which is an integer slice of either `left` or `right`
    # for each element in `match_idx` there are one or more elements in
    # `a` that were surrounded by `left` and `right`
    # `p1` are the positions in `match_idx` that are paired with `argsort[p2]`
    # for example, consider `p1 = [1, 1, 2, 2]` and `argsort[p2] = [3, 5, 12, 5]`
    # this means that `left[match_idx[1]] < a[3] < right[match_idx[1]]`
    # and `left[match_idx[1]] < a[5] < right[match_idx[1]]`
    # and `left[match_idx[2]] < a[12] < right[match_idx[2]]`
    # and `left[match_idx[2]] < a[5] < right[match_idx[2]]`

    return TwoSided(match_idx, p1, argsort[p2])

def intersectNd(a, b):
    # taken from /sf/answers/582218241/
    # returns the interesection of 2 2-D arrays
    # the strategy is to convert to a structured array then
    # use np.intersect1d then convert back to unstructured
    ncols = a.shape[1]
    dtype = dict(
        names=['f{}'.format(i) for i in range(ncols)],
        formats=ncols * [a.dtype]
    )

    return np.intersect1d(
        a.view(dtype), b.view(dtype)
    ).view(a.dtype).reshape(-1, ncols)

def where_in(b1, b2):
    # return a mask of lenght len(b2) where True
    # when element of b2 is in b1
    bsort = b1.argsort()
    bsrtd = b1[bsort]
    bsrch = bsrtd.searchsorted(b2) % bsrtd.size
    return bsrtd[bsrch] == b2

def process(A, B, reassign_type=False):
    lats = two_sided_search(
        A.latw.values,
        A.late.values,
        B.Lat.values,
    )

    # subset A & B
    # no point looking for longitude matches
    # in rows without a latitude match
    lngs = A.lngs.values[lats.match]
    lngn = A.lngn.values[lats.match]
    u, i = np.unique(lats.found, return_index=1)
    lng = B.Lng.values[u]

    lons = two_sided_search(
        lngs, lngn, lng
    )

    # `lats.match` is where we found successful latitudes
    # `lons.match` is where we found successful longitudes
    # since `lons.match` was based on `lats.match` we can slice
    # to get the original positions of `A` that satisfy both
    # the latw < blat < late & lngs < blng < lngn
    match = lats.match[lons.match]

    # however, for every row in A that might be a match for B,
    # there may be multilple B's
    #
    # We start addressing this by finding
    # where in lats.idx do we have matches
    # this gives me a boolean array of lats.match
    # index values repeated for each row in B
    # that satisfies our conditions
    wi = where_in(lons.match, lats.idx)

    # a (? x 2) array representing all combinations of
    # rows in A that matched the 
    lon_pos = np.hstack([
            lats.match[lons.match[lons.idx], None],
            u[lons.found, None]
        ])

    lat_pos = np.hstack([
            lats.match[lats.idx[wi], None],
            lats.found[wi, None]
        ])

    x = intersectNd(lat_pos, lon_pos)
    p, pi = np.unique(x[:, 0], return_index=1)

    if reassign_type:
        A.iloc[x[pi, 0], A.columns.get_loc('type')] = \
            B.iloc[x[pi, 1], B.columns.get_loc('type')].values
    else:
        return x[pi]
Run Code Online (Sandbox Code Playgroud)