相交并排序两个numpy数组的索引

Joe*_*ntz 8 python sorting optimization numpy set-intersection

我有两个numpy整数数组,长度都是几亿.在每个数组中,值是唯一的,并且每个值最初都是未排序的.

我希望每个索引产生它们的排序交集.例如:

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
Run Code Online (Sandbox Code Playgroud)

然后这些排序的交集是[4, 5, 11],所以我们想要将x和y中的每一个转换成该数组的索引,所以我们希望它返回:

mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])
Run Code Online (Sandbox Code Playgroud)

自那时候起 x[mx] == y[my] == np.intersect1d(x, y)

到目前为止,我们唯一的解决方案涉及三种不同的方法,因此似乎不太可能是最优的.

每个值代表一个星系,以防这个问题变得更有趣.

Ror*_*rke 3

这是一个基于 的实现的选项intersect1d,相当简单。它需要一次调用argsort.

公认的简单测试通过了。

import numpy as np


def my_intersect(x, y):
    """my_intersect(x, y) -> xm, ym
    x, y: 1-d arrays of unique values
    xm, ym: indices into x and y giving sorted intersection
    """
    # basic idea taken from numpy.lib.arraysetops.intersect1d
    aux = np.concatenate((x, y))
    sidx = aux.argsort()
    # Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
    inidx = aux[sidx[1:]] == aux[sidx[:-1]]

    # quicksort is not stable, so must do some work to extract indices
    # (if stable, sidx[inidx.nonzero()]  would be for x)
    # interlace the two sets of indices, and check against lengths
    xym = np.vstack((sidx[inidx.nonzero()],
                     sidx[1:][inidx.nonzero()])).T.flatten()

    xm = xym[xym < len(x)]
    ym = xym[xym >= len(x)] - len(x)

    return xm, ym


def check_my_intersect(x, y):
    mx, my = my_intersect(x, y)
    assert (x[mx] == np.intersect1d(x, y)).all()

    # not really necessary: np.intersect1d returns a sorted list
    assert (x[mx] == sorted(x[mx])).all()
    assert (x[mx] == y[my]).all()


def random_unique_unsorted(n):
    while True:
        x = np.unique(np.random.randint(2*n, size=n))
        if len(x):
            break
    np.random.shuffle(x)
    return x


x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

check_my_intersect(x, y)


for i in range(20):
    x = random_unique_unsorted(100+i)
    y = random_unique_unsorted(200+i)
    check_my_intersect(x, y)
Run Code Online (Sandbox Code Playgroud)

编辑:“Note”注释令人困惑(用作...语音省略号,忘了它也是一个Python运算符)。