Get intersecting rows across two 2D numpy arrays

Kar*_*hik 33 python numpy

I want to get the intersecting (common) rows across two 2D numpy arrays. For example, if the following arrays are passed as inputs:

array([[1, 4],
       [2, 5],
       [3, 6]])

array([[1, 4],
       [3, 6],
       [7, 8]])

the output should be:

array([[1, 4],
       [3, 6]])

I know how to do this with loops. I'm looking for a Pythonic/Numpy way to do it.

Joe*_*ton 31

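For short arrays, using sets is probably the clearest and most readable way to do it.

Another way is to use numpy.intersect1d. You have to trick it into treating each row as a single value, though, which makes things a bit less readable.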

import numpy as np

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

nrows, ncols = A.shape
dtype={'names':['f{}'.format(i) for i in range(ncols)],
       'formats':ncols * [A.dtype]}

C = np.intersect1d(A.view(dtype), B.view(dtype))

# This last bit is optional if you're okay with "C" being a structured array...
C = C.view(A.dtype).reshape(-1, ncols)

For large arrays, this should be considerably faster than using sets.

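  • Actually, no, it won't work. (I realized this earlier and then forgot it!) Without the structured dtype, it doesn't treat things as rows, only as "raw" numbers. Consider something like `A = np.array([[4,1],[2,5],[3,6]])` and `B = np.array([[1,4],[3,6],[7,8]])`. (3 upvotes)
  • Would `np.intersect1d(a, b).reshape(-1, ncols)` achieve the same result? (2 upvotes)
  • You can replace the dtype line with `dtype = (','.join([str(A.dtype)]*ncols))`. No names are specified, so they default to f0, f1, and so on. (2 upvotes)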
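
To make the point in the comments concrete, here is a small sketch using the counterexample arrays suggested there. It assumes both arrays are C-contiguous and share a dtype; the variable names `flat`, `naive`, `row_dtype`, and `rows` are only illustrative.

```python
import numpy as np

A = np.array([[4, 1], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])

# Plain intersect1d flattens both inputs and compares individual numbers.
flat = np.intersect1d(A, B)             # values common to both: 1, 3, 4, 6
naive = flat.reshape(-1, A.shape[1])    # [[1, 3], [4, 6]]: neither is a row of A or B

# Viewing each row as one structured element makes intersect1d compare whole rows.
row_dtype = [('', A.dtype)] * A.shape[1]
rows = np.intersect1d(A.view(row_dtype).ravel(), B.view(row_dtype).ravel())
rows = rows.view(A.dtype).reshape(-1, A.shape[1])   # [[3, 6]]
```

Only `[3, 6]` appears as a complete row in both arrays, which is what the structured view returns; the reshaped flat result pairs up unrelated numbers.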

mtr*_*trw 14

You could use Python's sets:

>>> import numpy as np
>>> A = np.array([[1,4],[2,5],[3,6]])
>>> B = np.array([[1,4],[3,6],[7,8]])
>>> aset = set([tuple(x) for x in A])
>>> bset = set([tuple(x) for x in B])
>>> np.array([x for x in aset & bset])
array([[1, 4],
       [3, 6]])

As Rob Cowie points out, this can be done more concisely as

np.array([x for x in set(tuple(x) for x in A) & set(tuple(x) for x in B)])

There's probably a way to do this without all the going back and forth between arrays and tuples, but it's not coming to me right now.

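  • I agree. I can't find any 'native' numpy way. A one-line version might be `common = set(tuple(i) for i in A) & set(tuple(i) for i in B)` (2 upvotes)
  • If you want to use sets, you can use the intersection function: `set.intersection(aset, bset)` (2 upvotes)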
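
One thing the set approach does not guarantee is the order of the resulting rows, since sets are unordered. A minimal sketch that wraps the idea in a function and sorts the result for reproducibility; the name `intersect_rows_set` is a hypothetical helper, not part of NumPy.

```python
import numpy as np

def intersect_rows_set(A, B):
    """Rows common to A and B, via Python sets (hypothetical helper)."""
    # set.intersection accepts any iterable, so B's rows need no explicit set().
    common = set(map(tuple, A)).intersection(map(tuple, B))
    # Sets are unordered, so sort the tuples to get a reproducible row order.
    # The reshape keeps a (0, ncols) shape when nothing is in common.
    return np.array(sorted(common), dtype=A.dtype).reshape(-1, A.shape[1])

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])
common_rows = intersect_rows_set(A, B)
no_rows = intersect_rows_set(A, np.array([[9, 9]]))
```

Sorting costs a little extra, so it can be dropped if row order does not matter.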

Shu*_*rma 11

Numpy broadcasting

We can create a boolean mask using broadcasting, which can then be used to filter the rows of array A that are also present in array B

A = np.array([[1,4],[2,5],[3,6]])
B = np.array([[1,4],[3,6],[7,8]])

m = (A[:, None] == B).all(-1).any(1)
>>> A[m]

array([[1, 4],
       [3, 6]])
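
The one-line mask packs three steps together. As a rough sketch, splitting it up shows the shape at each stage; the intermediate names `eq`, `row_eq`, `idx_a`, and `idx_b` are introduced here only for illustration.

```python
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])
B = np.array([[1, 4], [3, 6], [7, 8]])

# A[:, None] has shape (3, 1, 2); comparing with B (3, 2) broadcasts to (3, 3, 2).
eq = A[:, None] == B          # eq[i, j, k] is A[i, k] == B[j, k]
row_eq = eq.all(-1)           # shape (3, 3): True where A[i] equals B[j] entirely
m = row_eq.any(1)             # shape (3,): True where A[i] matches some row of B

# The same intermediate also gives the index pairs of the matching rows.
idx_a, idx_b = np.nonzero(row_eq)
```

Note that `eq` holds one boolean per (row of A, row of B, column) combination, so memory grows with the product of the two row counts.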


Vas*_*dis 6

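I couldn't understand why no pure numpy way of getting this to work had been suggested, so I found one that uses numpy broadcasting. The basic idea is to turn one of the arrays into a 3D array by swapping axes. Let's construct two arrays: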

a=np.random.randint(10, size=(5, 3))
b=np.zeros_like(a)
b[:4,:]=a[np.random.randint(a.shape[0], size=4), :]

On my run this gave:

a=array([[5, 6, 3],
   [8, 1, 0],
   [2, 1, 4],
   [8, 0, 6],
   [6, 7, 6]])
b=array([[2, 1, 4],
   [2, 1, 4],
   [6, 7, 6],
   [5, 6, 3],
   [0, 0, 0]])

The steps are (the arrays can be interchanged):

# a is n x m and b is k x m
c = np.swapaxes(a[:, :, None], 1, 2) == b  # a becomes n x 1 x m before comparing
# c has shape n x k x m because the comparison broadcasts;
# c[i, j, :] holds the elementwise comparison between a[i, :] and b[j, :]
# Reduce to n x k with a product: c[i, j] is 1 only when the whole rows match
c = np.prod(c, axis=2)
# To get around duplicates: //
# Cumulative sum along axis 0 (down the rows of a)
c = c * np.cumsum(c, axis=0)
# Compare with 1, so that only the first row of a matching each row of b stays True
c = c == 1
# //
# Sum along axis 1 (over the rows of b), so that a length-n boolean vector is produced
c = np.sum(c, axis=1).astype(bool)
# The intersection between a and b is a[c]
result = a[c]

As a two-line function, to reduce the memory used (correct me if I'm wrong):

def array_row_intersection(a,b):
   tmp=np.prod(np.swapaxes(a[:,:,None],1,2)==b,axis=2)
   return a[np.sum(np.cumsum(tmp,axis=0)*tmp==1,axis=1).astype(bool)]

For my example this gave the result:

result=array([[5, 6, 3],
       [2, 1, 4],
       [6, 7, 6]])

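This is faster than the set solutions, as it uses only simple numpy operations while constantly reducing dimensions, and it is ideal for two big matrices. I may have made mistakes in my comments, as I arrived at the answer by experimentation and instinct. The equivalent for column intersection can be found either by transposing the arrays or by changing the steps a little. Also, if duplicates are wanted, the steps inside "//" must be skipped. The function can be edited to return only the boolean array of indices, which came in handy for me when indexing different arrays with the same vector. Benchmark of the voted answer and mine (the number of elements in each dimension plays a role in which one to choose):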

Code:

import timeit

import numpy as np

def voted_answer(A, B):
    nrows, ncols = A.shape
    dtype = {'names': ['f{}'.format(i) for i in range(ncols)],
             'formats': ncols * [A.dtype]}
    C = np.intersect1d(A.view(dtype), B.view(dtype))
    return C.view(A.dtype).reshape(-1, ncols)

# b is always built from randomly chosen rows of a (with repetition)
a_small = np.random.randint(10, size=(10, 10))
b_small = a_small[np.random.randint(a_small.shape[0], size=[a_small.shape[0]]), :]
a_big_row = np.random.randint(10, size=(10, 1000))
b_big_row = a_big_row[np.random.randint(a_big_row.shape[0], size=[a_big_row.shape[0]]), :]
a_big_col = np.random.randint(10, size=(1000, 10))
b_big_col = a_big_col[np.random.randint(a_big_col.shape[0], size=[a_big_col.shape[0]]), :]
a_big_all = np.random.randint(10, size=(100, 100))
b_big_all = a_big_all[np.random.randint(a_big_all.shape[0], size=[a_big_all.shape[0]]), :]

cases = [('Small arrays:', a_small, b_small),
         ('Big column arrays:', a_big_col, b_big_col),
         ('Big row arrays:', a_big_row, b_big_row),
         ('Big arrays:', a_big_all, b_big_all)]

# Average time per call over 100 runs; array_row_intersection is defined above
for label, a, b in cases:
    print(label)
    print('\t Voted answer:', timeit.timeit(lambda: voted_answer(a, b), number=100) / 100)
    print('\t Proposed answer:', timeit.timeit(lambda: array_row_intersection(a, b), number=100) / 100)

Results:

Small arrays:
     Voted answer: 7.47108459473e-05
     Proposed answer: 2.47001647949e-05
Big column arrays:
     Voted answer: 0.00198730945587
     Proposed answer: 0.0560171294212
Big row arrays:
     Voted answer: 0.00500325918198
     Proposed answer: 0.000308241844177
Big arrays:
     Voted answer: 0.000864889621735
     Proposed answer: 0.00257176160812

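The verdict: if you have to compare two big 2D arrays of 2D points, use the voted answer. If you have matrices that are big in every dimension, the voted answer is definitely the best. In the timings above, the proposed answer was faster only for the small arrays and for the arrays with few rows and many columns. So the choice depends on the shape of your data each time.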
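
The answer above mentions two variations: returning only the boolean mask, and skipping the "//" steps when duplicates are wanted. A minimal sketch of both, assuming the same broadcasting idea; the function name `row_intersection_mask` and the `keep_duplicates` flag are hypothetical and not part of the original answer.

```python
import numpy as np

def row_intersection_mask(a, b, keep_duplicates=False):
    """Boolean mask over the rows of a that also appear in b (hypothetical helper)."""
    # matches[i, j] is True when a[i] equals b[j] in every column.
    matches = (a[:, None, :] == b[None, :, :]).all(axis=2)
    if not keep_duplicates:
        # Equivalent of the "//" steps: for each row of b, keep only the
        # first row of a that matches it.
        matches = matches & (np.cumsum(matches, axis=0) == 1)
    return matches.any(axis=1)

a = np.array([[5, 6, 3], [8, 1, 0], [2, 1, 4], [8, 0, 6], [6, 7, 6]])
b = np.array([[2, 1, 4], [2, 1, 4], [6, 7, 6], [5, 6, 3], [0, 0, 0]])
mask = row_intersection_mask(a, b)

# With repeated rows in the first array, the flag decides whether repeats survive.
a_dup = np.array([[1, 2], [1, 2], [3, 4]])
b_one = np.array([[1, 2]])
first_only = row_intersection_mask(a_dup, b_one)
with_repeats = row_intersection_mask(a_dup, b_one, keep_duplicates=True)
```

Because the function returns a mask rather than rows, the same mask can index any other array that is aligned row-for-row with `a`.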


小智 5

Another way to achieve this, using a structured array:

>>> a = np.array([[3, 1, 2], [5, 8, 9], [7, 4, 3]])
>>> b = np.array([[2, 3, 0], [3, 1, 2], [7, 4, 3]])
>>> av = a.view([('', a.dtype)] * a.shape[1]).ravel()
>>> bv = b.view([('', b.dtype)] * b.shape[1]).ravel()
>>> np.intersect1d(av, bv).view(a.dtype).reshape(-1, a.shape[1])
array([[3, 1, 2],
       [7, 4, 3]])

Just for clarity, the structured view looks like this:

>>> a.view([('', a.dtype)] * a.shape[1])
array([[(3, 1, 2)],
       [(5, 8, 9)],
       [(7, 4, 3)]],
       dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
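
The structured view only works when each row occupies one contiguous run of memory and both arrays use the same dtype. A hedged sketch of a wrapper that checks for this; `intersect_rows` is a hypothetical name, and copying with `np.ascontiguousarray` is one possible way to handle sliced inputs.

```python
import numpy as np

def intersect_rows(a, b):
    """Common rows of two 2D arrays via a structured view (hypothetical helper)."""
    if a.dtype != b.dtype:
        raise ValueError("both arrays must share a dtype for the row view")
    # Slices such as x[:, ::2] are not contiguous; copy them so .view() is valid.
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    ncols = a.shape[1]
    # Empty field names default to f0, f1, ... as in the view shown above.
    row_dtype = [('', a.dtype)] * ncols
    common = np.intersect1d(a.view(row_dtype).ravel(), b.view(row_dtype).ravel())
    return common.view(a.dtype).reshape(-1, ncols)

full = np.array([[3, 9, 1], [5, 9, 8], [7, 9, 4]])
a = full[:, ::2]                      # non-contiguous view: [[3, 1], [5, 8], [7, 4]]
b = np.array([[7, 4], [3, 1], [0, 0]])
common = intersect_rows(a, b)
```

The result comes back sorted row by row, because `np.intersect1d` returns sorted unique values.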