2-D矩阵:查找和删除作为其他列子集的列

Ton*_*ass 6 python numpy matrix vectorization scipy

我有一个问题要在逻辑矩阵中标识和删除作为其他列子集的列。即[1,0,1]是[1,1,1]的子集;但是[1,1,0]和[0,1,1]都不是彼此的子集。我写了一段快速的代码来标识作为子集的列,并使用一对嵌套的for循环进行(n ^ 2-n)/ 2检查。

import numpy as np
A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])
rows,cols = A.shape
columns = [True]*cols
for i in range(cols):
    for j in range(i+1,cols):
        diff = A[:,i]-A[:,j]
        if all(diff >= 0):
            print "%d is a subset of %d" % (j, i)
            columns[j] = False
        elif all(diff <= 0):
            print "%d is a subset of %d" % (i, j)
            columns[i] = False
B = A[:,columns]
Run Code Online (Sandbox Code Playgroud)

解决方案应该是

>>> print B
[[1 0 0]
 [0 1 1]
 [1 1 0]
 [1 0 1]
 [1 0 1]
 [1 0 0]
 [0 1 1]
 [0 1 0]]
Run Code Online (Sandbox Code Playgroud)

但是对于大型矩阵,我敢肯定有一种方法可以更快地完成此任务。一种想法是在执行过程中消除子集列,这样我就不会检查已知是子集的列。另一个想法是将其向量化,因此没有O(n ^ 2)运算。谢谢。

Ton*_*ass 1

由于A我实际处理的矩阵是 5000x5000 并且稀疏,密度约为 4%,因此我决定尝试结合 Python 的“集合”对象的稀疏矩阵方法。总的来说,它比我原来的解决方案快得多,但我觉得我从矩阵A到集合列表的过程D并没有那么快。任何有关如何更好地做到这一点的想法都将受到赞赏。

解决方案

import numpy as np

A = np.array([[1, 0, 0, 0, 0, 1],
              [0, 1, 1, 1, 1, 0],
              [1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [0, 0, 1, 0, 1, 0]])

rows,cols = A.shape
drops = np.zeros(cols).astype(bool)

# sparse nonzero elements
C = np.nonzero(A)

# create a list of sets containing the indices of non-zero elements of each column
D = [set() for j in range(cols)]
for i in range(len(C[0])):
    D[C[1][i]].add(C[0][i])

# find subsets, ignoring columns that are known to already be subsets
for i in range(cols):
    if drops[i]==True:
        continue
    col1 = D[i]
    for j in range(i+1,cols):
        col2 = D[j]
        if col2.issubset(col1):
            # I tried `if drops[j]==True: continue` here, but that was slower
            print "%d is a subset of %d" % (j, i)
            drops[j] = True
        elif col1.issubset(col2):
            print "%d is a subset of %d" % (i, j)
            drops[i] = True
            break

B = A[:, ~drops]
print B
Run Code Online (Sandbox Code Playgroud)