end*_*ith 16 python numpy combinatorics python-itertools
我想为numpy 实现itertools.combinations.根据这个讨论,我有一个适用于一维输入的功能:
def combs(a, r):
"""
Return successive r-length combinations of elements in the array a.
Should produce the same output as array(list(combinations(a, r))), but
faster.
"""
a = asarray(a)
dt = dtype([('', a.dtype)]*r)
b = fromiter(combinations(a, r), dt)
return b.view(a.dtype).reshape(-1, r)
Run Code Online (Sandbox Code Playgroud)
并且输出有意义:
In [1]: list(combinations([1,2,3], 2))
Out[1]: [(1, 2), (1, 3), (2, 3)]
In [2]: array(list(combinations([1,2,3], 2)))
Out[2]:
array([[1, 2],
[1, 3],
[2, 3]])
In [3]: combs([1,2,3], 2)
Out[3]:
array([[1, 2],
[1, 3],
[2, 3]])
Run Code Online (Sandbox Code Playgroud)
但是,最好将其扩展为ND输入,其中额外的维度只允许您一次快速地执行多个调用.因此,从概念上讲,如果combs([1, 2, 3], 2)生成[1, 2], [1, 3], [2, 3]和 combs([4, 5, 6], 2)生成[4, 5], [4, 6], [5, 6],combs((1,2,3) and (4,5,6), 2)则应生成[1, 2], [1, 3], [2, 3] and [4, 5], [4, 6], [5, 6]"和"只表示平行行或列(无论哪个有意义).(同样适用于其他尺寸)
我不确定:
axis=参数,以及默认的轴0.因此,轴0应该是我正在组合的那个,并且所有其他轴只代表并行计算?)ValueError: setting an array element with a sequence.)dt = dtype([('', a.dtype)]*r)吗?HYR*_*YRY 15
您可以使用itertools.combinations()创建索引数组,然后使用NumPy的花式索引:
import numpy as np
from itertools import combinations, chain
from scipy.special import comb
def comb_index(n, k):
count = comb(n, k, exact=True)
index = np.fromiter(chain.from_iterable(combinations(range(n), k)),
int, count=count*k)
return index.reshape(-1, k)
data = np.array([[1,2,3,4,5],[10,11,12,13,14]])
idx = comb_index(5, 3)
print(data[:, idx])
Run Code Online (Sandbox Code Playgroud)
输出:
[[[ 1 2 3]
[ 1 2 4]
[ 1 2 5]
[ 1 3 4]
[ 1 3 5]
[ 1 4 5]
[ 2 3 4]
[ 2 3 5]
[ 2 4 5]
[ 3 4 5]]
[[10 11 12]
[10 11 13]
[10 11 14]
[10 12 13]
[10 12 14]
[10 13 14]
[11 12 13]
[11 12 14]
[11 13 14]
[12 13 14]]]
Run Code Online (Sandbox Code Playgroud)
当 时r = k = 2,您还可以使用numpy.triu_indices(n, 1)哪些索引矩阵的上三角。
idx = comb_index(5, 2)
Run Code Online (Sandbox Code Playgroud)
来自HYRY 的回答相当于
idx = np.transpose(np.triu_indices(5, 1))
Run Code Online (Sandbox Code Playgroud)
但是是内置的,对于高于 ~20 的 N,速度要快几倍:
timeit comb_index(1000, 2)
32.3 ms ± 443 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
timeit np.transpose(np.triu_indices(1000, 1))
10.2 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Run Code Online (Sandbox Code Playgroud)
np.triu_indices我已经k = 2使用perfplot. 毫无疑问,获胜者是,np.triu_indices我现在看到,np.dtype([('', np.intp)] * 2)即使对于igraph.EdgeList.
from itertools import combinations, chain
from scipy.special import comb
import igraph as ig #graph library build on C
import networkx as nx #graph library, pure Python
def _combs(n):
return np.array(list(combinations(range(n),2)))
def _combs_fromiter(n): #@Jaime
indices = np.arange(n)
dt = np.dtype([('', np.intp)]*2)
indices = np.fromiter(combinations(indices, 2), dt)
indices = indices.view(np.intp).reshape(-1, 2)
return indices
def _combs_fromiterplus(n):
dt = np.dtype([('', np.intp)]*2)
indices = np.fromiter(combinations(range(n), 2), dt)
indices = indices.view(np.intp).reshape(-1, 2)
return indices
def _numpy(n): #@endolith
return np.transpose(np.triu_indices(n,1))
def _igraph(n):
return np.array(ig.Graph(n).complementer(False).get_edgelist())
def _igraph_fromiter(n):
dt = np.dtype([('', np.intp)]*2)
indices = np.fromiter(ig.Graph(n).complementer(False).get_edgelist(), dt)
indices = indices.view(np.intp).reshape(-1, 2)
return indices
def _nx(n):
G = nx.Graph()
G.add_nodes_from(range(n))
return np.array(list(nx.complement(G).edges))
def _nx_fromiter(n):
G = nx.Graph()
G.add_nodes_from(range(n))
dt = np.dtype([('', np.intp)]*2)
indices = np.fromiter(nx.complement(G).edges, dt)
indices = indices.view(np.intp).reshape(-1, 2)
return indices
def _comb_index(n): #@HYRY
count = comb(n, 2, exact=True)
index = np.fromiter(chain.from_iterable(combinations(range(n), 2)),
int, count=count*2)
return index.reshape(-1, 2)
fig = plt.figure(figsize=(15, 10))
plt.grid(True, which="both")
out = perfplot.bench(
setup = lambda x: x,
kernels = [_numpy, _combs, _combs_fromiter, _combs_fromiterplus,
_comb_index, _igraph, _igraph_fromiter, _nx, _nx_fromiter],
n_range = [2 ** k for k in range(12)],
xlabel = 'combinations(n, 2)',
title = 'testing combinations',
show_progress = False,
equality_check = False)
out.show()
Run Code Online (Sandbox Code Playgroud)
想知道为什么np.triu_indices不能扩展到更多维度?
triu_indices在此处实现)= 高达 2 倍的加速np.triu_indices 实际上可能是案例的赢家k = 3,即使k = 4我们实现了一个通用的方法。此方法的当前版本等效于:
def triu_indices(n, k):
x = np.less.outer(np.arange(n), np.arange(-k+1, n-k+1))
return np.nonzero(x)
Run Code Online (Sandbox Code Playgroud)
它为两个序列 0,1,...,n-1构造关系x < y 的矩阵表示,并找到它们不为零的单元格的位置。对于 3D 情况,我们需要添加额外的维度和相交关系x < y和y < z。对于下一个维度过程是相同的,但这会导致巨大的内存过载,因为需要 n^k 个二进制单元,并且其中只有 C(n, k) 获得 True 值。内存使用量和性能以 O(n!) 增长,因此该算法itertools.combinations仅在 k 值较小时表现更好。这最好用于实际案例k=2和k=3
def C(n, k): #huge memory overload...
if k==0:
return np.array([])
if k==1:
return np.arange(1,n+1)
elif k==2:
return np.less.outer(np.arange(n), np.arange(n))
else:
x = C(n, k-1)
X = np.repeat(x[None, :, :], len(x), axis=0)
Y = np.repeat(x[:, :, None], len(x), axis=2)
return X&Y
def C_indices(n, k):
return np.transpose(np.nonzero(C(n,k)))
Run Code Online (Sandbox Code Playgroud)
让我们用perfplot结帐:
import matplotlib.pyplot as plt
import numpy as np
import perfplot
from itertools import chain, combinations
from scipy.special import comb
def C(n, k): # huge memory overload...
if k == 0:
return np.array([])
if k == 1:
return np.arange(1, n + 1)
elif k == 2:
return np.less.outer(np.arange(n), np.arange(n))
else:
x = C(n, k - 1)
X = np.repeat(x[None, :, :], len(x), axis=0)
Y = np.repeat(x[:, :, None], len(x), axis=2)
return X & Y
def C_indices(data):
n, k = data
return np.transpose(np.nonzero(C(n, k)))
def comb_index(data):
n, k = data
count = comb(n, k, exact=True)
index = np.fromiter(chain.from_iterable(combinations(range(n), k)),
int, count=count * k)
return index.reshape(-1, k)
def build_args(k):
return {'setup': lambda x: (x, k),
'kernels': [comb_index, C_indices],
'n_range': [2 ** x for x in range(2, {2: 10, 3:10, 4:7, 5:6}[k])],
'xlabel': f'N',
'title': f'test of case C(N,{k})',
'show_progress': True,
'equality_check': lambda x, y: np.array_equal(x, y)}
outs = [perfplot.bench(**build_args(n)) for n in (2, 3, 4, 5)]
fig = plt.figure(figsize=(20, 20))
for i in range(len(outs)):
ax = fig.add_subplot(2, 2, i + 1)
ax.grid(True, which="both")
outs[i].plot()
plt.show()
Run Code Online (Sandbox Code Playgroud)
因此,实现了最佳性能提升k=2(相当于np.triu_indices) and for k=3`,它几乎快了两倍。
numpy_combinations在此处实现)= 高达 2.5 倍的加速在这个问题之后(感谢@Divakar),我设法找到了一种基于前一列和帕斯卡三角形计算特定列值的方法。它尚未尽可能优化,但结果确实很有希望。开始了:
from scipy.linalg import pascal
def stretch(a, k):
l = a.sum()+len(a)*(-k)
out = np.full(l, -1, dtype=int)
out[0] = a[0]-1
idx = (a-k).cumsum()[:-1]
out[idx] = a[1:]-1-k
return out.cumsum()
def numpy_combinations(n, k):
#n, k = data #benchmark version
n, k = data
x = np.array([n])
P = pascal(n).astype(int)
C = []
for b in range(k-1,-1,-1):
x = stretch(x, b)
r = P[b][x - b]
C.append(np.repeat(x, r))
return n - 1 - np.array(C).T
Run Code Online (Sandbox Code Playgroud)
基准测试结果是:
# script is the same as in previous example except this part
def build_args(k):
return {'setup': lambda x: (k, x),
'kernels': [comb_index, numpy_combinations],
'n_range': [x for x in range(1, k)],
'xlabel': f'N',
'title': f'test of case C({k}, k)',
'show_progress': True,
'equality_check': False}
outs = [perfplot.bench(**build_args(n)) for n in (12, 15, 17, 23, 25, 28)]
fig = plt.figure(figsize=(20, 20))
for i in range(len(outs)):
ax = fig.add_subplot(2, 3, i + 1)
ax.grid(True, which="both")
outs[i].plot()
plt.show()
Run Code Online (Sandbox Code Playgroud)
尽管它仍然不能打itertools.combinations的n < 15,但它是在其他情况下,一个新的赢家。最后但并非最不重要的一点是,numpy当组合数量变得非常大时,展示它的威力。它能够在处理 C(28, 14) 个组合时存活下来,大约 40'000'000 个大小为 14 的项目