Python Numpy向量化嵌套的for循环用于组合

Ton*_*ass 6 python numpy matrix vectorization combinatorics

给定一个nxn实数数组A,我试图找到2-d数组三行的所有组合的按元素最小值的最大值中的最小值。使用for循环,结果如下所示:

import numpy as np

n = 100
np.random.seed(2)
A = np.random.rand(n,n)
global_best = np.inf

for i in range(n-2):
    for j in range(i+1, n-1):
        for k in range(j+1, n):
            # find the maximum of the element-wise minimum of the three vectors
            local_best = np.amax(np.array([A[i,:], A[j,:], A[k,:]]).min(0))
            # if local_best is lower than global_best, update global_best
            if (local_best < global_best):
                global_best = local_best
                save_rows = [i, j, k]

print global_best, save_rows
Run Code Online (Sandbox Code Playgroud)

对于n = 100,输出应为:

Out[]: 0.492652949593 [6, 41, 58]
Run Code Online (Sandbox Code Playgroud)

我有一种感觉,尽管我可以使用Numpy向量化更快地完成此工作,并且一定会感谢您对此提供的帮助。谢谢。

Joh*_*nck 6

此解决方案的速度提高了 5 倍n=100

coms = np.fromiter(itertools.combinations(np.arange(n), 3), 'i,i,i').view(('i', 3))
best = A[coms].min(1).max(1)
at = best.argmin()
global_best = best[at]
save_rows = coms[at]
Run Code Online (Sandbox Code Playgroud)

第一行有点复杂,但将结果转换itertools.combinations为包含所有可能的索引组合的 NumPy 数组[i,j,k]

A从那里开始,使用所有可能的索引组合进行索引,然后沿适当的轴减少就很简单了。

该解决方案会消耗更多内存,因为它构建了所有可能组合的具体数组A[coms]。对于较小的n,比如 250 以下的,它可以节省时间,但对于较大的,n内存流量会非常高,并且可能比原始代码慢。


Ori*_*ril 5

按块进行工作可以结合矢量化微积分的速度,同时避免遇到内存错误。下面是一个将嵌套循环转换为按块矢量化的示例。

\n\n

从与问题相同的变量开始,定义块长度,以便对块内的计算进行矢量化,并仅在块上循环而不是在组合上循环。

\n\n
chunk = 2000 # define chunk length, if to small, the code won\'t take advantage \n             # of vectorization, if it is too large, excessive memory usage will \n             # slow down execution, or Memory Error will be risen \ncombinations = itertools.combinations(range(n),3) # generate iterator containing \n                                        # all possible combinations of 3 columns\nN = n*(n-1)*(n-2)//6 # number of combinations (length of combinations cannot be \n                     # retrieved because it is an iterator)\n# generate a list containing how many elements of combinations will be retrieved \n# per iteration\nn_chunks, remainder = divmod(N,chunk)\ncounts_list = [chunk for _ in range(n_chunks)]\nif remainder:\n    counts_list.append(remainder)\n\n# Iterate one chunk at a time, using vectorized code to treat the chunk\nfor counts in counts_list:\n    # retrieve combinations in current chunk\n    current_comb = np.fromiter(combinations,dtype=\'i,i,i\',count=counts)\\\n                     .view((\'i\',3)) \n    # maximum of element-wise minimum in current chunk\n    chunk_best = np.minimum(np.minimum(A[current_comb[:,0],:],A[current_comb[:,1],:]),\n                            A[current_comb[:,2],:]).max(axis=1) \n    ravel_save_row = chunk_best.argmin() # minimum of maximums in current chunk\n    # check if current chunk contains global minimum\n    if chunk_best[ravel_save_row] < global_best: \n        global_best = chunk_best[ravel_save_row]\n        save_rows = current_comb[ravel_save_row]\nprint(global_best,save_rows)\n
Run Code Online (Sandbox Code Playgroud)\n\n

我与嵌套循环进行了一些性能比较,获得以下结果(chunk_length = 1000):

\n\n
    \n
  • n=100\n\n
      \n
    • 嵌套循环:1.13 s \xc2\xb1 16.6 ms
    • \n
    • 按块工作: 108 ms \xc2\xb1 565 \xc2\xb5s
    • \n
  • \n
  • n=150\n\n
      \n
    • 嵌套循环:4.16 s \xc2\xb1 39.3 ms
    • \n
    • 按块工作: 523 ms \xc2\xb1 4.75 ms
    • \n
  • \n
  • n=500\n\n
      \n
    • 嵌套循环:3min 18s \xc2\xb1 3.21 s
    • \n
    • 按块工作:1min 12s \xc2\xb1 1.6 s
    • \n
  • \n
\n\n

笔记

\n\n

对代码进行分析后,我发现np.min调用. 我直接将其转换为这提高了性能一点。np.maximum.reducenp.maximum

\n


max*_*111 2

不要尝试对不易矢量化的循环进行矢量化。请改用 Numba 等 jit 编译器或使用 Cython。如果生成的代码更具可读性,则矢量化解决方案很好,但就性能而言,编译的解决方案通常更快,或者在最坏的情况下与矢量化解决方案一样快(BLAS 例程除外)。

单线程示例

import numba as nb
import numpy as np

#Min and max library calls may be costly for only 3 values
@nb.njit()
def max_min_3(A,B,C):
  max_of_min=-np.inf
  for i in range(A.shape[0]):
    loc_min=A[i]
    if (B[i]<loc_min):
      loc_min=B[i]
    if (C[i]<loc_min):
      loc_min=C[i]

    if (max_of_min<loc_min):
      max_of_min=loc_min

  return max_of_min

@nb.njit()
def your_func(A):
  n=A.shape[0]
  save_rows=np.zeros(3,dtype=np.uint64)
  global_best=np.inf
  for i in range(n):
      for j in range(i+1, n):
          for k in range(j+1, n):
              # find the maximum of the element-wise minimum of the three vectors
              local_best = max_min_3(A[i,:], A[j,:], A[k,:])
              # if local_best is lower than global_best, update global_best
              if (local_best < global_best):
                  global_best = local_best
                  save_rows[0] = i
                  save_rows[1] = j
                  save_rows[2] = k

  return global_best, save_rows
Run Code Online (Sandbox Code Playgroud)

单线程版本的性能

n=100
your_version: 1.56s
compiled_version: 0.0168s (92x speedup)

n=150
your_version: 5.41s
compiled_version: 0.08122s (66x speedup)

n=500
your_version: 283s
compiled_version: 8.86s (31x speedup)
Run Code Online (Sandbox Code Playgroud)

第一次调用的开销约为 0.3-1 秒。对于计算时间本身的性能测量,调用一次,然后测量性能。

通过一些代码更改,此任务也可以并行化。

多线程示例

@nb.njit(parallel=True)
def your_func(A):
  n=A.shape[0]
  all_global_best=np.inf
  rows=np.empty((3),dtype=np.uint64)

  save_rows=np.empty((n,3),dtype=np.uint64)
  global_best_Temp=np.empty((n),dtype=A.dtype)
  global_best_Temp[:]=np.inf

  for i in range(n):
      for j in nb.prange(i+1, n):
          row_1=0
          row_2=0
          row_3=0
          global_best=np.inf
          for k in range(j+1, n):
              # find the maximum of the element-wise minimum of the three vectors

              local_best = max_min_3(A[i,:], A[j,:], A[k,:])
              # if local_best is lower than global_best, update global_best
              if (local_best < global_best):
                  global_best = local_best
                  row_1 = i
                  row_2 = j
                  row_3 = k

          save_rows[j,0]=row_1
          save_rows[j,1]=row_2
          save_rows[j,2]=row_3
          global_best_Temp[j]=global_best

      ind=np.argmin(global_best_Temp)
      if (global_best_Temp[ind]<all_global_best):
          rows[0] = save_rows[ind,0]
          rows[1] = save_rows[ind,1]
          rows[2] = save_rows[ind,2]
          all_global_best=global_best_Temp[ind]

  return all_global_best, rows
Run Code Online (Sandbox Code Playgroud)

多线程版本的性能

n=100
your_version: 1.56s
compiled_version: 0.0078s (200x speedup)

n=150
your_version: 5.41s
compiled_version: 0.0282s (191x speedup)

n=500
your_version: 283s
compiled_version: 2.95s (96x speedup)
Run Code Online (Sandbox Code Playgroud)

编辑

在较新的 Numba 版本(通过 Anaconda Python 发行版安装)中,我必须手动安装tbb才能获得有效的并行化。