acd*_*cdr 7 python performance numpy
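I am looking for ways to speed up (or replace) my algorithm for grouping data.
I have a list of numpy arrays. I want to generate a new numpy array in which two elements are equal exactly when every original array also holds equal values at those two indices. (Otherwise the elements differ.)
That sounds a bit awkward, so here is an example: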
# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
#                             *           *
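Note that the elements I marked (indices 0 and 4) share the same value (0) because the two original arrays also agree at those indices (namely 10 and 21). The same holds for the elements at indices 3 and 5 (value 3).
The algorithm must handle an arbitrary number of (equally sized) input arrays, and must also return, for each resulting number, the values of the original arrays it corresponds to. (So for this example, "3" refers to (11, 22).)
Here is my current algorithm: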
import numpy as np

def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1  # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings
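Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.
I am not sure whether my algorithm can be sped up, but I am also not sure whether it is the best algorithm in the first place. Is there a better way to do this?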
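As a point of comparison for the repeated-comparison cost described above, here is a minimal sketch of a single-pass baseline that looks at every index only once and keys a dictionary on the tuple of values found there. It is not part of the original post; the name `groupify_single_pass` is hypothetical, and the sketch assumes the values are hashable scalars. It returns the same two things as `groupify`: the group labels and a mapping from each label to `{array position: value}`.

```python
import numpy as np

def groupify_single_pass(values):
    """Label each index by the tuple of values found there, visiting each index once."""
    label_of = {}        # tuple of values -> group label
    group_meanings = {}  # group label -> {array position: value}, as in groupify
    group = np.empty(len(values[0]), dtype=np.int64)
    # zip(*...) yields one tuple per index, e.g. (10, 21) for index 0.
    for i, combo in enumerate(zip(*(v.tolist() for v in values))):
        # A new combination gets the next free label; a known one keeps its label.
        label = label_of.setdefault(combo, len(label_of))
        if label not in group_meanings:
            group_meanings[label] = dict(enumerate(combo))
        group[i] = label
    return group, group_meanings

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
group, group_meanings = groupify_single_pass(values)
print(group)              # [0 1 2 3 0 3 4]
print(group_meanings[3])  # {0: 11, 1: 22}
```

The loop still runs in Python, so this is only a baseline for correctness checks and small inputs, not a claim about speed on large arrays.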
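Finally cracked a vectorized solution! This was an interesting problem. The task is to tag each pair of values taken from corresponding elements of the arrays in the list, so that each pair is labeled according to its uniqueness among the other pairs. We can therefore abuse np.unique with all of its optional arguments, and then do a little extra work at the end to preserve the order of the final output. Here is the implementation, done in basically three stages: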
# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)

# Do the heavy work with np.unique to give us :
#   1. Starting indices of unique elems,
#   2. Array that has unique IDs for each element in idx, and
#   3. Group ID counts
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
                                        return_inverse=True,return_counts=True)

# Best part happens here : Use mask to ignore the repeated elems and re-tag
# each unqID using argsort() of masked elements from idx
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
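To make the three stages concrete, here is a sketch that runs them on the test values from the question and prints the intermediate arrays. Two things in it are my own assumptions rather than part of the answer: `np.isin` is used in place of `np.in1d` (newer NumPy releases deprecate `np.in1d`), and the last few lines, including the names `first_seen` and `group_meanings`, are a hypothetical way to recover the value tuple behind each group ID, which the question asks for but the snippet above does not return.

```python
import numpy as np

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]

# Stage 1: encode each column (a, b) as one integer. Here arr.max(1) + 1 is
# [12, 24], so the encoding is a * 24 + b.
arr = np.vstack(values)
idx = np.ravel_multi_index(arr, arr.max(1) + 1)
print(idx)  # [261 285 262 286 261 286 263]

# Stage 2: first-occurrence positions, sorted-rank IDs and counts.
_, unq_start_idx, unqID, count = np.unique(
    idx, return_index=True, return_inverse=True, return_counts=True)

# Stage 3: keep only the first occurrence of every code, then map each
# sorted rank to its position in order of first appearance.
mask = ~np.isin(unqID, np.where(count > 1)[0])  # np.isin instead of np.in1d
mask[unq_start_idx] = True
out = idx[mask].argsort()[unqID]
print(out)  # [0 1 2 3 0 3 4]

# Assumption: mask marks exactly the first occurrence of each group, in label
# order, so those columns of arr give the meaning of each group ID.
first_seen = np.flatnonzero(mask)
group_meanings = {label: tuple(arr[:, i].tolist())
                  for label, i in enumerate(first_seen)}
print(group_meanings[3])  # (11, 22)
```

Note that `np.ravel_multi_index` expects non-negative integer coordinates, so this encoding assumes the input arrays contain no negative values.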
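Runtime test
Let's compare the proposed vectorized approach with the original code. Since the proposed code only gives us the group IDs, for a fair benchmark let's trim from the original code the parts that are not needed to produce the group IDs. So, here are the function definitions: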
def groupify(values): # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group

def groupify_vectorized(values): # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr,arr.max(1)+1)
    _,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
                                            return_inverse=True,return_counts=True)
    mask = ~np.in1d(unqID,np.where(count>1)[0])
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]
Runtime results on a list with large arrays:
In [345]: # Input list with random elements
     ...: values = [item for item in np.random.randint(10,40,(10,10000))]
In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True
In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop
In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
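If the inputs may contain negative numbers, or the value ranges are wide enough that a combined linear index becomes impractical, one possible variation is to let `np.unique` compare whole rows via its `axis` argument instead of encoding each column as a single integer first. The sketch below is not from the answer above and has not been benchmarked against it; the function name `groupify_unique_axis` is hypothetical. It also returns the value tuple behind each group ID.

```python
import numpy as np

def groupify_unique_axis(values):
    """Group equal columns with np.unique(axis=0); labels follow first appearance."""
    rows = np.stack(values, axis=1)  # shape (n_indices, n_arrays)
    uniq, first_idx, inverse = np.unique(
        rows, axis=0, return_index=True, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard: some NumPy versions return a 2-D inverse
    # np.unique sorts the unique rows, so renumber groups by first appearance.
    order = first_idx.argsort()             # order[k] = sorted rank of k-th group to appear
    relabel = np.empty_like(order)
    relabel[order] = np.arange(order.size)  # sorted rank -> first-appearance label
    group = relabel[inverse]
    group_meanings = {k: tuple(uniq[r].tolist()) for k, r in enumerate(order)}
    return group, group_meanings

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
group, group_meanings = groupify_unique_axis(values)
print(group)              # [0 1 2 3 0 3 4]
print(group_meanings[3])  # (11, 22)

# Negative values are fine here because no linear index is built.
neg_group, _ = groupify_unique_axis([np.array([-1, 5, -1]), np.array([2, 2, 2])])
print(neg_group)          # [0 1 0]
```

Row-wise `np.unique` does more work per comparison than the integer encoding, so it trades some speed for generality; timing both on your own data is the only reliable way to choose.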