Fast algorithm to find indices where multiple arrays have the same value

acd*_*cdr 7 python performance numpy

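I'm looking for ways to speed up (or replace) my algorithm for grouping data.

I have a list of numpy arrays. I want to generate a new numpy array such that each element of this array is the same for every index at which the original arrays are also the same. (And different where this is not the case.)

This sounds a bit awkward, so here is an example: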

import numpy as np

# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
#                             *           *

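Note that the elements I marked (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays are also the same at those indices (namely 10 and 21). The same goes for the elements at indices 3 and 5 (value 3).

The algorithm has to handle an arbitrary number of (equally sized) input arrays, and also return, for each resulting number, the values of the original arrays it corresponds to. (So for this example, "3" refers to (11, 22).)

Here is my current algorithm: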

import numpy as np

def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1 # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings

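Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.

I'm not sure whether my algorithm can be sped up much further, but I'm also not sure it is the best algorithm to begin with. Is there a better way of doing this?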

Div*_*kar 3

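Finally cracked it with a vectorized solution! It was an interesting problem. The task is to tag each pair of values taken from the corresponding elements of the arrays in the list, and then label each pair according to its uniqueness among the other pairs. So we can abuse np.unique with all of its optional arguments, and finally do a bit of extra work to keep the order of the final output. Here is the implementation, done in basically three stages: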

# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)

# Do the heavy work with np.unique to give us:
#   1. Starting indices of unique elems,
#   2. Array that has unique IDs for each element in idx, and
#   3. Group ID counts
_, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                           return_inverse=True, return_counts=True)

# Best part happens here : Use mask to ignore the repeated elems and re-tag 
# each unqID using argsort() of masked elements from idx
mask = ~np.isin(unqID, np.where(count > 1)[0])  # np.isin replaces np.in1d, deprecated in NumPy 2.0
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
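The question also asks for the values each group ID stands for, which the snippet above does not return. Here is a minimal sketch of one way to recover them from the first-occurrence indices that np.unique already reports. The function name groupify_with_meanings and the (n_groups, n_arrays) layout of the result are my own assumptions and are not part of the original answer. It also ranks groups directly from the first-occurrence indices instead of using the mask. Like the code above, it assumes non-negative integer inputs.

```python
import numpy as np

def groupify_with_meanings(values):
    """Hypothetical helper: group IDs plus the value tuple behind each ID."""
    arr = np.vstack(values)                           # shape (n_arrays, n)
    idx = np.ravel_multi_index(arr, arr.max(1) + 1)   # one integer key per column

    # first_idx[k]: first position of the k-th smallest key
    # inverse[i]:   which sorted key element i carries
    _, first_idx, inverse = np.unique(idx, return_index=True, return_inverse=True)

    # Renumber the sorted keys by order of first appearance.
    order = np.argsort(first_idx)
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)

    group = rank[inverse]
    # Row g holds the original values that group g stands for.
    meanings = arr[:, first_idx[order]].T
    return group, meanings

values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
group, meanings = groupify_with_meanings(values)
print(group)        # [0 1 2 3 0 3 4]
print(meanings[3])  # [11 22]
```

For the test values from the question this reproduces the expected outcome, and group 3 maps back to (11, 22).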

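Runtime test

Let's compare the proposed vectorized approach against the original code. Since the proposed code only gives us the group IDs, for a fair benchmark let's trim from the original code the parts that are not needed to produce them. So, here are the function definitions: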

def groupify(values):  # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group

def groupify_vectorized(values):  # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr,arr.max(1)+1)
    _, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                               return_inverse=True, return_counts=True)
    mask = ~np.isin(unqID, np.where(count > 1)[0])  # np.isin replaces np.in1d, deprecated in NumPy 2.0
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]

Runtime results on a list with large arrays:

In [345]: # Input list with random elements
     ...: values = [item for item in np.random.randint(10,40,(10,10000))]

In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True

In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop

In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
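One caveat with the np.ravel_multi_index step: as far as I know it needs non-negative integers, and the product of the per-array ranges has to fit in an index-sized integer, which can become a problem with many arrays or large values. A sketch of a variant that sidesteps this by letting np.unique compare whole rows is below. It is not the approach timed above and I have not benchmarked it. The name groupify_rows is hypothetical, and it assumes a NumPy version in which np.unique accepts the axis argument.

```python
import numpy as np

def groupify_rows(values):
    """Hypothetical variant: group by unique rows instead of linear indices."""
    rows = np.column_stack(values)                    # shape (n, n_arrays)
    uniq, first_idx, inverse = np.unique(
        rows, axis=0, return_index=True, return_inverse=True)
    inverse = np.ravel(inverse)                       # keep it 1-D across NumPy versions

    # Renumber the sorted unique rows by order of first appearance.
    order = np.argsort(first_idx)
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)

    return rank[inverse], uniq[order]                 # group IDs, value tuple per group

# Negative values are fine here, since no linear index is built.
group, meanings = groupify_rows([np.array([-1, 5, -1]), np.array([2, 2, 2])])
print(group)  # [0 1 0]
```

The second return value gives, for each group ID, the row of original values it refers to, so it covers the "group meanings" part of the question as well.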