Grouping indices of unique elements in numpy

Question

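I have many large (> 100,000,000) lists of integers that contain many duplicates. I want to get the indices at which each element occurs. Currently I am doing something like this: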

import numpy as np
from collections import defaultdict

a = np.array([1, 2, 6, 4, 2, 3, 2])
d=defaultdict(list)
for i,e in enumerate(a):
    d[e].append(i)

d
defaultdict(<type 'list'>, {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]})


Iterating over every element like this is time consuming. Is there an efficient or vectorized way to do this?

Edit 1: I tried the methods by Acorbe and Jaime below on

a = np.random.randint(2000, size=10000000)


The results were:

original: 5.01767015457 secs
Acorbe: 6.11163902283 secs
Jaime: 3.79637312889 secs


Answer 1

Jaime 7

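This is very similar to another question I answered, so what follows is an adaptation of that answer. The simplest way to vectorize this is to use sorting. The following code borrows heavily from the implementation of np.unique in the upcoming version 1.9, which adds the ability to count unique items: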

>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> sort_idx = np.argsort(a)
>>> a_sorted = a[sort_idx]
>>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
>>> unq_items = a_sorted[unq_first]
>>> unq_count = np.diff(np.nonzero(unq_first)[0])


Now:

>>> unq_items
array([1, 2, 3, 4, 6])
>>> unq_count
array([1, 3, 1, 1], dtype=int64)


To get the position indices for each value, we simply do:

>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
>>> unq_idx
[array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64),
 array([3], dtype=int64), array([2], dtype=int64)]


You can now build your dictionary by zipping unq_items and unq_idx.
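
For reference, the sort-and-split steps can be collected into one helper that returns the dictionary directly. This is only a sketch: the function name group_indices is made up for illustration, and kind="stable" (which assumes NumPy 1.15 or later) is an addition so that the indices inside each group come out in ascending order.

```python
import numpy as np

def group_indices(a):
    """Map each distinct value of a 1-D array to the positions where it occurs."""
    a = np.asarray(a)
    if a.size == 0:
        return {}
    # A stable sort keeps equal values in their original relative order
    # (assumption: NumPy >= 1.15 for kind="stable").
    sort_idx = np.argsort(a, kind="stable")
    a_sorted = a[sort_idx]
    # True wherever a new run of equal values starts in the sorted array.
    unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    unq_items = a_sorted[unq_first]
    # Start position of every run except the first: these are the split points.
    split_points = np.nonzero(unq_first)[0][1:]
    unq_idx = np.split(sort_idx, split_points)
    return dict(zip(unq_items.tolist(), unq_idx))

groups = group_indices(np.array([1, 2, 6, 4, 2, 3, 2]))
```

Here groups maps plain Python integers to index arrays, so groups[2] holds the positions of every 2 in the input.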

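Note that unq_count does not count the occurrences of the last unique item, because that count is not needed to split the index array. If you want the counts for all values, you can do: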

>>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
>>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))

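
On NumPy 1.9 or later, the counting step can probably be left to np.unique itself through its return_counts argument. The sketch below assumes such a version, and also assumes NumPy 1.15 or later for kind="stable"; it is an alternative way to obtain the same arrays, not the code from the answer above.

```python
import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])

# np.unique returns the sorted distinct values and how often each occurs.
unq_items, unq_count = np.unique(a, return_counts=True)

# A stable argsort lists the positions of equal values in ascending order.
sort_idx = np.argsort(a, kind="stable")

# Split after each group; the last cumulative sum equals a.size, so drop it.
unq_idx = np.split(sort_idx, np.cumsum(unq_count)[:-1])
```

This still sorts the data twice (once inside np.unique and once in argsort), so for very large arrays the hand-written version may be faster.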
