mor*_*ens 9 python arrays django numpy
I'm trying to get the indices of all duplicate elements in a numpy array, but the solution I've found so far is very inefficient for large (> 20000 element) input arrays (it takes about 9 seconds). The idea is simple:

records_array is a numpy array of timestamps (datetime) from which we want to extract the indices of the repeated timestamps.

time_array is a numpy array containing all the timestamps that are repeated in records_array.

records is a django QuerySet (which can easily be converted to a list) containing some Record objects. We want to create a list of couples formed by all possible combinations of the tagId attributes of Record, corresponding to the repeated timestamps found in records_array.
Here is my current working (but inefficient) code:
import itertools
import numpy as np

tag_couples = []
for t in time_array:
    # Get the indices of all records in records_array with timestamp t
    users_inter = np.nonzero(records_array == t)[0]
    # Temporary list containing all tagIds recorded at time t
    l = [str(records[i].tagId) for i in users_inter]
    # Skip groups where the first tag accounts for every record
    if l.count(l[0]) != len(l):
        # Remove duplicates with set(l) and append all possible couple combinations
        tag_couples += list(itertools.combinations(set(l), 2))
I'm pretty sure this could be optimized with numpy, but I can't find a way to compare records_array against each element of time_array without using a for loop (this can't be done with a plain ==, since they are both arrays).
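For what it's worth, the comparison the question asks about can be expressed without a Python loop via numpy broadcasting. A minimal sketch with made-up toy arrays, not the actual data:

import numpy as np

# Hypothetical stand-ins for the question's arrays
records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
time_array = np.array([1, 3])  # the timestamps known to repeat

# mask[i, j] is True when records_array[j] == time_array[i];
# broadcasting compares every pair at once, no explicit loop needed.
mask = records_array == time_array[:, None]
which_time, which_record = np.nonzero(mask)
# which_record holds indices into records_array, grouped by which_time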
gg3*_*349 22
As usual, the solution comes from the magic of numpy's unique(), with no loops or list comprehensions:
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# indices that would sort records_array
idx_sort = np.argsort(records_array)
sorted_records_array = records_array[idx_sort]

# unique values, index of the first occurrence of each, and occurrence counts
vals, idx_start, count = np.unique(sorted_records_array,
                                   return_counts=True, return_index=True)

# split the sorted indices into one group of indices per unique value
res = np.split(idx_sort, idx_start[1:])

# filter them with respect to their size, keeping only values occurring more than once
vals = vals[count > 1]
res = [group for group in res if group.size > 1]
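To see the groups, pair each surviving value with its index group (a small usage sketch, assuming the snippet above has just run):

# Expected: {1: array([0, 3, 4]), 2: array([1, 8]), 3: array([2, 5, 7])}
# (the order of indices within a group is not guaranteed by argsort)
print(dict(zip(vals, res)))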
EDIT: the following is my previous answer, which requires more memory, using numpy broadcasting and calling unique twice:
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# unique values, the inverse mapping back to records_array, and counts
vals, inverse, count = np.unique(records_array, return_inverse=True,
                                 return_counts=True)

# positions (in vals) of the values that occur more than once
idx_vals_repeated = np.where(count > 1)[0]
vals_repeated = vals[idx_vals_repeated]

# broadcast: rows identify the repeated value, cols its indices in records_array
rows, cols = np.where(inverse == idx_vals_repeated[:, np.newaxis])

# split cols at each change of row to get one index group per repeated value
_, inverse_rows = np.unique(rows, return_index=True)
res = np.split(cols, inverse_rows[1:])
As expected, res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])].
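The extra memory comes from the broadcast comparison, which materializes a boolean matrix of shape (number of repeated values, number of records). A rough, illustrative calculation with hypothetical sizes (one byte per numpy bool):

# e.g. 1,000 repeated values against 2,000,000 records:
n_repeated, n_records = 1_000, 2_000_000
mask_bytes = n_repeated * n_records  # ~2 GB for the mask alone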
Tre*_*ney 15
np.where is faster than defaultdict for up to about 200 unique elements, but slower than pandas.core.groupby.GroupBy.indices and np.unique. The pandas solution is the fastest for large arrays. defaultdict is a fast option for arrays of up to about 2400 elements, especially those with a large number of unique elements. The timings below were measured with %timeit.
import random
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def dd(l):
    # defaultdict test
    indices = defaultdict(list)
    for i, v in enumerate(l):
        indices[v].append(i)
    return indices


def npw(l):
    # np.where test
    return {v: np.where(l == v)[0] for v in np.unique(l)}


def uni(records_array):
    # np.unique test
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]
    vals, idx_start, count = np.unique(sorted_records_array,
                                       return_counts=True, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
    return dict(zip(vals, res))


def daf(l):
    # pandas groupby test
    return pd.DataFrame(l).groupby([0]).indices


data = defaultdict(list)
for x in range(4, 20000, 100):  # number of unique elements
    # create a 2M element array
    random.seed(365)
    a = np.array([random.choice(range(x)) for _ in range(2000000)])

    res1 = %timeit -r2 -n1 -q -o dd(a)
    res2 = %timeit -r2 -n1 -q -o npw(a)
    res3 = %timeit -r2 -n1 -q -o uni(a)
    res4 = %timeit -r2 -n1 -q -o daf(a)

    data['default_dict'].append(res1.average)
    data['np_where'].append(res2.average)
    data['np_unique'].append(res3.average)
    data['pandas'].append(res4.average)
    data['idx'].append(x)

df = pd.DataFrame(data)
df.set_index('idx', inplace=True)

df.plot(figsize=(12, 5), xlabel='unique samples', ylabel='average time (s)',
        title='%timeit test: 2 runs, 1 loop each')
plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
plt.show()
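As a quick sanity check (a sketch, assuming the four functions above are defined in the same session), all implementations should group the same indices, up to ordering within each group:

# Normalize each result to {int value: sorted list of indices} for comparison
def as_groups(d):
    return {int(k): sorted(int(i) for i in v) for k, v in dict(d).items()}

small = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
assert as_groups(dd(small)) == as_groups(npw(small)) \
    == as_groups(uni(small)) == as_groups(daf(small))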