Counting unique items in a numpy array: why is scipy.stats.itemfreq so slow?

Jas*_*n S 8 python numpy scipy python-2.7

I'm trying to count the unique values in a numpy array.

import numpy as np
from collections import defaultdict
import scipy.stats
import time

x = np.tile([1,2,3,4,5,6,7,8,9,10],20000)
for i in [44,22,300,403,777,1009,800]:
    x[i] = 11

def getCounts(x):
    counts = defaultdict(int)
    for item in x:
        counts[item] += 1
    return counts

flist = [getCounts, scipy.stats.itemfreq]

for f in flist:
    print f
    t1 = time.time()
    y = f(x)
    t2 = time.time()
    print y
    print '%.5f sec' % (t2-t1)

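I couldn't find a built-in function at first, so I wrote getCounts(); then I found scipy.stats.itemfreq and thought I would use that instead. But it's slow! Here is what I get on my PC. Why is it so slow compared with such a simple handwritten function?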

<function getCounts at 0x0000000013C78438>
defaultdict(<type 'int'>, {1: 19998, 2: 20000, 3: 19999, 4: 19999, 5: 19999, 6: 20000, 7: 20000, 8: 19999, 9: 20000, 10: 19999, 11: 7})
0.04700 sec
<function itemfreq at 0x0000000013C5D208>
[[  1.00000000e+00   1.99980000e+04]
 [  2.00000000e+00   2.00000000e+04]
 [  3.00000000e+00   1.99990000e+04]
 [  4.00000000e+00   1.99990000e+04]
 [  5.00000000e+00   1.99990000e+04]
 [  6.00000000e+00   2.00000000e+04]
 [  7.00000000e+00   2.00000000e+04]
 [  8.00000000e+00   1.99990000e+04]
 [  9.00000000e+00   2.00000000e+04]
 [  1.00000000e+01   1.99990000e+04]
 [  1.10000000e+01   7.00000000e+00]]
2.04100 sec

War*_*ser 19

If you can use numpy 1.9, you can call numpy.unique with the argument return_counts=True, i.e.

unique_items, counts = np.unique(x, return_counts=True)
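As a rough illustration, the call above can be tried on the array from the question. This is only a sketch, assuming numpy 1.9 or later; the name counts_by_item is made up here, and the dict is built just to mirror what getCounts() returns.

```python
import numpy as np

# Rebuild the array from the question.
x = np.tile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 20000)
for i in [44, 22, 300, 403, 777, 1009, 800]:
    x[i] = 11

# unique_items is sorted; counts[k] is how often unique_items[k] occurs in x.
unique_items, counts = np.unique(x, return_counts=True)

# Pair each value with its count, similar to the dict from getCounts().
# counts_by_item is a hypothetical name used only for this example.
counts_by_item = dict(zip(unique_items.tolist(), counts.tolist()))
```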

In fact, itemfreq was updated to use np.unique, but scipy currently supports numpy versions back to 1.5, so it doesn't use the return_counts argument.

Here is the complete implementation of itemfreq in scipy 0.14:

def itemfreq(a):
    items, inv = np.unique(a, return_inverse=True)
    freq = np.bincount(inv)
    return np.array([items, freq]).T
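For comparison, the same two-column table could presumably be built with return_counts instead of return_inverse plus np.bincount. The sketch below assumes numpy 1.9 or later; itemfreq_counts is a hypothetical name, not a scipy function.

```python
import numpy as np

def itemfreq_counts(a):
    # Hypothetical helper, not part of scipy: returns the same layout as
    # itemfreq (column 0 = unique items, column 1 = their frequencies),
    # but lets np.unique do the counting directly.
    items, counts = np.unique(a, return_counts=True)
    return np.array([items, counts]).T

# Same test array as in the question.
x = np.tile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 20000)
for i in [44, 22, 300, 403, 777, 1009, 800]:
    x[i] = 11

table = itemfreq_counts(x)
```

For an integer input the resulting table keeps an integer dtype, since both columns are integer arrays.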