将数组排序到由索引数组指定的 bin 的最有效方法？

Question

将数组排序到由索引数组指定的 bin 的最有效方法？

Pau*_*zer 6 python numpy scipy pandas pythran

任务举例：

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx  = np.array([2, 0, 1, 1, 2, 0, 1, 1, 2])

Run Code Online (Sandbox Code Playgroud)

预期结果：

binned = np.array([2, 6, 3, 4, 7, 8, 1, 5, 9])

Run Code Online (Sandbox Code Playgroud)

约束：

应该很快。
应该是O(n+k)其中 n 是数据的长度，k 是 bin 的数量。
应该是稳定的，即保留 bin 内的顺序。

明显的解决方案

data[np.argsort(idx, kind='stable')]

Run Code Online (Sandbox Code Playgroud)

是O(n log n)。

O(n+k) 解决方案

def sort_to_bins(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 1, int)
    for i in range(idx.size):
        cnts[idx[i] + 1] += 1
    for i in range(1, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(data)
    for i in range(data.size):
        res[cnts[idx[i]]] = data[i]
        cnts[idx[i]] += 1
    return res

Run Code Online (Sandbox Code Playgroud)

是循环和缓慢。

纯numpy< scipy< pandas< numba/有没有更好的方法pythran？

Answer 1

Pau*_*zer 7

这里有几个解决方案：

使用np.argsort反正，毕竟这是快速编译代码。
使用np.bincount得到的块大小以及np.argpartition它是O(n)固定数目个二进制位。缺点：目前没有稳定的算法可用，因此我们必须对每个 bin 进行排序。
使用scipy.ndimage.measurements.labeled_comprehension. 这大致完成了所需的工作，但不知道它是如何实现的。
使用pandas. 我是一个完整的pandas菜鸟，所以我在这里拼凑的东西groupby可能不是最理想的。
使用scipy.sparse压缩稀疏行和压缩稀疏列格式之间的切换恰好实现了我们正在寻找的确切操作。
在问题中的循环代码上使用pythran（我确定numba也可以）。所需要做的就是在 numpy 导入后在顶部插入

.

#pythran export sort_to_bins(int[:], float[:], int)

Run Code Online (Sandbox Code Playgroud)

然后编译

# pythran stb_pthr.py

Run Code Online (Sandbox Code Playgroud)

基准 100 个 bin，可变数量的项目：

带回家：

如果你对numba/ 没问题pythran，那就是要走的路，如果不是，则可以很好地scipy.sparse扩展。

代码：

import numpy as np
from scipy import sparse
from scipy.ndimage.measurements import labeled_comprehension
from stb_pthr import sort_to_bins as sort_to_bins_pythran
import pandas as pd

def sort_to_bins_pandas(idx, data, mx=-1):
    df = pd.DataFrame.from_dict(data=data)
    out = np.empty_like(data)
    j = 0
    for grp in df.groupby(idx).groups.values():
        out[j:j+len(grp)] = data[np.sort(grp)]
        j += len(grp)
    return out

def sort_to_bins_ndimage(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    out = np.empty_like(data)
    j = 0
    def collect(bin):
        nonlocal j
        out[j:j+len(bin)] = np.sort(bin)
        j += len(bin)
        return 0
    labeled_comprehension(data, idx, np.arange(mx), collect, data.dtype, None)
    return out

def sort_to_bins_partition(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    return data[np.argpartition(idx, np.bincount(idx, None, mx)[:-1].cumsum())]

def sort_to_bins_partition_stable(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1
    split = np.bincount(idx, None, mx)[:-1].cumsum()
    srt = np.argpartition(idx, split)
    for bin in np.split(srt, split):
        bin.sort()
    return data[srt]

def sort_to_bins_sparse(idx, data, mx=-1):
    if mx==-1:
        mx = idx.max() + 1    
    return sparse.csr_matrix((data, idx, np.arange(len(idx)+1)), (len(idx), mx)).tocsc().data

def sort_to_bins_argsort(idx, data, mx=-1):
    return data[idx.argsort(kind='stable')]

from timeit import timeit
exmpls = [np.random.randint(0, K, (N,)) for K, N in np.c_[np.full(16, 100), 1<<np.arange(5, 21)]]

timings = {}
for idx in exmpls:
    data = np.arange(len(idx), dtype=float)
    ref = None
    for x, f in (*globals().items(),):
        if x.startswith('sort_to_bins_'):
            timings.setdefault(x.replace('sort_to_bins_', '').replace('_', ' '), []).append(timeit('f(idx, data, -1)', globals={'f':f, 'idx':idx, 'data':data}, number=10)*100)
            if x=='sort_to_bins_partition':
                continue
            if ref is None:
                ref = f(idx, data, -1)
            else:
                assert np.all(f(idx, data, -1)==ref)

import pylab
for k, v in timings.items():
    pylab.loglog(1<<np.arange(5, 21), v, label=k)
pylab.xlabel('#items')
pylab.ylabel('time [ms]')
pylab.legend()
pylab.show()

Run Code Online (Sandbox Code Playgroud)

归档时间：	6 年，7 月前
查看次数：	1055 次
最近记录：	6 年，7 月前