假设有许多重复项，使用 numpy 向量化“纯”函数

Question

假设有许多重复项，使用 numpy 向量化“纯”函数

hil*_*lem 6 python numpy unique vectorization pandas

我想将“黑匣子”Python 函数f应用于大数组arr。额外的假设是：

函数f是“纯的”，例如是确定性的，没有副作用。
数组arr具有少量唯一元素。

我可以使用一个装饰器来实现这一点，该装饰器f为每个唯一元素计算arr如下：

import numpy as np
from time import sleep
from functools import wraps


N = 1000
np.random.seed(0)
arr = np.random.randint(0, 10, size=(N, 2))


def vectorize_pure(f):
    @wraps(f)
    def f_vec(arr):
        uniques, ix = np.unique(arr, return_inverse=True)
        f_range = np.array([f(x) for x in uniques])
        return f_range[ix].reshape(arr.shape)
    return f_vec


@np.vectorize
def usual_vectorize(x):
    sleep(0.001)
    return x


@vectorize_pure
def pure_vectorize(x):
    sleep(0.001)
    return x

# In [47]: %timeit usual_vectorize(arr)                                
# 1.33 s ± 6.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# In [48]: %timeit pure_vectorize(arr)                                 
# 13.6 ms ± 81.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Run Code Online (Sandbox Code Playgroud)

我的担忧是在np.unique幕后排序arr，考虑到假设，这似乎效率低下。我正在寻找一种实用的方法来实现类似的装饰器

利用快速numpy矢量化操作。
不对输入数组进行排序。

我怀疑使用的答案是“是” numba，但我对numpy解决方案特别感兴趣。

此外，似乎取决于arr数据类型，numpy可能会使用基数排序，因此unique在某些情况下的性能可能会很好。

我在下面找到了一个解决方法，使用pandas.unique; 然而，它仍然需要对原始数组进行两次遍历，并pandas.unique做一些额外的工作。我想知道使用pandas._libs.hashtableandcython或其他任何东西是否存在更好的解决方案。

Answer 1

MSe*_*ert 5

您实际上可以通过一次遍历数组来完成此操作，但是它要求您dtype事先知道结果。否则，您需要对元素进行第二次遍历来确定它。

functools.wraps暂时忽略性能（和），实现可能如下所示：

def vectorize_cached(output_dtype):
    def vectorize_cached_factory(f):
        def f_vec(arr):
            flattened = arr.ravel()
            if output_dtype is None:
                result = np.empty_like(flattened)
            else:
                result = np.empty(arr.size, output_dtype)

            cache = {}
            for idx, item in enumerate(flattened):
                res = cache.get(item)
                if res is None:
                    res = f(item)
                    cache[item] = res
                result[idx] = res
            return result.reshape(arr.shape)
        return f_vec
    return vectorize_cached_factory

Run Code Online (Sandbox Code Playgroud)

它首先创建结果数组，然后迭代输入数组。一旦遇到字典中不存在的元素，就会调用该函数（并存储结果） - 否则它只是使用存储在字典中的值。

@vectorize_cached(np.float64)
def t(x):
    print(x)
    return x + 2.5

>>> t(np.array([1,1,1,2,2,2,3,3,1,1,1]))
1
2
3
array([3.5, 3.5, 3.5, 4.5, 4.5, 4.5, 5.5, 5.5, 3.5, 3.5, 3.5])

Run Code Online (Sandbox Code Playgroud)

然而，这并不是特别快，因为我们正在对 NumPy 数组进行 Python 循环。

Cython 解决方案

为了让它更快，我们实际上可以将此实现移植到 Cython（目前仅支持 float32、float64、int32、int64、uint32 和 uint64，但扩展几乎是微不足道的，因为它使用了融合类型）：

%%cython

cimport numpy as cnp

ctypedef fused input_type:
    cnp.float32_t
    cnp.float64_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int32_t
    cnp.int64_t

ctypedef fused result_type:
    cnp.float32_t
    cnp.float64_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int32_t
    cnp.int64_t

cpdef void vectorized_cached_impl(input_type[:] array, result_type[:] result, object func):
    cdef dict cache = {}
    cdef Py_ssize_t idx
    cdef input_type item
    for idx in range(array.size):
        item = array[idx]
        res = cache.get(item)
        if res is None:
            res = func(item)
            cache[item] = res
        result[idx] = res

Run Code Online (Sandbox Code Playgroud)

使用 Python 装饰器（以下代码未使用 Cython 编译）：

def vectorize_cached_cython(output_dtype):
    def vectorize_cached_factory(f):
        def f_vec(arr):
            flattened = arr.ravel()
            if output_dtype is None:
                result = np.empty_like(flattened)
            else:
                result = np.empty(arr.size, output_dtype)

            vectorized_cached_impl(flattened, result, f)

            return result.reshape(arr.shape)
        return f_vec
    return vectorize_cached_factory

Run Code Online (Sandbox Code Playgroud)

同样，这只会执行一次，并且每个唯一值仅应用一次该函数：

%%cython

cimport numpy as cnp

ctypedef fused input_type:
    cnp.float32_t
    cnp.float64_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int32_t
    cnp.int64_t

ctypedef fused result_type:
    cnp.float32_t
    cnp.float64_t
    cnp.uint32_t
    cnp.uint64_t
    cnp.int32_t
    cnp.int64_t

cpdef void vectorized_cached_impl(input_type[:] array, result_type[:] result, object func):
    cdef dict cache = {}
    cdef Py_ssize_t idx
    cdef input_type item
    for idx in range(array.size):
        item = array[idx]
        res = cache.get(item)
        if res is None:
            res = func(item)
            cache[item] = res
        result[idx] = res

Run Code Online (Sandbox Code Playgroud)

基准：快速功能，大量重复

但问题是：在这里使用 Cython 有意义吗？

I did a quick benchmark (without sleep) to get an idea how different the performance is (using my library simple_benchmark):

def func_to_vectorize(x):
    return x

usual_vectorize = np.vectorize(func_to_vectorize)
pure_vectorize = vectorize_pure(func_to_vectorize)
pandas_vectorize = vectorize_with_pandas(func_to_vectorize)
cached_vectorize = vectorize_cached(None)(func_to_vectorize) 
cython_vectorize = vectorize_cached_cython(None)(func_to_vectorize) 


from simple_benchmark import BenchmarkBuilder

b = BenchmarkBuilder()
b.add_function(alias='usual_vectorize')(usual_vectorize)
b.add_function(alias='pure_vectorize')(pure_vectorize)
b.add_function(alias='pandas_vectorize')(pandas_vectorize)
b.add_function(alias='cached_vectorize')(cached_vectorize)
b.add_function(alias='cython_vectorize')(cython_vectorize)

@b.add_arguments('array size')
def argument_provider():
    np.random.seed(0)
    for exponent in range(6, 20):
        size = 2**exponent
        yield size, np.random.randint(0, 10, size=(size, 2))

r = b.run()
r.plot()

Run Code Online (Sandbox Code Playgroud)

According to these times the ranking would be (fastest to slowest):

Cython version
Pandas solution (from another answer)
Pure solution (original post)
NumPys vectorize
The non-Cython version using Cache

The plain NumPy solution is only a factor 5-10 slower if the function call is very inexpensive. The pandas solution also has a much bigger constant factor, making it the slowest for very small arrays.

Benchmark: expensive function (`time.sleep(0.001)`), lots of duplicates

In case the function call is actually expensive (like with time.sleep) the np.vectorize solution will be a lot slower, however there is much less difference between the other solutions:

def vectorize_cached_cython(output_dtype):
    def vectorize_cached_factory(f):
        def f_vec(arr):
            flattened = arr.ravel()
            if output_dtype is None:
                result = np.empty_like(flattened)
            else:
                result = np.empty(arr.size, output_dtype)

            vectorized_cached_impl(flattened, result, f)

            return result.reshape(arr.shape)
        return f_vec
    return vectorize_cached_factory

Run Code Online (Sandbox Code Playgroud)

基准：快速功能，很少重复

但是，如果您没有那么多重复项，则平原np.vectorize几乎与 pure 和 pandas 解决方案一样快，仅比 Cython 版本慢一点：

@vectorize_cached_cython(np.float64)
def t(x):
    print(x)
    return x + 2.5

>>> t(np.array([1,1,1,2,2,2,3,3,1,1,1]))
1
2
3
array([3.5, 3.5, 3.5, 4.5, 4.5, 4.5, 5.5, 5.5, 3.5, 3.5, 3.5])

Run Code Online (Sandbox Code Playgroud)

归档时间：	6 年，2 月前
查看次数：	595 次
最近记录：	5 年，7 月前

假设有许多重复项，使用 numpy 向量化“纯”函数

Cython 解决方案

基准：快速功能，大量重复

Benchmark: expensive function (time.sleep(0.001)), lots of duplicates

基准：快速功能，很少重复

Benchmark: expensive function (`time.sleep(0.001)`), lots of duplicates