Cythonize list of all splits of a string

Question

I'm trying to speed up a piece of code that generates all possible splits of a string.

splits('foo') -> [('f', 'oo'), ('fo', 'o'), ('foo', '')]

The code in Python is very simple:

def splits(text):
    return [(text[:i + 1], text[i + 1:])
            for i in range(len(text))]

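Is there a way to speed this up with Cython or by some other means? For context, the larger purpose of this code is to find the split of a string that has the highest probability.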

Answer 1

Dav*_*idW 6

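This isn't the kind of problem Cython tends to help with. The code relies on slicing, which ends up running at much the same speed as in pure Python (which is actually pretty good).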

Using a 100-character bytes string (b'0'*100) and 10000 iterations in timeit, I get:

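Your code as written - 0.37s
Your code as written, but compiled in Cython - 0.21s
Your code with the line cdef int i added, compiled in Cython - 0.20s (a small improvement; it matters more with longer strings)
Your code with cdef int i and the parameter typed as bytes text - 0.28s (i.e. worse).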
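
As a point of reference, the pure-Python baseline above could be measured with something like the following. This is a minimal sketch, not the answer's actual benchmark script: the helper name time_splits is hypothetical, and absolute numbers will differ by machine and Python version.

```python
import timeit


def splits(text):
    # The question's original implementation.
    return [(text[:i + 1], text[i + 1:])
            for i in range(len(text))]


def time_splits(number=10000):
    # Hypothetical helper: times `number` calls on a 100-character bytes string,
    # matching the setup described in the answer.
    text = b'0' * 100
    return timeit.timeit(lambda: splits(text), number=number)


if __name__ == "__main__":
    print(f"pure Python: {time_splits():.2f}s")
```

The Cython variants would be timed the same way after compiling and importing the extension module.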

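The best speed comes from using the Python C API directly (see the code below) - 0.11s. For convenience I've chosen to do this in Cython (while calling the API functions myself), but you could write very similar code directly in C, with a bit more manual error checking. I've written it for the Python 3 API and assumed you're working with bytes objects (i.e. PyBytes rather than PyString), so if you're on Python 2, or using Unicode strings on Python 3, you'll have to change it slightly.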

from cpython cimport *
cdef extern from "Python.h":
    # This isn't included in the cpython definitions
    # using PyObject* rather than object lets us control refcounting
    PyObject* Py_BuildValue(const char*,...) except NULL

def split(text):
   cdef Py_ssize_t l,i
   cdef char* s

   # Cython automatically checks the return value and raises an error if 
   # these fail. This provides a type-check on text
   PyBytes_AsStringAndSize(text,&s,&l)
   output = PyList_New(l)

   for i in range(l):
       # PyList_SET_ITEM steals a reference
       # the casting is necessary to ensure that Cython doesn't
       # decref the result of Py_BuildValue
       PyList_SET_ITEM(output,i,
                       <object>Py_BuildValue('y#y#',s,i+1,s+i+1,l-(i+1)))
   return output

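If you don't want to go all the way to the C API, then a version that preallocates the list with output = [None]*len(text) and uses a for-loop instead of a list comprehension is slightly more efficient than your original version - 0.18s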
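
The answer does not show that preallocating variant, so here is a minimal sketch of what it might look like. The function name splits_prealloc is an assumption, and the 0.18s figure is the answer's own measurement (presumably for the Cython-compiled version), not something measured on this sketch.

```python
def splits_prealloc(text):
    # Allocate the result list once, then fill it in by index
    # rather than growing it through a list comprehension.
    n = len(text)
    output = [None] * n
    for i in range(n):
        output[i] = (text[:i + 1], text[i + 1:])
    return output
```

It returns the same result as the original, so splits_prealloc('foo') gives [('f', 'oo'), ('fo', 'o'), ('foo', '')].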

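In summary, simply compiling it in Cython gives you a decent speed-up (a little under 2x), and typing i helps a bit more. That is about all you can achieve with Cython in the conventional way. To get full speed you basically need to use the Python C API directly. That gets you a speed-up of a little under 4x (0.37s down to 0.11s in the timings above), which I think is pretty decent.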
