Speeding up thousands of set operations with Cython

Dav*_*agh 6 optimization cython python-2.7

I have been trying to get over my fear of Cython (fear because I know literally nothing about C or C++).

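I have a function that takes 2 arguments: a set (call it testSet) and a list of sets (call it targetSets). The function iterates through targetSets, computes the length of each set's intersection with testSet, appends that value to a list, and then returns the list.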
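For reference, a minimal pure-Python sketch of the function described above could look like the following. The name computeOverlapPy is hypothetical, chosen only to distinguish it from the compiled function, and the sketch assumes the same rule as the Cython version shown later (an overlap of size 0 or 1 is recorded as 0).

```python
def computeOverlapPy(testSet, targetSets):
    """Return the size of testSet's intersection with each set in targetSets.

    Assumption: overlaps of size 0 or 1 are recorded as 0, mirroring the
    Cython implementation in overlapFunc.pyx.
    """
    obsOverlaps = []
    for target in targetSets:
        # Set intersection, then keep only overlaps larger than one element.
        n = len(testSet & target)
        obsOverlaps.append(n if n > 1 else 0)
    return obsOverlaps


if __name__ == "__main__":
    # Tiny illustrative input; not the real simulation data.
    print(computeOverlapPy({1, 2, 3}, [{1, 2}, {3, 4}, {5}]))
```

A version like this is also handy as a baseline for checking that the compiled function returns the same results on the same input.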

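Now, this isn't that slow by itself, but the problem is that I need to run simulations of testSet (a large number, ~10,000), and targetSets is about 10,000 sets long.

So for a small number of simulations used for testing, the pure Python implementation took ~50 seconds.

I tried making a Cython function, and it worked; it now runs in ~16 seconds.

If anyone can think of anything else I could do to the Cython function, that would be great (Python 2.7, by the way).

Here is my Cython implementation in overlapFunc.pyx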

def computeOverlap(set testSet, list targetSets):
    cdef list obsOverlaps  = []
    cdef int i, N
    cdef set overlap
    N = len(targetSets)
    for i in range(N):
        overlap = testSet & targetSets[i]
        if len(overlap) <= 1:
            obsOverlaps.append(0)
        else:
            obsOverlaps.append(len(overlap))
    return obsOverlaps

setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("overlapFunc", 
                         ["overlapFunc.pyx"])]

setup(
      name = 'computeOverlap function',
      cmdclass = {'build_ext': build_ext},
      ext_modules = ext_modules
      )

And some code to build some random sets for testing and to time the function. test.py

import numpy as np
from overlapFunc import computeOverlap
import time

def simRandomSet(n):
    for i in range(n):
        simSet= set(np.random.randint(low=1, high=100, size=50))
        yield simSet


if __name__ == '__main__':
    np.random.seed(23032014)
    targetSet = [set(np.random.randint(low=1, high=100, size=50)) for i in range(10000)]

    simulatedTestSets = simRandomSet(200)
    start = time.time()
    for i in simulatedTestSets:
        obsOverlaps = computeOverlap(i, targetSet)
    print time.time()-start

I tried changing the def at the start of the computeOverlap function, as follows:

cdef list computeOverlap(set testSet, list targetSets):

but I get the following warning message when I run the setup.py script:

'__pyx_f_11overlapFunc_computeOverlap' defined but not used [-Wunused-function]

And then, when I run something that tries to use the function, I get an ImportError:

    from overlapFunc import computeOverlap
ImportError: cannot import name computeOverlap

Thanks in advance for your help,

Cheers,

Davy

fal*_*tru 2

In the following lines, the extension module name and the source filename do not match the actual filename.

ext_modules = [Extension("computeOverlapWithGeneList", 
                         ["computeOverlapWithGeneList.pyx"])]

Replace them with:

ext_modules = [Extension("overlapFunc",
                         ["overlapFunc.pyx"])]