33 python algorithm optimization performance cython
我正在尝试找到一种有效的方法,将包含整数点的数据行组合在一起,并将它们存储为Python对象.该数据是由X和Y坐标点,表示为逗号分隔的字符串.这些点必须配对,如(x_1, y_1), (x_2, y_2), ...等,然后存储为对象列表,其中每个点都是一个对象.下面的函数get_data生成此示例数据:
def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 10)) for x in range(M)],
                [str(random.randint(1, 10)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data
我现在的解析代码是:
class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b
def test():
    import time
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
目前,100,000点需要近7秒,这看起来非常低效.效率低下的部分似乎从计算干first_points,second_points和paired_points-以及这些转化为对象.
效率低下的另一部分似乎是建立起来的all_point_sets.取出all_point_sets.append(...)线似乎使代码从~7秒到2秒!
怎么加速呢?谢谢.
关注感谢大家的好建议 - 他们都很有帮助.但即使有了所有改进,处理100,000个条目仍然需要3秒钟.我不确定为什么在这种情况下它不仅仅是即时的,以及是否有一种可以立即实现的替代表示.在Cython中编码这会改变一些东西吗?有人可以提供一个例子吗?再次感谢.
Mat*_*son 20
在处理大量对象的创建时,通常可以使用的单个最大性能增强功能是关闭垃圾收集器.每个"生成"对象,垃圾收集器遍历内存中的所有活动对象,查找作为循环的一部分但未被活动对象指向的对象,因此有资格进行内存回收.有关一些信息,请参阅Doug Helmann的PyMOTW GC文章(也许可以通过谷歌和一些决心找到更多信息).默认情况下,垃圾收集器每700个左右创建一个但未回收的对象,后续运行的次数少一些(我忘记确切的细节).
使用标准元组而不是Point类可以节省一些时间(使用namedtuple介于两者之间),巧妙的解包可以节省一些时间,但是在创建批次之前关闭gc可以获得最大的收益您知道的对象不需要gc'd,然后再将其重新打开.
一些代码:
def orig_test_gc_off():
    import time
    data = get_data()
    all_point_sets = []
    import gc
    gc.disable()
    time_start = time.time()
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "gc off total time: ", (time_end - time_start)
def test1():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    gc.disable()
    for index, row in enumerate(data):
        first_points, second_points = row
        curr_points = map(
            Point,
            [int(i) for i in first_points.split(",")],
            [int(i) for i in second_points.split(",")])
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 1 total time: ", (time_end - time_start)
def test2():
    import time
    import gc
    data = get_data()
    all_point_sets = []
    gc.disable()
    time_start = time.time()
    for index, row in enumerate(data):
        first_points, second_points = row
        first_points = [int(i) for i in first_points.split(",")]
        second_points = [int(i) for i in second_points.split(",")]
        curr_points = [(x, y) for x, y in zip(first_points, second_points)]
        all_point_sets.append(curr_points)
    time_end = time.time()
    gc.enable()
    print "variant 2 total time: ", (time_end - time_start)
orig_test()
orig_test_gc_off()
test1()
test2()
一些结果:
>>> %run /tmp/flup.py
total time:  6.90738511086
gc off total time:  4.94075202942
variant 1 total time:  4.41632509232
variant 2 total time:  3.23905301094
Joh*_*ooy 15
简单地与pypy一起运行会产生很大的不同
$ python pairing_strings.py 
total time:  2.09194397926
$ pypy pairing_strings.py 
total time:  0.764246940613
禁用gc对pypy没有帮助
$ pypy pairing_strings.py 
total time:  0.763386964798
Point的namedtuple让事情变得更糟
$ pypy pairing_strings.py 
total time:  0.888827085495
使用itertools.imap和itertools.izip
$ pypy pairing_strings.py 
total time:  0.615751981735
使用memoized版本的int和迭代器来避免zip
$ pypy pairing_strings.py 
total time:  0.423738002777 
这是我完成的代码.
def test():
    import time
    def m_int(s, memo={}):
        if s in memo:
            return memo[s]
        else:
            retval = memo[s] = int(s)
            return retval
    data = get_data()
    all_point_sets = []
    time_start = time.time()
    for xs, ys in data:
        point_set = []
        # Convert points from strings to integers
        y_iter = iter(ys.split(","))
        curr_points = [Point(m_int(i), m_int(next(y_iter))) for i in xs.split(",")]
        all_point_sets.append(curr_points)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
我会
numpy数组来解决这个问题(Cython如果这还不够快,那将是一个选项).Point实例.例如,Numpy内置了读取文本文件的功能loadtxt.如果您将数据存储在结构化数组中,则不一定需要将其转换为其他数据类型.我将使用Pandas,这是一个基于的库构建numpy.处理和处理结构化数据更方便一些.Pandas有自己的文件解析器read_csv.
为了计时,我将数据写入文件,就像你原来的问题(它是基于你的get_data):
import numpy as np
import pandas as pd
def create_example_file(n=100000, m=20):
    ex1 = pd.DataFrame(np.random.randint(1, 10, size=(10e4, m)),
                       columns=(['x_%d' % x for x in range(10)] +
                                ['y_%d' % y for y in range(10)]))
    ex1.to_csv('example.csv', index=False, header=False)
    return
这是我用来读取数据的代码pandas.DataFrame:
def with_read_csv(csv_file):
    df = pd.read_csv(csv_file, header=None,
                     names=(['x_%d' % x for x in range(10)] +
                            ['y_%d' % y for y in range(10)]))
    return df
(请注意,我假设您的文件中没有标题,因此我必须创建列名.)
读取数据的速度很快,内存效率应该更高(请参阅此问题)并且数据存储在数据结构中,您可以使用快速,矢量化的方式进一步处理:
In [18]: %timeit string_to_object.with_read_csv('example.csv')
1 loops, best of 3: 553 ms per loop
在开发分支中有一个新的基于C的解析器,在我的系统上需要414毫秒.您的测试在我的系统上需要2.29秒,但它不具有可比性,因为数据不是从文件中读取而是您创建了Point实例.
如果您曾读过数据,则可以将其存储在hdf5文件中:
In [19]: store = pd.HDFStore('example.h5')
In [20]: store['data'] = df
In [21]: store.close()
下次需要数据时,您可以从此文件中读取数据,这非常快:
In [1]: store = pd.HDFStore('example.h5')
In [2]: %timeit df = store['data']
100 loops, best of 3: 16.5 ms per loop
但是,如果您需要多次使用相同的数据,它将仅适用.
numpy当您进行进一步的计算时,使用具有大型数据集的基础数组将具有优势.Cython如果你可以使用矢量化numpy函数和索引,它不一定会更快,如果你真的需要迭代它会更快(参见这个答案).
更快的方法,使用Numpy(加速大约7倍):
import numpy as np
txt = ','.join(','.join(row) for row in data)
arr = np.fromstring(txt, dtype=int, sep=',')
return arr.reshape(100000, 2, 10).transpose((0,2,1))
表现比较:
def load_1(data):
    all_point_sets = []
    gc.disable()
    for xs, ys in data:
        all_point_sets.append(zip(map(int, xs.split(',')), map(int, ys.split(','))))
    gc.enable()
    return all_point_sets
def load_2(data):
    txt = ','.join(','.join(row) for row in data)
    arr = np.fromstring(txt, dtype=int, sep=',')
    return arr.reshape(100000, 2, 10).transpose((0,2,1))
load_1在我的机器上运行1.52秒; load_2在0.20秒内运行,提高了7倍.这里最大的警告是,它要求你(1)事先知道所有事物的长度,(2)每行包含完全相同的点数.这适用于您的get_data输出,但可能不适用于您的真实数据集.
我通过使用数组获得了50%的改进,并且在访问时懒惰地构造了Point对象的持有者对象.我还"插入"Point对象以获得更好的存储效率.但是,元组可能会更好.
如果可能的话,更改数据结构也可能有所帮助.但这永远不会是瞬间的.
from array import array
class Point(object):
    __slots__ = ["a", "b"]
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def __repr__(self):
        return "Point(%d, %d)" % (self.a, self.b)
class Points(object):
    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys
    def __getitem__(self, i):
        return Point(self.xs[i], self.ys[i])
def test3():
    xs = array("i")
    ys = array("i")
    time_start = time.time()
    for row in data:
        xs.extend([int(val) for val in row[0].split(",")])
        ys.extend([int(val) for val in row[1].split(",")])
    print ("total time: ", (time.time() - time_start))
    return Points(xs, ys)
但是当处理大量数据时,我通常会使用numpy N维数组(ndarray).如果原始数据结构可以改变,那么这可能是最快的.如果它可以被构造成读取x,y线性对,然后重塑ndarray.
制作Point一个namedtuple(加速度约10%):
from collections import namedtuple
Point = namedtuple('Point', 'a b')
在迭代期间解压缩(〜2-4%加速):
for xs, ys in data:
使用n-argument形式map来避免zip(~10%加速):
curr_points = map(Point,
    map(int, xs.split(',')),
    map(int, ys.split(',')),
)
鉴于点集很短,生成器可能过度,因为它们具有更高的固定开销.
cython能够将速度提高5.5倍
$ python split.py
total time:  2.16252303123
total time:  0.393486022949
这是我使用的代码
import time
import pyximport; pyximport.install()
from split_ import test_
def get_data(N=100000, M=10):
    import random
    data = []
    for n in range(N):
        pair = [[str(random.randint(1, 100)) for x in range(M)],
                [str(random.randint(1, 100)) for x in range(M)]]
        row = [",".join(pair[0]),
               ",".join(pair[1])]
        data.append(row)
    return data
class Point:
    def __init__(self, a, b):
        self.a = a
        self.b = b
def test(data):
    all_point_sets = []
    for row in data:
        point_set = []
        first_points, second_points = row
        # Convert points from strings to integers
        first_points = map(int, first_points.split(","))
        second_points = map(int, second_points.split(","))
        paired_points = zip(first_points, second_points)
        curr_points = [Point(p[0], p[1]) \
                       for p in paired_points]
        all_point_sets.append(curr_points)
    return all_point_sets
data = get_data()
for func in test, test_:
    time_start = time.time()
    res = func(data)
    time_end = time.time()
    print "total time: ", (time_end - time_start)
from libc.string cimport strsep
from libc.stdlib cimport atoi
cdef class Point:
    cdef public int a,b
    def __cinit__(self, a, b):
        self.a = a
        self.b = b
def test_(data):
    cdef char *xc, *yc, *xt, *yt
    cdef char **xcp, **ycp
    all_point_sets = []
    for xs, ys in data:
        xc = xs
        xcp = &xc
        yc = ys
        ycp = &yc
        point_set = []
        while True:
            xt = strsep(xcp, ',')
            if xt is NULL:
                break
            yt = strsep(ycp, ",")
            point_set.append(Point(atoi(xt), atoi(yt)))
        all_point_sets.append(point_set)
    return all_point_sets
进一步探索我可以大致分解一些cpu资源
         5% strsep()
         9% atoi()
        23% creating Point instances
        35% all_point_sets.append(point_set)
我希望如果cython能够直接从csv(或其他)文件中读取而不必通过Python对象进行搜索,那么可能会有所改进.