Jus*_*guy 44 python performance mean
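I compared the performance of the mean function from the statistics module with the simple sum(l)/len(l) approach and found that mean is very slow for some reason. I used timeit with the two code snippets below to compare them. Does anyone know what causes the huge difference in execution speed? I'm using Python 3.5.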
from timeit import repeat
print(min(repeat('mean(l)',
'''from random import randint; from statistics import mean; \
l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))
The code above takes about 0.043 seconds to run on my machine.
from timeit import repeat
print(min(repeat('sum(l)/len(l)',
'''from random import randint; from statistics import mean; \
l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))
The code above takes about 0.000565 seconds to run on my machine.
Jiv*_*van 68
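Python's statistics module is built for precision, not for speed
In the specs for this module, it appears that:
The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this "torture test":
assert mean([1e30, 1, 3, -1e30]) == 1
returning 0 instead of 1, a purely computational error of 100%.
Using math.fsum inside mean will make it more accurate with float data, but it also has the side effect of converting any arguments to float even when unnecessary. E.g. we should expect the mean of a list of Fractions to be a Fraction, not a float.
Conversely, if we look at the implementation of _sum() in this module, the first lines of the method's docstring seem to confirm that: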
def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.
    [...] """
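So yes, the statistics implementation of sum, instead of being a simple one-line call to Python's built-in sum() function, takes about 20 lines by itself, with a nested for loop in its body.
This is because statistics._sum chooses to guarantee maximum precision for every type of number it may encounter (even when they differ widely in magnitude), rather than simply emphasizing speed.
Hence, it seems normal that the built-in sum turns out to be about a hundred times faster. The price is much lower precision if you happen to call it with exotic numbers.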
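To make the "torture test" concrete, here is a minimal sketch that computes the same mean three ways. The helper naive_mean is my own name, not part of any library; it uses an explicit loop because, as far as I know, newer CPython releases (3.12 and later) apply compensated summation inside the built-in sum for floats, which would hide the effect.

```python
import math
from statistics import mean


def naive_mean(values):
    """Hypothetical helper: plain left-to-right float accumulation."""
    total = 0.0
    for v in values:
        total += v  # 1 and 3 are absorbed by 1e30 and lost here
    return total / len(values)


data = [1e30, 1, 3, -1e30]

naive_result = naive_mean(data)            # loses the small terms
fsum_result = math.fsum(data) / len(data)  # exactly rounded float sum
stats_result = mean(data)                  # exact rational arithmetic inside

print(naive_result, fsum_result, stats_result)
```

The plain loop loses the 1 and the 3 entirely, while math.fsum and statistics.mean both recover them.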
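Other options
If you need to prioritize speed in your algorithms, you should look at NumPy instead, whose algorithms are implemented in C.
NumPy's mean is not as precise as statistics by a long shot, but it implements (since 2013) a routine based on pairwise summation, which is better than a naive sum/len.
However...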
import numpy as np
import statistics
np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])
print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))
> NumPy mean: 0.0
> Statistics mean: 1.0
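For intuition about pairwise summation, here is a rough pure-Python sketch. It is not NumPy's actual routine; the names pairwise_sum, naive_sum and the block size of 8 are assumptions chosen for illustration. It shows that splitting the data in halves reduces accumulated rounding error on a long list, yet does nothing for the cancellation in the torture test.

```python
import math


def naive_sum(values):
    """Plain left-to-right float accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values, block=8):
    """Illustrative pairwise summation: split in halves, sum small blocks naively."""
    if len(values) <= block:
        return naive_sum(values)
    mid = len(values) // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)


data = [0.1] * 100_000
reference = math.fsum(data)  # correctly rounded sum of the floats

naive_error = abs(naive_sum(data) - reference)
pairwise_error = abs(pairwise_sum(data) - reference)

# Pairwise splitting does not rescue the torture test: the small terms are still absorbed.
torture = pairwise_sum([1e30, 1.0, 3.0, -1e30])

print(naive_error, pairwise_error, torture)
```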
If you care about speed, use numpy/scipy/pandas instead:
In [119]: from random import randint; from statistics import mean; import numpy as np;
In [122]: l=[randint(0, 10000) for i in range(10**6)]
In [123]: mean(l)
Out[123]: 5001.992355
In [124]: %timeit mean(l)
1 loop, best of 3: 2.01 s per loop
In [125]: a = np.array(l)
In [126]: np.mean(a)
Out[126]: 5001.9923550000003
In [127]: %timeit np.mean(a)
100 loops, best of 3: 2.87 ms per loop
Conclusion: it will be orders of magnitude faster. In my example it was 700 times faster, but it may be less precise (since numpy doesn't use the Kahan summation algorithm).
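Since Kahan summation comes up here, this is a small sketch of the textbook algorithm, written by hand for illustration (kahan_sum and naive_sum are my own names, not library functions). It carries a running compensation term that captures the low-order bits lost at each addition.

```python
import math


def naive_sum(values):
    """Plain left-to-right float accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Textbook Kahan (compensated) summation."""
    total = 0.0
    compensation = 0.0  # running estimate of the lost low-order bits
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y  # what was rounded away in total + y
        total = t
    return total


data = [0.1] * 100_000
reference = math.fsum(data)

naive_error = abs(naive_sum(data) - reference)
kahan_error = abs(kahan_sum(data) - reference)

print(naive_error, kahan_error)
```

Note that plain Kahan summation still returns 0.0 for [1e30, 1, 3, -1e30], because the compensation itself gets absorbed when the large term is subtracted, so it would not replace the exact approach used by statistics.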
I asked the same question a while back, but once I noticed the _sum function that mean calls, on line 317 of the source, I understood why:
def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:
    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:
    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)
    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n,d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
Far more operations happen here than in a plain call to the built-in sum, because, as the docstring says, mean computes a high-precision sum.
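The core idea of that function can be boiled down to a few lines. The sketch below is a simplification under my own assumptions: exact_sum is a hypothetical name, and it skips the type tracking and the NaN/INF handling that the real _sum performs. Each value is converted to an exact numerator/denominator pair, numerators are accumulated per denominator as integers, and only then are the groups added as Fractions.

```python
from fractions import Fraction


def exact_sum(values):
    """Simplified illustration: exact rational sum of ints and finite floats."""
    partials = {}
    for v in values:
        ratio = Fraction(v)  # exact conversion, no rounding
        n, d = ratio.numerator, ratio.denominator
        partials[d] = partials.get(d, 0) + n  # integer addition, never rounds
    return sum(Fraction(n, d) for d, n in sorted(partials.items()))


data = [1e30, 1.0, 3.0, -1e30]
total = exact_sum(data)
exact_mean = float(total / len(data))  # a single rounding at the very end

print(total, exact_mean)
```

Every step before the final float() is exact integer or rational arithmetic, which is where both the accuracy and the extra cost come from.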
You can see that using mean vs sum can give you different output:
In [7]: l = [.1, .12312, 2.112, .12131]
In [8]: sum(l) / len(l)
Out[8]: 0.6141074999999999
In [9]: mean(l)
Out[9]: 0.6141075
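One way to read the difference in the last digits: a float accumulation rounds after every addition, whereas mean rounds only once, when the exact rational result is converted back to float. The sketch below reproduces that with Fraction; the variable names are mine. The explicit loop stands in for the old behaviour of the built-in sum, which, as far as I know, became more accurate for floats in Python 3.12 and later, so sum(l) may no longer show the discrepancy there.

```python
from fractions import Fraction
from statistics import mean

l = [.1, .12312, 2.112, .12131]

# Exact rational mean of the stored float values, rounded to float once.
exact_mean = sum(Fraction(x) for x in l) / len(l)
rounded_once = float(exact_mean)

# Float accumulation: every += may round.
running = 0.0
for x in l:
    running += x
rounded_each_step = running / len(l)

print(rounded_once, rounded_each_step, mean(l))
```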
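len() and sum() are both Python built-in functions (with limited functionality), written in C and, more importantly, optimized to work fast with certain types or objects (lists).
You can look at the implementation of the built-in functions here: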
https://hg.python.org/sandbox/python2.7/file/tip/Python/bltinmodule.c
statistics.mean() is a high-level function written in Python. Take a look at how it is implemented:
https://hg.python.org/sandbox/python2.7/file/tip/Lib/statistics.py
You can see that it internally calls another function named _sum(), which performs a few additional checks compared to the built-in functions.
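Some of those extra checks concern the element types. As a small illustration using only documented standard-library calls (the variable names are mine), mean keeps Fraction and Decimal results in their own type, math.fsum converts everything to float, and mixing Fraction with Decimal is rejected:

```python
import math
from decimal import Decimal
from fractions import Fraction
from statistics import mean

fraction_mean = mean([Fraction(1, 3), Fraction(2, 3)])    # stays a Fraction
decimal_mean = mean([Decimal("0.1"), Decimal("0.3")])     # stays a Decimal
fsum_total = math.fsum([Fraction(1, 3), Fraction(2, 3)])  # always a float

# Mixing Fraction and Decimal is treated as an error.
try:
    mean([Fraction(1, 2), Decimal("0.5")])
    mixed_rejected = False
except TypeError:
    mixed_rejected = True

print(fraction_mean, decimal_mean, type(fsum_total).__name__, mixed_rejected)
```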