Why is "median" 2x faster than "mean" when using the statistics package?

Question

Agu*_*guy 4 python statistics numpy python-3.x

This surprised me... To illustrate, I used this small piece of code to compute the mean and median of 1M random numbers:

import numpy as np
import statistics as st

import time

listofrandnum = np.random.rand(1000000,)

t = time.time()
print('mean is:', st.mean(listofrandnum))
print('time to calc mean:', time.time()-t)

print('\n')

t = time.time()
print('median is:', st.median(listofrandnum))
print('time to calc median:', time.time()-t)

The results are as follows:

mean is: 0.499866595037
time to calc mean: 2.0767598152160645


median is: 0.499721597395
time to calc median: 0.9687695503234863

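My question: why is the mean slower than the median? The median needs some sorting algorithm (i.e., comparisons), while the mean only needs a summation. Is summing slower than comparing?

I would greatly appreciate your insights on this.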

Answer 1

use*_*ica 9

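The statistics module is not part of NumPy. It is a Python standard library module with a rather different design philosophy: it goes for accuracy at all costs, even for unusual input data types and extremely poorly conditioned input. Performing a sum the way the statistics module does it is really expensive, more so than performing a sort.

If you want an efficient mean or median of a NumPy array, use the NumPy routines: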

numpy.mean(whatever)
numpy.median(whatever)

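To compare the four calls side by side, a rough timing sketch along these lines should work. The helper `time_once` and the array size of 100,000 (smaller than the 1M in the question, so it finishes quickly) are assumptions made for illustration; absolute timings will vary by machine and Python version.

```python
import statistics as st
import timeit

import numpy as np

# Assumed size: smaller than the 1M elements in the question so this runs quickly.
data = np.random.default_rng(0).random(100_000)


def time_once(func, arg):
    """Return the wall-clock seconds taken by a single call of func(arg)."""
    return timeit.timeit(lambda: func(arg), number=1)


timings = {
    "statistics.mean": time_once(st.mean, data),
    "statistics.median": time_once(st.median, data),
    "numpy.mean": time_once(np.mean, data),
    "numpy.median": time_once(np.median, data),
}

for name, seconds in timings.items():
    print(f"{name:>18}: {seconds * 1000:8.2f} ms")
```

The NumPy routines operate on the array's underlying buffer directly, whereas the statistics functions iterate over the array one Python object at a time.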

If you want to see the expensive work the statistics module goes through for a sum, you can look at the source code:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.


    Examples
    --------

    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n,d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)

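The core idea in `_sum` can be sketched in a few lines: every value is converted to an exact integer ratio and accumulated as a rational number, so nothing is rounded along the way. The helper `exact_sum` below is a hypothetical, simplified illustration and not the module's actual code (it skips the grouping by denominator, type coercion, and NaN/INF handling shown above). `math.fsum` is included as a standard-library alternative that also avoids losing the small terms.

```python
from fractions import Fraction
import math


def exact_sum(values):
    """Sum floats exactly by converting each one to an integer ratio first.

    Simplified illustration only; not the statistics module's implementation.
    """
    total = Fraction(0)
    for x in values:
        numerator, denominator = x.as_integer_ratio()  # exact, no rounding
        total += Fraction(numerator, denominator)
    return total


# Same poorly conditioned input as in the _sum docstring.
data = [1e50, 1.0, -1e50] * 1000

print(exact_sum(data))  # exact rational result
print(math.fsum(data))  # accurately rounded float sum
```

Each element costs a Python-level method call plus arbitrary-precision integer arithmetic, which is presumably why a million-element sum done this way ends up slower than a single call to `sorted`.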

Using the NumPy methods, on my computer the mean takes 2 ms and the median takes 16 ms. (4 upvotes)
