Descriptive statistics for a frequency table in pandas

Question

I have a frequency table of test scores:

score    count
-----    -----
  77      1105
  78       940
  79      1222
  80      4339
etc

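I would like to display basic statistics and a box plot for the sample summarized by the frequency table. (For example, the mean of the sample above is 79.16 and the median is 80.)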

Is there a way to do this in pandas? All the examples I have seen assume a table of individual cases.

I suppose I could generate a list of individual scores, like this:

In [2]: s = pd.Series([77] * 1105 + [78] * 940 + [79] * 1222 + [80] * 4339)
In [3]: s.describe()
Out[3]: 
count    7606.000000
mean       79.156324
std         1.118439
min        77.000000
25%        78.000000
50%        80.000000
75%        80.000000
max        80.000000
dtype: float64

But I would like to avoid that; in the real, non-toy dataset the total frequency runs into the billions.

Any help is appreciated.

(I believe this is a different question from "Using describe() with weighted data", which is about applying weights to individual cases.)

Answer 1

ayh*_*han 5

Here is a small function that computes descriptive statistics for a frequency distribution:

# from __future__ import division (for Python 2)
import numpy as np
import pandas as pd

def descriptives_from_agg(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    # sort the values in ascending order and reorder the frequencies to match
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()
    fx = values * freqs
    mean = fx.sum() / count
    # population variance via E[x^2] - mean^2
    variance = ((freqs * values**2).sum() / count) - mean**2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    minimum = np.min(values)
    maximum = np.max(values)
    # quantiles: first value whose cumulative frequency reaches the target share
    # (no interpolation between neighbouring values)
    cumcount = np.cumsum(freqs)
    Q1 = values[np.searchsorted(cumcount, 0.25*count)]
    Q2 = values[np.searchsorted(cumcount, 0.50*count)]
    Q3 = values[np.searchsorted(cumcount, 0.75*count)]
    idx = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    result = pd.Series([count, mean, std, minimum, Q1, Q2, Q3, maximum], index=idx)
    return result

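For reference, the quantities the function computes can be written out as follows. This is a restatement of the code above rather than anything taken from the pandas documentation; the symbols $x_i$ (distinct values in ascending order), $f_i$ (their frequencies) and $N$ (total count) are introduced here for illustration.

```latex
N = \sum_i f_i, \qquad
\bar{x} = \frac{1}{N} \sum_i f_i x_i, \qquad
s^2 = \frac{N}{N-1} \left( \frac{1}{N} \sum_i f_i x_i^2 - \bar{x}^2 \right)

Q_p = x_k, \qquad k = \min \Bigl\{ j : \sum_{i \le j} f_i \ge pN \Bigr\},
\qquad p \in \{0.25,\ 0.50,\ 0.75\}
```

Because $Q_p$ is always one of the observed values, it can differ from what `describe()` reports in cases where pandas' default linear interpolation falls between two neighbouring values.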

Demo:

np.random.seed(0)

val = np.random.normal(100, 5, 1000).astype(int)

pd.Series(val).describe()
Out: 
count    1000.000000
mean       99.274000
std         4.945845
min        84.000000
25%        96.000000
50%        99.000000
75%       103.000000
max       113.000000
dtype: float64

vc = pd.Series(val).value_counts()  # top-level pd.value_counts is deprecated in recent pandas
descriptives_from_agg(vc.index, vc.values)

Out: 
count    1000.000000
mean       99.274000
std         4.945845
min        84.000000
25%        96.000000
50%        99.000000
75%       103.000000
max       113.000000
dtype: float64

Note that this does not handle NaN and has not been properly tested.
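
As a cross-check, the four rows shown in the question can be run through a similar computation without expanding them into individual scores. This is a minimal sketch and not part of the original answer: the names `freq`, `scores` and `counts` are chosen here for illustration, and it assumes the rows are already sorted by score. It uses `np.average` with weights and takes the variance from squared deviations, which tends to be less sensitive to cancellation than the `E[x^2] - mean^2` form when the values are large.

```python
import numpy as np
import pandas as pd

# Frequency table from the question (only the four rows shown there).
freq = pd.DataFrame({"score": [77, 78, 79, 80],
                     "count": [1105, 940, 1222, 4339]})

scores = freq["score"].to_numpy()
counts = freq["count"].to_numpy()
n = counts.sum()

# Weighted mean: each score counts as many times as its frequency.
mean = np.average(scores, weights=counts)

# Sample variance from squared deviations, with the n / (n - 1) correction.
var = np.average((scores - mean) ** 2, weights=counts) * n / (n - 1)
std = np.sqrt(var)

# Median: first score whose cumulative count reaches half of the total
# (assumes the rows are sorted by score, as they are here).
median = scores[np.searchsorted(np.cumsum(counts), 0.5 * n)]

print(n, round(mean, 6), round(std, 6), median)
```

The results should agree with the `s.describe()` output in the question (count 7606, mean about 79.156, std about 1.118, median 80), while only ever touching four rows.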

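The question also asks for a box plot, which the function above does not produce. One possible approach, sketched here under assumptions, is to compute the box statistics from the frequency table and hand them to matplotlib's `Axes.bxp`, which draws a box plot from precomputed statistics instead of raw data. The helper name `boxplot_stats_from_freq` is hypothetical, the quartiles use the same no-interpolation rule as the answer, the whiskers follow the common 1.5 × IQR convention, and each outlying value is listed once regardless of its frequency.

```python
import numpy as np

def boxplot_stats_from_freq(values, freqs, label="scores"):
    """Build one stats dict for matplotlib's Axes.bxp from a frequency table."""
    values = np.asarray(values)
    freqs = np.asarray(freqs)
    order = np.argsort(values)
    values, freqs = values[order], freqs[order]

    # Ignore values that never occur so whiskers land on observed values only.
    keep = freqs > 0
    values, freqs = values[keep], freqs[keep]

    n = freqs.sum()
    cum = np.cumsum(freqs)
    # Quartiles: first value whose cumulative frequency reaches q * n.
    q1, med, q3 = (values[np.searchsorted(cum, q * n)] for q in (0.25, 0.50, 0.75))

    # Whiskers extend to the most extreme observed values within 1.5 * IQR.
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = (values >= lo_fence) & (values <= hi_fence)

    return {
        "label": label,
        "med": med,
        "q1": q1,
        "q3": q3,
        "whislo": values[inside].min(),
        "whishi": values[inside].max(),
        "fliers": values[~inside],  # distinct outlying values, one marker each
    }

stats = boxplot_stats_from_freq([77, 78, 79, 80], [1105, 940, 1222, 4339])
print(stats)

# To draw it (requires matplotlib):
# import matplotlib.pyplot as plt
# fig, ax = plt.subplots()
# ax.bxp([stats])
# plt.show()
```

Since only the distinct values and their counts are touched, the cost depends on the number of rows in the frequency table and not on the total frequency.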