Python: weighted median algorithm with pandas

sve*_*esh 13 python algorithm pandas

I have a data frame that looks like this:

Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476

I want to calculate the weighted median of the column impwealth, using the frequency weights in indweight. My pseudocode looks like this:

# Sort `impwealth` in ascending order 
df.sort_values('impwealth', inplace=True)

# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)

# Search for the first occurrence of `indweight` that is greater than P
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()

# The value of `impwealth` associated with this index will be the weighted median
w_median = df.loc[i, 'impwealth']

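This approach seems clunky, and I am not sure it is correct. I did not find a built-in method for this in the pandas reference. What is the best way to find the weighted median?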

pro*_*der 11

If you want to do this in pure pandas, here is one way. It does not interpolate, either. (@svenkatesh, you missed the cumulative sum in your pseudocode.)

df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]

This gives a median of 925000.
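To check that figure, the steps can be traced on the ten rows from the question. This is only a sketch: the data frame is rebuilt by hand here for illustration (the answer itself works on an existing `df`), and it uses a non-mutating `sort_values` so the original row order is left alone.

```python
import pandas as pd

# The ten rows from the question, typed in by hand for illustration.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

# Sort by value without modifying df in place.
ordered = df.sort_values("impwealth")

# Running total of the weights, in value order.
cumsum = ordered["indweight"].cumsum()

# Half of the total weight.
cutoff = ordered["indweight"].sum() / 2.0

# First value whose running weight reaches the cutoff.
median = ordered.loc[cumsum >= cutoff, "impwealth"].iloc[0]

print(cutoff)
print(median)
```

The cutoff is about 218.9. The running weight is about 202.7 at the row with impwealth 876000 and about 256.2 at the row with 925000, so 925000 is the first value to reach the cutoff.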


chr*_*isb 7

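Have you tried the wqantiles package? I have never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you will probably want to double-check that it uses the method you expect).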

In [12]: import weighted

In [13]: weighted.median(df['impwealth'], df['indweight'])
Out[13]: 914662.0859091772

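  • Typo: wqantiles -> wquantiles (3 upvotes)
  • Personally, I am wary of installing a package for something that takes a few lines of code, but if you need an interpolated weighted median, this may be the best approach. (2 upvotes)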

Max*_*nis 6

This function generalizes proofreader's solution:

def weighted_median(df, val, weight):
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.
    return df_sorted[cumsum >= cutoff][val].iloc[0]

In this example, the call would be weighted_median(df, 'impwealth', 'indweight').
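A quick sanity check can show what "no interpolation" means in practice. The sketch below restates the function so it runs on its own and applies it to three small frames; the frames and their column names `x` and `w` are hypothetical and not part of the question.

```python
import pandas as pd


def weighted_median(df, val, weight):
    # Same logic as the function above, restated so this block is self-contained.
    df_sorted = df.sort_values(val)
    cumsum = df_sorted[weight].cumsum()
    cutoff = df_sorted[weight].sum() / 2.0
    return df_sorted[cumsum >= cutoff][val].iloc[0]


# Hypothetical toy frames for illustration only.
odd = pd.DataFrame({"x": [5, 1, 3, 2, 4], "w": [1, 1, 1, 1, 1]})
even = pd.DataFrame({"x": [4, 1, 3, 2], "w": [1, 1, 1, 1]})
skewed = pd.DataFrame({"x": [1, 2, 3], "w": [1, 1, 10]})

# Unit weights, odd length: matches the ordinary median.
odd_result = weighted_median(odd, "x", "w")

# Unit weights, even length: returns the lower middle value instead of averaging.
even_result = weighted_median(even, "x", "w")

# One heavy row pulls the weighted median toward its value.
skewed_result = weighted_median(skewed, "x", "w")

print(odd_result, even_result, skewed_result)
print(even["x"].median())
```

With unit weights and an even number of rows, the function returns 2 where `Series.median` returns 2.5, because it picks an existing value rather than averaging the two middle ones.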


Max*_*nis 5

You can use this numpy-based solution for weighted percentiles:

import numpy as np


def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # Rescale so the result is consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

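Call it as weighted_quantile(df.impwealth, quantiles=0.5, sample_weight=df.indweight). The weights must be passed by keyword (or with quantiles passed positionally), because Python does not allow a positional argument after a keyword argument.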
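For the median alone, the default (non-`old_style`) path of that function reduces to a few lines. The sketch below is a condensed illustration of that path rather than a replacement for the function. The arrays are the question's values and weights, typed in by hand, and the variable names are my own.

```python
import numpy as np

# Values and weights from the question, typed in by hand for illustration.
values = np.array([180000, 384000, 342000, 1154000, 421300,
                   1210000, 1062500, 1878000, 876000, 925000], dtype=float)
weights = np.array([34.200, 37.800, 39.715, 44.375, 44.375,
                    45.295, 45.295, 46.653, 46.653, 53.476])

# Sort both arrays by value.
order = np.argsort(values)
v, w = values[order], weights[order]

# Place each value at the midpoint of its weight interval, scaled to [0, 1].
positions = (np.cumsum(w) - 0.5 * w) / np.sum(w)

# Linearly interpolate between the two values whose positions straddle 0.5.
interpolated_median = np.interp(0.5, positions, v)

print(interpolated_median)
```

On this data the result is about 914662, in line with the `weighted.median` output quoted in the earlier answer. It falls between 876000 and 925000, the two sorted values on either side of the halfway point, whereas the non-interpolating approach returns 925000 itself.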