Weighted percentile using numpy

use*_*827 22 python numpy percentile weighted

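Is there a way to use the numpy.percentile function to compute a weighted percentile? Or does anyone know of an alternative Python function to compute a weighted percentile?

Thanks!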

All*_*leo 37

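Fully vectorized numpy solution

This is the code I'm using. It isn't optimal (I couldn't write an optimal version in numpy), but it is still faster and more reliable than the accepted solution.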

import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None, 
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # Rescale so the positions span [0, 1], consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

Examples:

weighted_quantile([1,2,9,3.2,4],[0.0,0.5,1.])

array([1. , 3.2, 9. ])

weighted_quantile([1,2,9,3.2,4],[0.0,0.5,1],sample_weight = [2,1,2,4,1])

array([1. , 3.2, 9. ])
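
To see what the `old_style` flag changes, the sketch below reproduces only the normalization step for unit weights. The helper name `cdf_positions` is hypothetical and not part of the answer. With `old_style=True` the positions are rescaled to span exactly [0, 1], and for unweighted data that lines up with the default linear interpolation of `np.percentile`.

```python
import numpy as np

def cdf_positions(sample_weight, old_style=False):
    """Hypothetical helper: the x-coordinates that get passed to np.interp."""
    w = np.asarray(sample_weight, dtype=float)
    pos = np.cumsum(w) - 0.5 * w
    if old_style:
        pos -= pos[0]    # first point moves to 0
        pos /= pos[-1]   # last point moves to 1
    else:
        pos /= w.sum()   # midpoint of each weight block
    return pos

values = np.array([1.0, 2.0, 3.2, 4.0, 9.0])  # already sorted
w = np.ones(len(values))

print(cdf_positions(w))                  # 0.1, 0.3, 0.5, 0.7, 0.9
print(cdf_positions(w, old_style=True))  # 0, 0.25, 0.5, 0.75, 1

new = np.interp(0.25, cdf_positions(w), values)
old = np.interp(0.25, cdf_positions(w, old_style=True), values)
print(new, old, np.percentile(values, 25))  # only `old` matches np.percentile
```

So the default mode treats each value as sitting in the middle of its weight block, while `old_style=True` pins the smallest and largest values to the 0% and 100% marks.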

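  • Nice code. What difference does old_style make? It isn't clear to me. (2 upvotes)
  • A good implementation of the weighted percentile method described in the last section of the wiki page [link](https://en.wikipedia.org/wiki/Percentile#Definition_of_the_Weighted_Percentile_method). (2 upvotes)
  • Note: for integer weights, the result of this function differs from the simpler (or "correct", depending on your definition) approach of "repeat each value k times, where k is its weight", because it interpolates between single points (each with weight k) rather than between k points of equal height. For example, with values=[1, 2] and sample_weight=[1, 3], the weighted median is 1.75, but the unweighted median of [1,2,2,2] is 2. (2 upvotes)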
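
The discrepancy described in the last comment can be reproduced with a few lines. This is a minimal sketch: `midpoint_quantile` is a made-up, compact stand-in for the answer's default (non-`old_style`) path, not the answer's own function.

```python
import numpy as np

def midpoint_quantile(values, weights, q):
    """Hypothetical compact version of the midpoint-CDF interpolation."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = (np.cumsum(weights) - 0.5 * weights) / weights.sum()
    return np.interp(q, cdf, values)

interpolated = midpoint_quantile([1, 2], [1, 3], 0.5)
repeated = np.median(np.repeat([1, 2], [1, 3]))  # median of [1, 2, 2, 2]

print(interpolated)  # 1.75
print(repeated)      # 2.0
```

The CDF positions here are 0.125 and 0.625, so the 0.5 quantile lands three quarters of the way from 1 to 2.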

eus*_*iro 15

A cleaner and simpler version, using this reference for the weighted percentile method.

import numpy as np

def weighted_percentile(data, weights, perc):
    """
    perc : percentile in [0-1]!
    """
    ix = np.argsort(data)
    data = data[ix] # sort data
    weights = weights[ix] # sort weights
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights) # 'like' a CDF function
    return np.interp(perc, cdf, data)
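
One practical detail: the function indexes `data` and `weights` with an index array, so it expects NumPy arrays rather than plain lists. A minimal usage sketch, assuming a thin wrapper that converts its inputs first (the name `weighted_percentile_any` is invented for illustration):

```python
import numpy as np

def weighted_percentile_any(data, weights, perc):
    """Hypothetical wrapper: same midpoint-CDF method, but accepts lists."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    ix = np.argsort(data)
    data, weights = data[ix], weights[ix]
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)
    return np.interp(perc, cdf, data)

# perc is a fraction in [0, 1]; a list of fractions works as well
result = weighted_percentile_any([1, 2, 9, 3.2, 4], [2, 1, 2, 4, 1], [0.0, 0.5, 1.0])
print(result)  # values 1.0, 3.2 and 9.0
```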


Sam*_* A. 15

This now seems to be implemented in statsmodels:

import numpy as np
from statsmodels.stats.weightstats import DescrStatsW
wq = DescrStatsW(data=np.array([1, 2, 9, 3.2, 4]), weights=np.array([0.0, 0.5, 1.0, 0.3, 0.5]))
wq.quantile(probs=np.array([0.1, 0.9]), return_pandas=False)
# array([2., 9.])

The DescrStatsW object also implements other methods, such as the weighted mean. https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.DescrStatsW.html


小智 9

A quick solution: sort first, then interpolate:

import numpy as np

def weighted_percentile(data, percents, weights=None):
    ''' percents in units of 1%
        weights specifies the frequency (count) of data.
    '''
    if weights is None:
        return np.percentile(data, percents)
    ind = np.argsort(data)               # indices that sort the data
    d = data[ind]
    w = weights[ind]
    p = 1. * w.cumsum() / w.sum() * 100  # cumulative weight, in percent
    y = np.interp(percents, p, d)
    return y

  • This gives different results for `weighted_percentile(np.array([0,3,6,9]), 50, weights=np.array([1,3,3,1]))` and `weighted_percentile(np.array([0,3,3,3,6,6,6,9]), 50, weights=None)`. (3 upvotes)
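
The mismatch in that comment is easy to check. In this sketch, `cumulative_percentile` is a hypothetical copy of the answer's weighted branch, with list inputs converted to arrays:

```python
import numpy as np

def cumulative_percentile(data, percents, weights):
    """Hypothetical copy of the weighted branch (plain cumulative sum)."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    ind = np.argsort(data)
    d, w = data[ind], weights[ind]
    p = w.cumsum() / w.sum() * 100  # 12.5, 50, 87.5, 100 for the case below
    return np.interp(percents, p, d)

weighted = cumulative_percentile([0, 3, 6, 9], 50, [1, 3, 3, 1])
unweighted = np.percentile([0, 3, 3, 3, 6, 6, 6, 9], 50)

print(weighted)    # 3.0
print(unweighted)  # 4.5
```

With the plain cumulative sum, the 50% mark falls exactly on the value 3, whereas `np.percentile` on the expanded data interpolates between the fourth and fifth samples (3 and 6).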

HYR*_*YRY 8

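I don't know what a weighted percentile means, but judging from @Joan Smith's answer, it seems you just need to repeat every element in ar. You can use numpy.repeat() for that: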

import numpy as np
np.repeat([1,2,3], [4,5,6])

The result is:

array([1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3])

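  • However, this only supports integer weights. For larger datasets it will likely need a great deal of memory and CPU time. (11 upvotes)
  • From personal experience, I can confirm this approach is definitely not efficient. If your vector is long and the weights are large, your computer will quickly hit its memory limit. (3 upvotes)
  • I think this is the better (more efficient) answer. (2 upvotes)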
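
For small integer counts, the repeat approach plugs straight into `np.percentile`. The sketch below also estimates the memory that the expanded array would need for larger counts. The scaled-up counts are made-up numbers for illustration, and nothing large is actually allocated.

```python
import numpy as np

values = np.array([1, 2, 3])
counts = np.array([4, 5, 6])          # integer weights only

expanded = np.repeat(values, counts)  # length equals counts.sum()
print(expanded.size)                  # 15
print(np.percentile(expanded, 50))    # 2.0

# Hypothetical larger weights: estimate the size without building the array
big_counts = counts * 1_000_000
needed_bytes = int(big_counts.sum()) * values.itemsize
print(needed_bytes)                   # bytes np.repeat would have to allocate
```

The expanded array grows with the sum of the weights rather than with the number of distinct values, which is the cost the comments above point out.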

gro*_*uck 6

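Apologies for the additional (unoriginal) answer (I don't have enough rep to comment on @nayyarv's). His solution worked for me (i.e. it replicates the default behavior of np.percentile), but I think you can eliminate the for loop by taking cues from how the original np.percentile is written.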

import numpy as np

def weighted_percentile(a, q=np.array([75, 25]), w=None):
    """
    Calculates percentiles associated with a (possibly weighted) array

    Parameters
    ----------
    a : array-like
        The input array from which to calculate percents
    q : array-like
        The percentiles to calculate (0.0 - 100.0)
    w : array-like, optional
        The weights to assign to values of a.  Equal weighting if None
        is specified

    Returns
    -------
    values : np.array
        The values associated with the specified percentiles.  
    """
    # Standardize and sort based on values in a
    q = np.array(q) / 100.0
    if w is None:
        w = np.ones(a.size)
    idx = np.argsort(a)
    a_sort = a[idx]
    w_sort = w[idx]

    # Get the cumulative sum of weights
    ecdf = np.cumsum(w_sort)

    # Find the percentile index positions associated with the percentiles
    p = q * (w.sum() - 1)

    # Find the bounding indices (both low and high)
    idx_low = np.searchsorted(ecdf, p, side='right')
    idx_high = np.searchsorted(ecdf, p + 1, side='right')
    idx_high[idx_high > ecdf.size - 1] = ecdf.size - 1

    # Calculate the weights 
    weights_high = p - np.floor(p)
    weights_low = 1.0 - weights_high

    # Extract the low/high indexes and multiply by the corresponding weights
    x1 = np.take(a_sort, idx_low) * weights_low
    x2 = np.take(a_sort, idx_high) * weights_high

    # Return the weighted sum, i.e. the linear interpolation between neighbours
    return np.add(x1, x2)

# Sample data
a = np.array([1.0, 2.0, 9.0, 3.2, 4.0], dtype=float)
w = np.array([2.0, 1.0, 3.0, 4.0, 1.0], dtype=float)

# Make an unweighted "copy" of a for testing
# (the np.float / np.int aliases were removed in NumPy 1.24; use the builtins)
a2 = np.repeat(a, w.astype(int))

# Tests with different percentiles chosen
q1 = np.linspace(0.0, 100.0, 11)
q2 = np.linspace(5.0, 95.0, 10)
q3 = np.linspace(4.0, 94.0, 10)
for q in (q1, q2, q3):
    assert np.all(weighted_percentile(a, q, w) == np.percentile(a2, q))


小智 1

Unfortunately, numpy doesn't have built-in weighted functions, but you can always put something together.

import numpy as np

def weight_array(ar, weights):
    # Repeat each value `w` times; weights must be non-negative integers
    zipped = zip(ar, weights)
    weighted = []
    for a, w in zipped:
        for j in range(w):
            weighted.append(a)
    return weighted


np.percentile(weight_array(ar, weights), 25)

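  • You are assuming the weights are integers. (29 upvotes)
  • Also, it may use a lot of excess memory and CPU time for storing and sorting, respectively. Not suitable for large amounts of data. (11 upvotes)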
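
A note for readers on newer releases: as far as I know, NumPy 2.0 added a `weights` keyword to `np.percentile` and `np.quantile`, supported only with `method="inverted_cdf"` (a step function, no interpolation). The sketch below is guarded by a version check because that version requirement is an assumption about your install. The sample data is borrowed from the examples earlier on this page.

```python
import numpy as np

values = np.array([1.0, 2.0, 9.0, 3.2, 4.0])
weights = np.array([2.0, 1.0, 2.0, 4.0, 1.0])

# Assumption: the `weights` keyword needs NumPy >= 2.0 and method="inverted_cdf"
has_weights = int(np.__version__.split(".")[0]) >= 2

if has_weights:
    built_in = np.percentile(values, 50, weights=weights, method="inverted_cdf")
    # For integer weights this should agree with repeating each value
    repeated = np.percentile(np.repeat(values, weights.astype(int)), 50,
                             method="inverted_cdf")
    print(built_in, repeated)
else:
    print("this NumPy version has no weights= support in np.percentile")
```

Because `inverted_cdf` does not interpolate, its results can differ from the interpolating functions in the answers above.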