Calculating percentiles for bins from numpy digitize?

Question

Bob*_*nOG 2 python numpy histogram percentage pandas

I have a set of data and a set of thresholds that I use to create bins:

import numpy as np

data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0, 5, 10])
bins = np.digitize(data, thresholds, right=True)

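For each element in bins, I want to know the underlying percentile. For example, the smallest bin should start at the 0th percentile, and the next bin might start at, say, the 20th percentile. So if a value in data falls between the 0th and 20th percentile of data, it belongs to the first bin.

I looked into pandas rank(pct=True), but I can't seem to get it to work correctly.

Any suggestions?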

Answer 1

Mar*_*ese 5

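You can compute the percentile of each element in your data array as described in an earlier StackOverflow question (Map each list value to its corresponding percentile).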

import numpy as np
from scipy import stats
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])

Method 1: using scipy.stats.percentileofscore:

data_percentile = np.array([stats.percentileofscore(data, a) for a in data])
data_percentile
Out[1]:
array([  9.09090909,  18.18181818,  36.36363636,  36.36363636,
        36.36363636,  59.09090909,  59.09090909,  95.45454545,
        95.45454545,  72.72727273,  81.81818182])
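
percentileofscore also accepts a score that is not itself in the data, so the same call can be used to ask where the original thresholds sit. The snippet below is a minimal sketch of that idea. It assumes kind="weak" (the percentage of values less than or equal to the score) is the reading of "percentile" wanted here; the default kind="rank" averages ties differently.

```python
import numpy as np
from scipy import stats

data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0, 5, 10])

# kind="weak" gives the percentage of data values <= the score.
# Assumption: this CDF-style definition is the one the question is after.
threshold_percentiles = np.array(
    [stats.percentileofscore(data, t, kind="weak") for t in thresholds]
)
print(threshold_percentiles)
```

Eight of the eleven values are at or below 5, so the middle threshold lands at roughly the 72.7th percentile, while 0 and 10 map to 0 and 100.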

Method 2: using scipy.stats.rankdata and normalizing to 100 (faster):

ranked = stats.rankdata(data)
data_percentile = ranked/len(data)*100
data_percentile
Out[2]:
array([  9.09090909,  18.18181818,  36.36363636,  36.36363636,
        36.36363636,  59.09090909,  59.09090909,  95.45454545,
        95.45454545,  72.72727273,  81.81818182])
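
Since the question mentions pandas rank(pct=True), here is a short sketch showing that it should give the same numbers once scaled to 100. It assumes the default method="average" tie handling in both libraries; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])

# pandas: rank(pct=True) returns average ranks divided by the number of values.
pct_pandas = pd.Series(data).rank(pct=True).to_numpy() * 100

# scipy: the same average ranks, normalized by hand.
pct_scipy = stats.rankdata(data) / len(data) * 100

print(np.allclose(pct_pandas, pct_scipy))
```

If the two disagree on your data, the tie-handling method is the first thing to check.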

Now that you have a list of percentiles, you can bin them with numpy.digitize as before:

bins_percentile = [0,20,40,60,80,100]
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
data_binned_indices
Out[3]:
array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5], dtype=int64)
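
The right=True argument decides which side of each interval is closed, which matters for values that fall exactly on an edge. The probe values below are made up purely to illustrate this; they are not part of the original data.

```python
import numpy as np

edges = [0, 20, 40, 60, 80, 100]
# Hypothetical probe values chosen to sit on or just past the bin edges.
probe = np.array([0.0, 20.0, 20.1, 100.0])

# right=True: index i means edges[i-1] < x <= edges[i]
closed_right = np.digitize(probe, edges, right=True)
# right=False (the default): index i means edges[i-1] <= x < edges[i]
closed_left = np.digitize(probe, edges, right=False)

print(closed_right)  # [0 1 2 5]
print(closed_left)   # [1 2 2 6]
```

With right=True an exact 0 would get index 0, but rank-based percentiles are always greater than 0 and at most 100, so the indices stay in the range 1 to 5 here.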

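This gives you the data binned by index into your chosen list of percentiles. If needed, you can also use numpy.take to return the actual (upper) percentile of each bin: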

data_binned_percentiles = np.take(bins_percentile, data_binned_indices)
data_binned_percentiles
Out[4]:
array([ 20,  20,  40,  40,  40,  60,  60, 100, 100,  80, 100])
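
For reuse, the steps above can be wrapped in one function. This is only a sketch: upper_percentile_edge is a hypothetical name, and it assumes the edges are sorted ascending and span 0 to 100 so that every index returned by np.digitize is valid.

```python
import numpy as np
from scipy import stats


def upper_percentile_edge(values, edges):
    """Return, for each value, the upper percentile edge of the bin it falls in.

    Hypothetical helper; assumes `edges` is ascending and covers 0..100.
    """
    values = np.asarray(values, dtype=float)
    edges = np.asarray(edges)
    # Rank-based percentile of each value (ties share their average rank).
    pct = stats.rankdata(values) / values.size * 100
    # Bin the percentiles; right=True makes each bin closed on its upper edge.
    idx = np.digitize(pct, edges, right=True)
    return edges[idx]


data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
result = upper_percentile_edge(data, [0, 20, 40, 60, 80, 100])
print(result)
```

On the example data this should reproduce the array shown in Out[4].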
