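Suppose I have a huge list of numbers between 0 and 100. I compute the ranges from the maximum value and decide to use 10 bins. So my ranges are, for example: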
ranges = [0,10,20,30,40,50,60,70,80,90,100]
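Now I count the occurrences in each range (0-10, 10-20, and so on). I iterate over every number in the list and check which range it falls into. I don't think this is the best approach in terms of running speed.
Can I speed this up with pandas, e.g. pandas.groupby, and if so, how would I use it?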
cs9*_*s95 18
Surprised I haven't seen this yet, so without further ado, here is
.value_counts(bins=N)
Computing bins with pd.cut followed by a groupby is a two-step process. value_counts gives you a shortcut via the bins argument:
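import numpy as np
import pandas as pd

# Uses Ed Chum's setup. Cross check our answers match!
np.random.seed(0)
# np.random.random_integers is deprecated; randint(1, 101) covers the same inclusive range 1..100
df = pd.DataFrame({"a": np.random.randint(1, 101, size=100)})
df['a'].value_counts(bins=10, sort=False)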
(0.9, 10.9] 11
(10.9, 20.8] 10
(20.8, 30.7] 8
(30.7, 40.6] 13
(40.6, 50.5] 11
(50.5, 60.4] 9
(60.4, 70.3] 10
(70.3, 80.2] 11
(80.2, 90.1] 13
(90.1, 100.0] 4
Name: a, dtype: int64
This creates 10 evenly spaced, right-closed intervals and counts how many of your values fall into each. sort=False is needed to stop value_counts from ordering the result by decreasing count, so the bins stay in interval order.
To use your own bin edges instead, such as the ranges from the question, pass a list to the bins argument:
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
df['a'].value_counts(bins=bins, sort=False)
(-0.001, 10.0] 11
(10.0, 20.0] 10
(20.0, 30.0] 8
(30.0, 40.0] 13
(40.0, 50.0] 11
(50.0, 60.0] 9
(60.0, 70.0] 10
(70.0, 80.0] 11
(80.0, 90.0] 13
(90.0, 100.0] 4
Name: a, dtype: int64
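As a cross-check, here is a minimal sketch comparing the one-step value_counts(bins=...) route with the two-step pd.cut plus groupby route. The sample data, the seed and the names s, edges, shortcut and two_step are assumptions made for illustration; they are not from the original posts. The values are drawn from 1 to 100 so that none of them sits on the lowest edge, which the two routes treat differently.

```python
import numpy as np
import pandas as pd

# Illustrative data: 1000 integers in the inclusive range 1..100.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(1, 101, size=1000), name="a")
edges = list(range(0, 101, 10))  # [0, 10, ..., 100]

# One step: value_counts bins the values and counts them.
shortcut = s.value_counts(bins=edges, sort=False)

# Two steps: bin with pd.cut, then count the rows in each bin.
two_step = s.groupby(pd.cut(s, edges), observed=False).size()

# The interval labels of the first bin differ slightly, so compare counts only.
same = bool((shortcut.to_numpy() == two_step.to_numpy()).all())
print(same)
```

Both routes avoid a Python-level loop over the list, which is the part the question was worried about.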
EdC*_*ica 14
In [82]:
df = pd.DataFrame({"a": np.random.randint(0, 101, size=100)})  # inclusive 0..100; random_integers is deprecated
ranges = [0,10,20,30,40,50,60,70,80,90,100]
df.groupby(pd.cut(df.a, ranges)).count()
Out[82]:
a
a
(0, 10] 10
(10, 20] 6
(20, 30] 12
(30, 40] 9
(40, 50] 11
(50, 60] 12
(60, 70] 9
(70, 80] 13
(80, 90] 9
(90, 100] 9
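One detail worth checking with this approach: pd.cut builds right-closed intervals by default, so the first bin is (0, 10] and a value of exactly 0 falls outside every bin and is left out of the counts. The sketch below uses a small hand-made Series (an assumption for illustration, not the data above) to show the effect, and how include_lowest=True pulls the lowest edge into the first bin.

```python
import pandas as pd

# Tiny illustrative sample that contains the lowest edge (0) twice.
s = pd.Series([0, 0, 5, 10, 11, 100], name="a")
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Default: the zeros match no interval, become NaN, and are dropped by groupby.
default_counts = s.groupby(pd.cut(s, edges), observed=False).size()

# include_lowest=True widens the first interval so that 0 is counted.
inclusive_counts = s.groupby(
    pd.cut(s, edges, include_lowest=True), observed=False
).size()

print(default_counts.sum())    # 4
print(inclusive_counts.sum())  # 6
```

If left-closed bins such as [0, 10) are wanted instead, pd.cut also accepts right=False.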