alv*_*vas 8 python sample percentile dataframe pandas
Given a dataset like this:
import pandas as pd
rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
{'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
{'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
{'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
{'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
{'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
{'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
[out]:
key freq percent
0 ABC 100 0.328947
1 DEF 60 0.197368
2 GHI 50 0.164474
3 JKL 40 0.131579
4 MNO 13 0.042763
5 PQR 11 0.036184
6 STU 10 0.032895
7 VWX 10 0.032895
8 YZZ 3 0.009868
9 WHYQ 3 0.009868
10 HOWEE 2 0.006579
11 DUH 1 0.003289
12 HAHA 1 0.003289
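The goal is to sample keys by cumulative percentile of frequency, as the attempt below does: 1 key from the top 50 percent, 2 keys from the 10 to 50 percent range, and 4 keys from the bottom 10 percent.
In this case, the appropriate answer draws from these pools:
['ABC', 'DEF']
['GHI', 'JKL', 'MNO', 'PQR']
['VWX', 'STU', 'YZZ', 'WHYQ', 'HOWEE', 'HAHA', 'DUH']
I tried this: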
import random
import pandas as pd
rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
{'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
{'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
{'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
{'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
{'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
{'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
bin_50_100 = []
bin_10_50 = []
bin_10 = []
total_percent = 1.0
for idx, row in df.sort_values(by=['freq', 'key'], ascending=False).iterrows():
    # total_percent is the share of frequency not yet consumed by earlier rows
    if total_percent > 0.5:
        bin_50_100.append(row['key'])
    elif total_percent > 0.1:
        bin_10_50.append(row['key'])
    else:
        bin_10.append(row['key'])
    total_percent -= row['percent']
print(random.sample(bin_50_100, 1))
print(random.sample(bin_10_50, 2))
print(random.sample(bin_10, 4))
[out]:
['DEF']
['MNO', 'PQR']
['HOWEE', 'WHYQ', 'HAHA', 'DUH']
But is there a simpler way to solve this problem?
Let's try:
bins = [0, 0.1, 0.5, 1]
samples = [4, 2, 1]  # rows to draw per bin, lowest bin first
df['sample'] = pd.cut(df.percent[::-1].cumsum(),  # accumulate percentage from the bottom up
                      bins=bins,                  # bin edges
                      labels=samples              # label each bin with its number of samples
                      ).astype(int)
df.groupby('sample').apply(lambda x: x.sample(n=x['sample'].iloc[0]))
Output:
             key  freq   percent  sample
sample
1      0     ABC   100  0.328947       1
2      2     GHI    50  0.164474       2
       5     PQR    11  0.036184       2
4      7     VWX    10  0.032895       4
       6     STU    10  0.032895       4
       12   HAHA     1  0.003289       4
       10  HOWEE     2  0.006579       4
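The same idea can be wrapped in a small helper. This is only a minimal sketch, not the answerer's code: `sample_by_percentile` and its parameters (`bins`, `samples`, `freq_col`, `seed`) are hypothetical names introduced here. It assumes the frame is already sorted by frequency in descending order and that every bin holds at least as many rows as requested. It accumulates the integer counts and divides once by the total, so the last cumulative share is exactly 1.0 and lands inside the top bin.

```python
import pandas as pd


def sample_by_percentile(df, bins, samples, freq_col="freq", seed=None):
    """Draw samples[i] rows from the i-th cumulative-share bin (lowest bin first).

    Hypothetical helper: assumes df is sorted by freq_col in descending order.
    """
    total = df[freq_col].sum()
    # Cumulative share accumulated from the last (rarest) row upward,
    # then flipped back so it lines up with df's original row order.
    cum_share = df[freq_col][::-1].cumsum()[::-1] / total
    # Integer bin code per row: 0 for (bins[0], bins[1]], 1 for the next, ...
    codes = pd.cut(cum_share, bins=bins, labels=False)
    parts = []
    for code, n in enumerate(samples):
        group = df[codes == code]
        # DataFrame.sample raises ValueError if the bin has fewer than n rows.
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)


rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)

# 4 rows from the bottom 10%, 2 from the 10-50% range, 1 from the top 50%.
picked = sample_by_percentile(df, bins=[0, 0.1, 0.5, 1], samples=[4, 2, 1], seed=0)
print(picked)
```

Looping over the bin codes instead of calling `groupby(...).apply` is a design choice made for this sketch: it keeps the result a flat frame and does not rely on how `apply` treats the grouping column, which appears to differ between pandas versions.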