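I have a df that contains a large number of Places recurring over time periods. These Places start and finish at random. For each time period, I want to assign each unique Place to a Group. The central rules are:
1) Each Group can hold at most 3 unique Places at any one time
2) Unique Places should be evenly distributed across the Groups
I have broken down a small section of the df. There are 7 unique values (but at most 5 occurring at any one time) and 2 Groups to choose from. In practice, however, the df could contain up to 50 unique values in total, starting and finishing across different time periods, distributed over up to 6 Groups.
To understand how many Places are currently occurring, I have added Total, which is based on whether the Place appears again.
The df contains every available Group for each unique Place in each Period. The Places GOLF and CLUB will finish, but we assume that all other Places continue, since they appear later in the df.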
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6],
    'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS'],
    'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5],
    'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1'],
    })
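The main issue causing me trouble is that Places appear and finish dynamically: some finish and new ones start. So assigning and distributing the current unique Places needs to account for this.
Attempt: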
def AssignPlace(df):
    uniquePlaces = df['Place'].unique()
    # every 3 unique places (in order of first appearance) share one group number
    G3 = dict(zip(uniquePlaces, np.arange(len(uniquePlaces)) // 3 + 1))
    df['Assigned Group'] = df['Place'].map(G3)
    return df

df = df.groupby('Available Group', sort=False).apply(AssignPlace)
df = df.drop_duplicates(subset = ['Period','Place'])
Out:
    Period   Place  Total  Available Group  Assigned Group
0        1    CLUB      1                1               1
1        2    CLUB      1                2               1
3        2    HOME      2                2               1
5        2    AWAY      3                2               1
7        3    WORK      4                2               2
9        3    AWAY      4                1               1
11       3    GOLF      5                1               2  #GOLF FINISHES SO 4 OCCURRING FROM NEXT ROW
13       4    CLUB      4                2               1  #CLUB FINISHES BUT POOL STARTS SO STILL 4 OCCURRING FROM NEXT ROW
15       4    POOL      4                2               2
17       4    HOME      4                2               1
19       5    WORK      4                2               2
21       5    AWAY      4                1               1
23       5    POOL      4                1               2
25       6  TENNIS      5                2               3  #Signifies issue
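The last row shows where the problem starts. Assigned Group correctly counts the Place as the 7th unique value, but it does not take the currently occurring unique values into account. Since CLUB and GOLF have finished, there are only 5 current unique values and 2 available Groups, yet it returns Group 3. So it keeps counting every new unique value instead of considering only the unique values that are occurring at that point.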
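To make the difference between "unique values seen so far" and "unique values currently occurring" concrete, here is a minimal sketch that counts the active Places per Period. It assumes a de-duplicated copy of the sample data, treats a Place as active from its first to its last Period, and extends every Place except GOLF and CLUB to the last Period (the "all other Places continue" assumption from the question). The names `spans`, `finished` and `active` are illustrative only.

```python
import pandas as pd

# De-duplicated sample: one row per (Period, Place)
df = pd.DataFrame({
    "Period": [1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6],
    "Place": ["CLUB", "CLUB", "HOME", "AWAY", "WORK", "AWAY", "GOLF",
              "CLUB", "POOL", "HOME", "WORK", "AWAY", "POOL", "TENNIS"],
})

# First and last Period in which each Place appears
spans = df.groupby("Place")["Period"].agg(start="min", end="max")

# Assumption: only GOLF and CLUB really finish; the rest continue past the sample
finished = ["GOLF", "CLUB"]
spans.loc[~spans.index.isin(finished), "end"] = df["Period"].max()

# Number of Places active in each Period (start <= period <= end)
active = pd.Series(
    {int(p): int(((spans["start"] <= p) & (spans["end"] >= p)).sum())
     for p in sorted(df["Period"].unique())},
    name="Active",
)
print(active)
```

Under these assumptions Period 6 has 5 active Places even though TENNIS is the 7th unique value overall, which is why 2 Groups of 3 should still be enough.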
Expected output, where the Assigned Group for TENNIS is now 1 instead of 3:
    Period   Place  Total  Available Group  Assigned Group
0        1    CLUB      1                1               1
1        2    CLUB      1                2               1
3        2    HOME      2                2               1
5        2    AWAY      3                2               1
7        3    WORK      4                2               2
9        3    AWAY      4                1               1
11       3    GOLF      5                1               2
13       4    CLUB      4                2               1
15       4    POOL      4                2               2
17       4    HOME      4                2               1
19       5    WORK      4                2               2
21       5    AWAY      4                1               1
23       5    POOL      4                1               2
25       6  TENNIS      5                2               1
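As a quick sanity check on rule 1, the sketch below computes the largest number of simultaneously active Places that any single Group holds. The helper `max_group_load` and the `spans` dictionary are hypothetical: the (start, end) Periods are read off the sample, with every Place other than GOLF and CLUB assumed to run through Period 6.

```python
from collections import Counter

def max_group_load(spans, groups):
    """Largest number of concurrently active places held by any one group."""
    first = min(start for start, _ in spans.values())
    last = max(end for _, end in spans.values())
    worst = 0
    for period in range(first, last + 1):
        # count active places per group in this period
        load = Counter(groups[place]
                       for place, (start, end) in spans.items()
                       if start <= period <= end)
        worst = max(worst, max(load.values(), default=0))
    return worst

# Assumed (start, end) periods, inclusive
spans = {"CLUB": (1, 4), "HOME": (2, 6), "AWAY": (2, 6), "WORK": (3, 6),
         "GOLF": (3, 3), "POOL": (4, 6), "TENNIS": (6, 6)}

# Assigned Group per place, taken from the expected output
expected = {"CLUB": 1, "HOME": 1, "AWAY": 1, "WORK": 2,
            "GOLF": 2, "POOL": 2, "TENNIS": 1}

print(max_group_load(spans, expected))  # 3
```

With these spans the expected assignment never puts more than 3 active Places in a Group, so moving TENNIS into Group 1 is consistent with rule 1.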
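Here is my attempt. The explanation is in the code comments; if that isn't enough, leave me a comment here.
Note: I added 5 dummy rows at the bottom to simulate that these places will appear later in the df. So please ignore the rows with Period = 0.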
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6,0,0,0,0,0],
'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS', "AWAY","HOME","POOL","WORK", "TENNIS"],
# 'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5,0,0,0,0,0],
# 'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1',0,0,0,0,0],
})
# df to store all unique places
uniquePlaces = pd.DataFrame(df["Place"].unique(), columns=["Place"])
# Start stores index of df where the place appears 1st
uniquePlaces["Start"] = -1
# End stores index of df where the place appears last
uniquePlaces["End"] = -1
## adds new column "Place Label" which is label encoded value for a place
## "Place Label" may not be necessary but it may improve performance when looking up and merging
## this function also updates Start and End of current label in group
def assign_place_label(group):
    label=uniquePlaces[uniquePlaces["Place"]==group.name].index[0]
    group["Place Label"] = label
    uniquePlaces.loc[label, "Start"] = group.index.min()
    uniquePlaces.loc[label, "End"] = group.index.max()
    return group

## based on Start and End of each place, assign an index to each place
## when a place is freed, its index is reused for the next new place that appears
def get_dynamic_group(up):
    up["Index"] = 0
    up["Freed"] = False
    max_ind=0
    free_indx = []
    for i in range(len(up)):
        # places that ended before the current place starts and are not yet freed
        ind_freed = up.index[(up["End"]<up.iloc[i]["Start"]) & (~up["Freed"])]
        free = list(up.loc[ind_freed, "Index"])
        free_indx += free
        up.loc[ind_freed, "Freed"] = True
        if len(free_indx)>0:
            # reuse the smallest freed index
            m = min(free_indx)
            up.loc[i, "Index"] = m
            free_indx.remove(m)
        else:
            up.loc[i, "Index"] = max_ind
            max_ind+=1
    # 3 indices per group
    up["Group"] = up["Index"]//3+1
    return up

df2 = df.groupby("Place").apply(assign_place_label)
uniquePlaces = get_dynamic_group(uniquePlaces)
display(uniquePlaces)
df3 = df2[df2.Period!=0].drop_duplicates(subset = ['Period','Place'])
result = df3.merge(uniquePlaces[["Group"]], how="left", left_on="Place Label",
right_index=True, sort=False)
display(result)
Output:
    Period   Place  Place Label  Group
0        1    CLUB            0      1
1        2    CLUB            0      1
3        2    HOME            1      1
5        2    AWAY            2      1
7        3    WORK            3      2
9        3    AWAY            2      1
11       3    GOLF            4      2
13       4    CLUB            0      1
15       4    POOL            5      2
17       4    HOME            1      1
19       5    WORK            3      2
21       5    AWAY            2      1
23       5    POOL            5      2
25       6  TENNIS            6      1
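The slot-reuse idea can also be sketched without pandas, using two heaps. This is an alternative illustration rather than the code above: it works on Period intervals instead of row indices, and a slot is released only when its Place ended strictly before the new Place's first Period. The function `assign_groups` and the `intervals` input are hypothetical, and `math.inf` marks Places assumed to continue beyond the sample.

```python
import heapq
import math

def assign_groups(intervals, group_size=3):
    """Map place -> group, reusing the slots of places that have finished.

    intervals: dict of place -> (start_period, end_period), inclusive.
    """
    free_slots = []  # min-heap of slot numbers released by finished places
    active = []      # min-heap of (end_period, slot) for places still running
    next_slot = 0
    groups = {}
    # sorted() is stable, so places starting in the same period keep input order
    for place, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # release the slots of places that ended strictly before this start
        while active and active[0][0] < start:
            _, slot = heapq.heappop(active)
            heapq.heappush(free_slots, slot)
        if free_slots:
            slot = heapq.heappop(free_slots)  # smallest freed slot first
        else:
            slot = next_slot
            next_slot += 1
        heapq.heappush(active, (end, slot))
        groups[place] = slot // group_size + 1
    return groups

# Assumed intervals: only CLUB and GOLF finish inside the sample
intervals = {
    "CLUB": (1, 4), "HOME": (2, math.inf), "AWAY": (2, math.inf),
    "WORK": (3, math.inf), "GOLF": (3, 3), "POOL": (4, math.inf),
    "TENNIS": (6, math.inf),
}
print(assign_groups(intervals))
```

Filling the lowest free slot first keeps each Group at or below `group_size` active Places, but it does not by itself balance Places evenly across Groups (rule 2) or cap the number of Groups, so those would need extra handling.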