Assign values from different options - pandas

jon*_*boy 4 python pandas

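I have a df that contains a large number of Places recurring over time periods. These Places start and finish at random. For each time period, I want to assign each unique Place to a Group. The central rules are:

1) Each Group can hold no more than 3 unique Places at any one time

2) Unique Places should be evenly distributed across the Groups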
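One way to read the two rules together, sketched under assumptions: when a new Place appears, put it in the Group that currently holds the fewest active Places, skipping any Group that already holds 3. The `pick_group` helper and its `active` mapping are hypothetical names, not part of the question's code, and this least-loaded reading of "evenly" is only one interpretation. The expected table in the question fills Group 1 before using Group 2, so this stricter balancing would not reproduce it.

```python
from collections import Counter

MAX_PER_GROUP = 3  # rule 1: at most 3 unique Places per Group at one time


def pick_group(active, n_groups):
    """Pick a Group for a newly appearing Place.

    active   -- hypothetical dict mapping each currently occurring Place
                to the Group it already holds
    n_groups -- number of Groups available (numbered from 1)
    """
    load = Counter(active.values())
    # Only Groups with spare capacity are candidates (rule 1)
    candidates = [g for g in range(1, n_groups + 1) if load[g] < MAX_PER_GROUP]
    if not candidates:
        raise ValueError("every Group already holds 3 Places")
    # Least-loaded Group first, lowest Group number on ties (rule 2)
    return min(candidates, key=lambda g: (load[g], g))


print(pick_group({"CLUB": 1, "HOME": 1, "AWAY": 2}, n_groups=2))
```

Here Group 1 already holds two Places and Group 2 holds one, so the new Place goes to Group 2.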

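I've broken out a small section of the df below. There are 7 unique values (but no more than 5 occur at any one time) and 2 Groups to choose from. In practice, however, the df could contain up to 50 unique values that start and finish at different time periods, distributed across up to 6 Groups.

To show how many Places are currently occurring, I've included Total, which is based on whether a Place appears again.

The df contains every available Group for each unique Place in each Period. The Places Golf and Club will finish, but we assume all other Places continue, as they appear later in the df.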

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6],  
    'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS'],                                
    'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5],                            
    'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1'],                           
    })

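The main thing causing me trouble is that Places appear and finish dynamically: some finish while new ones start. So assigning and distributing the currently unique Places needs to account for this.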
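To make "currently occurring" concrete, one option is to record the first and last Period of each Place and treat a Place as active between the two. The sketch below uses a hypothetical miniature frame rather than the question's data, and the `span` and `active_places` names are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the question's df (Period and Place only)
df = pd.DataFrame({
    'Period': [1, 2, 2, 3, 3, 4],
    'Place': ['CLUB', 'CLUB', 'HOME', 'HOME', 'GOLF', 'HOME'],
})

# First and last Period in which each Place appears
span = df.groupby('Place')['Period'].agg(['min', 'max'])
span.columns = ['first', 'last']


def active_places(period):
    """Places whose first..last range covers the given Period."""
    mask = (span['first'] <= period) & (span['last'] >= period)
    return sorted(span.loc[mask].index.tolist())


print(active_places(3))
```

This treats a Place as finished after its last row, so Places that are meant to continue beyond the end of the data need their lifetime extended some other way.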

Attempt:

def AssignPlace(df):
    # Number the unique Places in order of first appearance, three per group
    uniquePlaces = df['Place'].unique()
    G3 = dict(zip(uniquePlaces, np.arange(len(uniquePlaces)) // 3 + 1))
    df['Assigned Group'] = df['Place'].map(G3)
    return df

df = df.groupby('Available Group', sort=False).apply(AssignPlace)
df = df.drop_duplicates(subset = ['Period','Place'])

Out:

    Period   Place  Total Available Group  Assigned Group
0   1       CLUB    1      1               1             
1   2       CLUB    1      2               1             
3   2       HOME    2      2               1             
5   2       AWAY    3      2               1             
7   3       WORK    4      2               2             
9   3       AWAY    4      1               1             
11  3       GOLF    5      1               2  #GOLF FINISHES SO 4 OCCURRING FROM NEXT ROW
13  4       CLUB    4      2               1  #CLUB FINISHES BUT POOL STARTS SO STILL 4 OCCURRING FROM NEXT ROW
15  4       POOL    4      2               2             
17  4       HOME    4      2               1             
19  5       WORK    4      2               2             
21  5       AWAY    4      1               1             
23  5       POOL    4      1               2             
25  6       TENNIS  5      2               3  #Signifies issue

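The last row shows where the problem starts. The assigned Group correctly measures this Place as the 7th unique value, but it doesn't account for the values that are currently unique. As Club and Golf have finished, there are only 5 current unique values and 2 available Groups, yet it returns Group 3. So it will keep counting every new unique value rather than considering only the unique values currently occurring.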
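The effect can be seen in isolation. The mapping in the attempt depends only on the order in which Places first appear, so the 7th Place always lands in Group 3 regardless of how many earlier Places have finished:

```python
import numpy as np

# Places from the question, in order of first appearance
places = ['CLUB', 'HOME', 'AWAY', 'WORK', 'GOLF', 'POOL', 'TENNIS']

# Same arithmetic as the attempt: positions 0-2 -> 1, 3-5 -> 2, 6 -> 3
groups = np.arange(len(places)) // 3 + 1
mapping = dict(zip(places, groups.tolist()))

print(mapping)
```

Nothing in this calculation ever gives a number back when a Place finishes, which is why the group number can only grow.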

Intended output, where the assigned Group for TENNIS is now 1 rather than 3:

    Period   Place  Total Available Group  Assigned Group
0   1       CLUB    1      1               1             
1   2       CLUB    1      2               1             
3   2       HOME    2      2               1             
5   2       AWAY    3      2               1             
7   3       WORK    4      2               2             
9   3       AWAY    4      1               1             
11  3       GOLF    5      1               2             
13  4       CLUB    4      2               1             
15  4       POOL    4      2               2             
17  4       HOME    4      2               1             
19  5       WORK    4      2               2             
21  5       AWAY    4      1               1             
23  5       POOL    4      1               2             
25  6       TENNIS  5      2               1 

Dev*_*dka 7

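Here is my attempt. The explanation is in the code comments; if that's not enough, leave me a comment here.

Note: I added 5 dummy rows at the bottom to simulate that these Places will appear later in the df. So please ignore the rows with Period = 0.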
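As a rough sketch of how such dummy rows could be appended programmatically instead of typed by hand, assuming you already know which Places should be treated as continuing (the `continuing` list and the miniature frame here are hypothetical):

```python
import pandas as pd

# Hypothetical miniature frame; CLUB finishes, HOME is assumed to continue
df = pd.DataFrame({
    'Period': [1, 2, 2, 3],
    'Place': ['CLUB', 'CLUB', 'HOME', 'HOME'],
})
continuing = ['HOME']

# One dummy row per continuing Place, marked with Period = 0
dummy = pd.DataFrame({'Period': 0, 'Place': continuing})
extended = pd.concat([df, dummy], ignore_index=True)

# Last row index per Place; the dummy row pushes HOME's last index to the end
last_index = extended.reset_index().groupby('Place')['index'].max()
print(last_index)
```

Because the dummy rows sit at the bottom, every continuing Place gets a last index beyond the real data, so it is never considered finished.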

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Period' : [1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5,6,6,0,0,0,0,0],  
    'Place' : ['CLUB','CLUB','CLUB','HOME','HOME','AWAY','AWAY','WORK','WORK','AWAY','AWAY','GOLF','GOLF','CLUB','CLUB','POOL','POOL','HOME','HOME','WORK','WORK','AWAY','AWAY','POOL','POOL','TENNIS','TENNIS', "AWAY","HOME","POOL","WORK", "TENNIS"],                                
#     'Total' : [1,1,1,2,2,3,3,4,4,4,4,5,5,4,4,4,4,4,4,4,4,4,4,4,4,5,5,0,0,0,0,0],                            
#     'Available Group' : ['1','2','1','2','1','2','1','2','1','1','2','1','2','2','1','2','1','2','1','2','1','1','2','1','2','2','1',0,0,0,0,0],                           
    })

# DataFrame holding every unique Place
uniquePlaces = pd.DataFrame(df["Place"].unique(), columns=["Place"])
# Start stores the index of df where the Place first appears
uniquePlaces["Start"] = -1
# End stores the index of df where the Place last appears
uniquePlaces["End"] = -1

## Adds a new column "Place Label", the label-encoded value for a Place.
## "Place Label" may not be strictly necessary, but it may improve performance when looking up and merging.
## This function also records the Start and End of the current group's Place in uniquePlaces.
def assign_place_label(group):
    label=uniquePlaces[uniquePlaces["Place"]==group.name].index[0]
    group["Place Label"] = label
    uniquePlaces.loc[label, "Start"] = group.index.min()
    uniquePlaces.loc[label, "End"] = group.index.max()
    return group

## Based on the Start and End of each Place, assign an Index (slot) to each Place.
## When a Place finishes, its Index is freed and reused by the next Place that appears after it.
def get_dynamic_group(up):
    up["Index"] = 0
    up["Freed"] = False
    max_ind=0
    free_indx = []
    for i in range(len(up)):
        ind_freed = up.index[(up["End"]<up.iloc[i]["Start"]) & (~up["Freed"])]
        free = list(up.loc[ind_freed, "Index"])
        free_indx += free

        up.loc[ind_freed, "Freed"] = True


        if len(free_indx)>0:
            m = min(free_indx)
            up.loc[i, "Index"] = m
            free_indx.remove(m)

        else:
            up.loc[i, "Index"] = max_ind
            max_ind+=1

    up["Group"] = up["Index"]//3+1

    return up  

# group_keys=False keeps the original row index on pandas >= 2.0
df2 = df.groupby("Place", group_keys=False).apply(assign_place_label)
uniquePlaces = get_dynamic_group(uniquePlaces)

print(uniquePlaces)

df3 = df2[df2.Period!=0].drop_duplicates(subset = ['Period','Place'])
result = df3.merge(uniquePlaces[["Group"]], how="left", left_on="Place Label",
                   right_index=True, sort=False)
print(result)

Output

    Period  Place   Place Label Group
0   1   CLUB    0   1
1   2   CLUB    0   1
3   2   HOME    1   1
5   2   AWAY    2   1
7   3   WORK    3   2
9   3   AWAY    2   1
11  3   GOLF    4   2
13  4   CLUB    0   1
15  4   POOL    5   2
17  4   HOME    1   1
19  5   WORK    3   2
21  5   AWAY    2   1
23  5   POOL    5   2
25  6   TENNIS  6   1
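A possible variant of the same slot-reuse idea, measured in Period rather than row index and using heaps instead of scanning the frame on every step. This is a sketch under stated assumptions, not the code above: the helper name `assign_groups` is hypothetical, lifetimes are passed in as `(place, first_period, last_period)` tuples sorted by first Period, and `float('inf')` stands in for the Places the question treats as continuing.

```python
import heapq


def assign_groups(lifetimes, per_group=3):
    """Assign a Group to each Place, reusing slots freed by finished Places.

    lifetimes -- list of (place, first_period, last_period), sorted by first_period
    """
    free = []      # min-heap of slot numbers available for reuse
    active = []    # min-heap of (last_period, slot) for Places still occurring
    next_slot = 0
    groups = {}
    for place, start, end in lifetimes:
        # Release every slot whose Place finished strictly before this start
        while active and active[0][0] < start:
            _, slot = heapq.heappop(active)
            heapq.heappush(free, slot)
        if free:
            slot = heapq.heappop(free)   # reuse the lowest freed slot
        else:
            slot = next_slot             # otherwise open a new slot
            next_slot += 1
        heapq.heappush(active, (end, slot))
        groups[place] = slot // per_group + 1
    return groups


INF = float('inf')  # assumption: Place keeps occurring after the data ends
lifetimes = [
    ('CLUB', 1, 4), ('HOME', 2, INF), ('AWAY', 2, INF), ('WORK', 3, INF),
    ('GOLF', 3, 3), ('POOL', 4, INF), ('TENNIS', 6, INF),
]
print(assign_groups(lifetimes))
```

With these lifetimes the sketch puts POOL in Group 2 and TENNIS in Group 1, matching the intended output in the question. A slot is only released when a Place's last Period is strictly earlier than the new Place's first Period, so CLUB (which ends in Period 4) does not free its slot for POOL (which starts in Period 4).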