Pandas: group by one column and concatenate the values of another column as a delimited list

Ali*_*Ali 5 python group-by pandas pandas-groupby

I want to group all qualifications (as a delimiter-separated list) by job title.

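In the following dataset, jobs of the same type (.Net Developer) require different sets of qualifications, while another job does not require any qualification.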

JobID    Job Title      Qualification ID Qualification Name
34455226 .Net Developer ICT50715         Diploma of Software Development
34455226 .Net Developer ICT40515         Certificate IV in Programming
34466933 .Net Developer ICT50715         Diploma of Software Development
34466111 .Net Developer ICT50655         Diploma of Software Testing
34479964 Snr Finance Systems Analyst 

I want a consolidated view of all the unique qualifications that a particular type of job may require, like this:

Job Title                     Qualifications
.Net Developer                Diploma of Software Development,Certificate IV in Programming,Diploma of Software Testing
Snr Finance Systems Analyst   N/A

This is what I have done so far:

def f(x):
    return pd.Series(dict(Qualifications = ",".join(map(str, x["Qualification Name"]))))

df_jobs_qualifications\
    .groupby("Job Title")[['Qualification Name']]\
    .apply(f)

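But it gives me duplicate qualification names (see below: Diploma of Software Development is repeated), whereas I want unique qualification names.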

Job Title                     Qualifications
.Net Developer                Diploma of Software Development,Certificate IV in Programming,Diploma of Software Development,Diploma of Software Testing
Snr Finance Systems Analyst   N/A

UPDATE

My question is different from this question because, even after following the steps described in that question, I am not getting unique values.

jez*_*ael 6

If you need unique strings:

You can use set or unique; if the column may contain None or NaN values, also add dropna:

df1 = (df.groupby('Job Title')['Qualification Name']
       .apply(lambda x: ','.join(set(x.dropna())))
       .reset_index())

print (df1)
                     Job Title  \
0               .Net Developer   
1  Snr Finance Systems Analyst   

                                  Qualification Name  
0  Diploma of Software Development,Diploma of Sof...  
1     
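Because a set has no guaranteed iteration order, the order of the joined names in the set-based version can vary. The following is a minimal, self-contained sketch that sorts the names to get a reproducible order. The DataFrame is reconstructed from the question's sample data, and the name df_sorted is only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question; None marks the job
# that has no qualification.
df = pd.DataFrame({
    "JobID": [34455226, 34455226, 34466933, 34466111, 34479964],
    "Job Title": [".Net Developer"] * 4 + ["Snr Finance Systems Analyst"],
    "Qualification Name": [
        "Diploma of Software Development",
        "Certificate IV in Programming",
        "Diploma of Software Development",
        "Diploma of Software Testing",
        None,
    ],
})

# sorted() gives an alphabetical, reproducible order, because iterating
# over a set does not guarantee any particular order.
df_sorted = (df.groupby("Job Title")["Qualification Name"]
             .apply(lambda x: ",".join(sorted(set(x.dropna()))))
             .reset_index())

print(df_sorted)
```

Sorting alphabetically is a different choice from keeping the original row order; use unique instead if the order of first appearance matters.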

If order matters:

df1 = (df.groupby('Job Title')['Qualification Name']
       .apply(lambda x: ','.join(x.dropna().unique()))
       .reset_index())

print (df1)
                     Job Title  \
0               .Net Developer   
1  Snr Finance Systems Analyst   

                                  Qualification Name  
0  Diploma of Software Development,Certificate IV...  
1                                                     
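To get closer to the output requested in the question (a Qualifications column, with N/A for jobs that have no qualification), the order-preserving variant might be wrapped as in the sketch below. The helper name join_unique and the sample DataFrame are assumptions made for illustration; "N/A" is taken from the question's desired output.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "JobID": [34455226, 34455226, 34466933, 34466111, 34479964],
    "Job Title": [".Net Developer"] * 4 + ["Snr Finance Systems Analyst"],
    "Qualification Name": [
        "Diploma of Software Development",
        "Certificate IV in Programming",
        "Diploma of Software Development",
        "Diploma of Software Testing",
        None,
    ],
})

def join_unique(names):
    # unique() keeps values in order of first appearance.
    unique_names = names.dropna().unique()
    # Fall back to the "N/A" placeholder when the group has no names.
    return ",".join(unique_names) if len(unique_names) else "N/A"

result = (df.groupby("Job Title")["Qualification Name"]
          .apply(join_unique)
          .reset_index(name="Qualifications"))

print(result)
```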

If you want NaN for groups with no values:

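import numpy as np

def f(x):
    # Unique non-null qualification names in this group
    val = set(x.dropna())
    if len(val) > 0:
        val = ','.join(val)
    else:
        # No qualifications for this job title
        val = np.nan
    return val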

df2 = df.groupby('Job Title')['Qualification Name'].apply(f).reset_index()
print (df2)
                     Job Title  \
0               .Net Developer   
1  Snr Finance Systems Analyst   

                                  Qualification Name  
0  Diploma of Software Development,Diploma of Sof...  
1                                                NaN  
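As an alternative sketch (not part of the original answer), the same result can probably be reached without a helper function by joining first and then replacing the empty strings with NaN. The sample DataFrame and the names joined and df_nan are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "JobID": [34455226, 34455226, 34466933, 34466111, 34479964],
    "Job Title": [".Net Developer"] * 4 + ["Snr Finance Systems Analyst"],
    "Qualification Name": [
        "Diploma of Software Development",
        "Certificate IV in Programming",
        "Diploma of Software Development",
        "Diploma of Software Testing",
        None,
    ],
})

# Groups without any qualification produce an empty string here ...
joined = (df.groupby("Job Title")["Qualification Name"]
          .apply(lambda x: ",".join(x.dropna().unique())))

# ... which is then replaced by NaN (exact match on the whole value).
df_nan = joined.replace("", np.nan).reset_index()

print(df_nan)
```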

If you need lists of unique values:

df2 = (df.groupby('Job Title')['Qualification Name']
       .apply(lambda x: list(set(x)))
       .reset_index())

print (df2)
                     Job Title  \
0               .Net Developer   
1  Snr Finance Systems Analyst   

                                  Qualification Name  
0  [Diploma of Software Development, Diploma of S...  
1                                             [None]  

df2 = (df.groupby('Job Title')['Qualification Name']
       .apply(lambda x: list(x.unique()))
       .reset_index())

print (df2)
                     Job Title  \
0               .Net Developer   
1  Snr Finance Systems Analyst   

                                  Qualification Name  
0  [Diploma of Software Development, Certificate ...  
1                                             [None]  
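Both list variants above keep None in the list for the job without qualifications. If an empty list is preferred, dropna can be applied first, as in this minimal sketch; the sample DataFrame and the name lists are assumptions for illustration.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "JobID": [34455226, 34455226, 34466933, 34466111, 34479964],
    "Job Title": [".Net Developer"] * 4 + ["Snr Finance Systems Analyst"],
    "Qualification Name": [
        "Diploma of Software Development",
        "Certificate IV in Programming",
        "Diploma of Software Development",
        "Diploma of Software Testing",
        None,
    ],
})

# dropna() removes None/NaN before de-duplicating, so a job with no
# qualification ends up with an empty list instead of [None].
lists = (df.groupby("Job Title")["Qualification Name"]
         .apply(lambda x: list(x.dropna().unique()))
         .reset_index())

print(lists)
```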