Vij*_*iya 9 python pandas pandas-groupby
I have a DataFrame:
import pandas as pd

data = [["John","144","Smith","200"], ["Mia","220","John","144"],["Caleb","155","Smith","200"],["Smith","200","Jason","500"]]
data_frame = pd.DataFrame(data,columns = ["Name","ID","Manager_name","Manager_ID"])
data_frame
Output:
Name ID Manager_name Manager_ID
0 John 144 Smith 200
1 Mia 220 John 144
2 Caleb 155 Smith 200
3 Smith 200 Jason 500
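I am trying to count the number of people reporting under each person in the "Name" column.
The logic is:
Count the people who report directly to a person plus the people who report further down that chain. Take Smith as an example: John and Caleb report to Smith directly, which gives 2, and Mia reports to John (who already reports to Smith), which adds 1, so the total is 3.
The same goes for Jason: 1 because Smith reports to him, plus the 3 people who already report to Smith, so the total is 4.
I know how to do this in plain Python with some kind of recursion. Is there a way to do it efficiently in pandas? Any suggestions?
Expected output: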
Name Number of people reporting
John 1
Mia 0
Caleb 0
Smith 3
Jason 4
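Scott Boston's Networkx solution is the preferred solution...
There are two solutions here. The first is a vectorized, pandas-style solution that should run quickly on larger datasets. The second is Pythonic and does not work well at the dataset size the OP is dealing with; the original df has shape (223635, 4).
- Pandas solution
The problem is to find how many people each person in the organization manages, including subordinates of subordinates. This solution builds a dataframe by repeatedly appending a column that holds the manager of the previous column, then counts the occurrences of each employee in that dataframe to get the total number of people under them.
First, we set up the input.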
import pandas as pd
import numpy as np
data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])
df = df[["SID", "Manager_SID"]]
# shortening the columns for convenience
df.columns = ["1", "2"]
print(df)
1 2
0 144 200
1 220 144
2 155 200
3 200 500
To start, the employees who have no subordinates are collected into a separate dictionary, each with a count of 0.
df_not_mngr = df.loc[~df['1'].isin(df['2']), '1']
non_mngr_dict = {str(key):0 for key in df_not_mngr.values}
non_mngr_dict
{'220': 0, '155': 0}
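Next, we extend the dataframe by adding a column that holds the manager of the previous column. The loop stops when the rightmost column contains no employees.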
for i in range(2, 10):
    df = df.merge(
        df[["1", "2"]], how="left", left_on=str(i), right_on="1", suffixes=("_l", "_r")
    ).drop("1_r", axis=1)
    df.columns = [str(x) for x in range(1, i + 2)]
    if df.iloc[:, -1].isnull().all():
        break
print(df)
1 2 3 4 5
0 144 200 500 NaN NaN
1 220 144 200 500 NaN
2 155 200 500 NaN NaN
3 200 500 NaN NaN NaN
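The loop above caps the hierarchy at a fixed number of levels through range(2, 10). As a rough sketch only, the same chain of manager columns could also be built with Series.map and a loop that stops on its own. The names edges, lookup and chain are made up for this illustration, and the sketch assumes that every employee ID appears exactly once in column "1" (one manager per employee) and that the hierarchy contains no cycles.

```python
import pandas as pd

# Employee ID ("1") and manager ID ("2"), same values as in the example above.
edges = pd.DataFrame(
    {"1": ["144", "220", "155", "200"], "2": ["200", "144", "200", "500"]}
)

# Lookup from employee ID to manager ID (assumes unique employee IDs).
lookup = edges.set_index("1")["2"]

chain = edges.copy()
level = 2
# Keep appending "manager of the previous column" until a column is all NaN.
# The second condition is a safety cap in case the data contains a cycle.
while chain[str(level)].notna().any() and level <= len(edges) + 1:
    chain[str(level + 1)] = chain[str(level)].map(lookup)
    level += 1

print(chain)
```

On the sample data this should end with the same five columns as the merge-based loop, the last one being entirely NaN.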
All columns except the first are flattened, and the occurrences of each employee are counted and stored in a dictionary.
from collections import Counter
result = dict(Counter(df.iloc[:, 1:].values.flatten()))
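The same count can probably be kept inside pandas. The sketch below is only an illustration: it uses value_counts, which skips NaN by default, and reindex to give employees with no reports a count of 0. The frame chain is a hand-written copy of the manager columns from the example, and the names chain, all_ids and counts are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Hand-written copy of the employee/manager chain from the example above.
chain = pd.DataFrame(
    {
        "1": ["144", "220", "155", "200"],
        "2": ["200", "144", "200", "500"],
        "3": ["500", "200", "500", np.nan],
        "4": [np.nan, "500", np.nan, np.nan],
    }
)

# Every ID that appears either as an employee or as a manager.
all_ids = pd.unique(chain[["1", "2"]].to_numpy().ravel())

# Count how often each ID appears as a (direct or indirect) manager;
# value_counts drops NaN, reindex adds the non-managers with a count of 0.
counts = (
    pd.Series(chain.iloc[:, 1:].to_numpy().ravel())
    .value_counts()
    .reindex(all_ids, fill_value=0)
)

print(counts)
```

With this variant there is no NaN entry to clean up and no separate non-manager dictionary to merge in.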
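The non-manager dictionary is then added to the result. (The nan key in the output only counts the empty cells and can be ignored.)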
result.update(non_mngr_dict)
result
{'200': 3, '500': 4, nan: 8, '144': 1, '220': 0, '155': 0}
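The result dictionary is keyed by ID, while the expected output in the question is keyed by name. A possible way to convert it is sketched below. The result dictionary is hard-coded from the output above so the snippet stands alone, and id_to_name and report are names invented for this sketch.

```python
import pandas as pd

data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])

# Counts as produced by the steps above (hard-coded for this sketch).
result = {"200": 3, "500": 4, float("nan"): 8, "144": 1, "220": 0, "155": 0}

# ID -> name lookup built from both the employee and the manager columns,
# so that top-level managers such as Jason are included.
id_to_name = dict(zip(df["SID"], df["Name"]))
id_to_name.update(zip(df["Manager_SID"], df["Manager_name"]))

# Drop the NaN key and translate IDs to names.
report = {id_to_name[sid]: n for sid, n in result.items() if pd.notna(sid)}

print(report)
```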
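- Recursive Pythonic solution
I think this is probably more Pythonic than what you were looking for. First, I created a list, all_sids, to make sure we capture every employee, since not every employee appears in both columns.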
import pandas as pd
import numpy as np
data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])
all_sids = pd.unique(df[['SID', 'Manager_SID']].values.ravel('K'))
Then create a pivot table.
dfp = df.pivot_table(values='Name', index='SID', columns='Manager_SID', aggfunc='count')
dfp
Manager_SID 144 200 500
SID
144 NaN 1.0 NaN
155 NaN 1.0 NaN
200 NaN NaN 1.0
220 1.0 NaN NaN
Then a function walks through the pivot table and totals up all the reports.
def count_mngrs(SID, count=0):
    if str(SID) not in dfp.columns:
        return count
    else:
        count += dfp[str(SID)].sum()
        sid_list = dfp[dfp[str(SID)].notnull()].index
        for sid in sid_list:
            count = count_mngrs(sid, count)
        return count
Call the function for each employee and print the result.
print('SID', ' Number of People Reporting')
for sid in all_sids:
    print(sid, " " , int(count_mngrs(sid)))
The results are below. Sorry, I was a bit lazy about putting the names and SIDs together.
SID Number of People Reporting
144 1
220 0
155 0
200 3
500 4
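Because the recursive function is called once per employee, shared subtrees are walked again on every call, which is one likely reason it struggles on a frame of a couple of hundred thousand rows. A hedged alternative, not the approach used above, is to build a plain dictionary of direct reports and cache each subtree total with functools.lru_cache. The names pairs, children and count_reports are hypothetical, and the sketch assumes an acyclic hierarchy that is shallow enough to stay under Python's recursion limit.

```python
from collections import defaultdict
from functools import lru_cache

# (employee SID, manager SID) pairs from the example above.
pairs = [("144", "200"), ("220", "144"), ("155", "200"), ("200", "500")]

# Manager SID -> list of direct reports.
children = defaultdict(list)
for emp, mgr in pairs:
    children[mgr].append(emp)


@lru_cache(maxsize=None)
def count_reports(sid):
    """Total number of direct and indirect reports under sid."""
    # Each direct report contributes itself (1) plus everyone under it.
    return sum(1 + count_reports(child) for child in children.get(sid, ()))


all_sids = {sid for pair in pairs for sid in pair}
totals = {sid: count_reports(sid) for sid in all_sids}

print(totals)
```

Each subtree is then computed only once, so the total work grows roughly with the number of rows rather than with rows times depth.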
Looking forward to seeing more pandas-style solutions!
This is also a graph problem, so you can use Networkx:
import networkx as nx
import pandas as pd
data = [["John","144","Smith","200"], ["Mia","220","John","144"],["Caleb","155","Smith","200"],["Smith","200","Jason","500"]]
data_frame = pd.DataFrame(data,columns = ["Name","ID","Manager_name","Manager_ID"])
# create a directed graph object using nx.DiGraph
G = nx.from_pandas_edgelist(data_frame,
                            source='Name',
                            target='Manager_name',
                            create_using=nx.DiGraph())
# use nx.ancestors to get the set of "ancestor" nodes for each node in the directed graph
pd.DataFrame.from_dict({i: len(nx.ancestors(G, i)) for i in G.nodes()},
                       orient='index',
                       columns=['Num of People reporting'])
Output:
Num of People reporting
John 1
Smith 3
Mia 0
Caleb 0
Jason 4
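The graph above uses names as nodes, which would merge two different employees who happen to share a name. A possible variation, shown only as a sketch, builds the graph on the ID columns and attaches the names afterwards. The id_to_name lookup and the counts frame are assumptions of this sketch and are not part of the answer above.

```python
import networkx as nx
import pandas as pd

data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
data_frame = pd.DataFrame(data, columns=["Name", "ID", "Manager_name", "Manager_ID"])

# Directed graph with an edge from each employee ID to the manager ID.
G = nx.from_pandas_edgelist(
    data_frame, source="ID", target="Manager_ID", create_using=nx.DiGraph()
)

# ID -> name lookup covering both employees and managers.
id_to_name = dict(zip(data_frame["ID"], data_frame["Name"]))
id_to_name.update(zip(data_frame["Manager_ID"], data_frame["Manager_name"]))

# nx.ancestors returns every node with a path to the given node,
# i.e. everyone who reports to that ID directly or indirectly.
counts = pd.DataFrame(
    [(node, id_to_name[node], len(nx.ancestors(G, node))) for node in G.nodes()],
    columns=["ID", "Name", "Num of People reporting"],
)

print(counts)
```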
Plotting with networkx: