Joh*_*ohn 1 python pandas pandas-groupby
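I have a large dataset (15k+ rows) and I'm trying to show a proportional share of each investment based on the number of investors (not on actual ownership). This is a known flaw in the data, but we are trying to fix how it is presented. Right now I can remove the duplicates in SQL (if I have 3 customers who invested 600, I drop the duplicates so that 1 customer has 600), but that is all or nothing; instead I want to show each of those three customers 200.
I need the groupby to take CustomerID, ParentID and the investment amount into account. Then I need to create another column that gives each customer (each SponsorName / row) the average investment per customer (Investment / number of customers with that specific amount for that CustomerID / ParentID combination). Finally, I need a groupby that sums the investment by SponsorName and counts the customer IDs.
Dataset: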
CustomerID ParentID SponsorName Investment
1 55 Bob 600
1 55 Jack 600
1 55 Mary 600
5 65 Bill 1200
5 65 Jim 1200
5 65 Jill 1200
1 55 Bob 1000
1 55 Jack 1000
1 55 Mary 1000
Desired output:
CustomerID ParentID SponsorName Investment Avg Investment
1 55 Bob 600 200
1 55 Jack 600 200
1 55 Mary 600 200
5 65 Bill 1200 400
5 65 Jim 1200 400
5 65 Jill 1200 400
1 55 Bob 1000 333.33
1 55 Jack 1000 333.33
1 55 Mary 1000 333.33
Thanks!
You can use GroupBy + transform with 'size':
counts = df.groupby(['CustomerID', 'ParentID'])['SponsorName'].transform('size')
df['Avg Investment'] = df['Investment'] / counts
Output:
CustomerID ParentID SponsorName Investment Avg Investment
0 1 55 Bob 600 200.0
1 1 55 Jack 600 200.0
2 1 55 Mary 600 200.0
3 5 65 Bill 1200 400.0
4 5 65 Jim 1200 400.0
5 5 65 Jill 1200 400.0
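Note that the choice of grouping keys matters as soon as one CustomerID / ParentID pair carries more than one investment amount. The sketch below rebuilds the nine-row dataset from the question (the DataFrame construction and the variable names two_key and three_key are illustrative assumptions, not part of the original answer) and compares the group sizes with and without Investment as a key:

```python
import pandas as pd

# Nine-row dataset from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 5, 5, 5, 1, 1, 1],
    "ParentID": [55, 55, 55, 65, 65, 65, 55, 55, 55],
    "SponsorName": ["Bob", "Jack", "Mary", "Bill", "Jim", "Jill",
                    "Bob", "Jack", "Mary"],
    "Investment": [600, 600, 600, 1200, 1200, 1200, 1000, 1000, 1000],
})

# Grouping on two keys lumps the 600 and 1000 investments together.
two_key = df.groupby(["CustomerID", "ParentID"])["SponsorName"].transform("size")

# Adding Investment as a key counts sponsors per distinct amount.
three_key = df.groupby(
    ["CustomerID", "ParentID", "Investment"]
)["SponsorName"].transform("size")

print(two_key.tolist())    # [6, 6, 6, 3, 3, 3, 6, 6, 6]
print(three_key.tolist())  # [3, 3, 3, 3, 3, 3, 3, 3, 3]
```

With only two keys, customer 1 would be divided by 6 rather than 3, which is why the investment amount has to be part of the grouping for this data.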
Updated for the revised question, courtesy of @ScottBoston:
group_keys = ['CustomerID', 'ParentID', 'Investment']
counts = df.groupby(group_keys)['SponsorName'].transform('size')
df['Avg Investment'] = df['Investment'] / counts
Output:
CustomerID ParentID SponsorName Investment Avg Investment
0 1 55 Bob 600 200.000000
1 1 55 Jack 600 200.000000
2 1 55 Mary 600 200.000000
3 5 65 Bill 1200 400.000000
4 5 65 Jim 1200 400.000000
5 5 65 Jill 1200 400.000000
6 1 55 Bob 1000 333.333333
7 1 55 Jack 1000 333.333333
8 1 55 Mary 1000 333.333333
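The question also asks for a final groupby that totals the investment per SponsorName and counts customer IDs. A minimal sketch of that step follows, under two assumptions that the question does not spell out: that the value to total is the proportional share (Avg Investment), and that "count" means the number of distinct CustomerID values. The output column names total_share and customer_count are hypothetical.

```python
import pandas as pd

# Nine-row dataset from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 5, 5, 5, 1, 1, 1],
    "ParentID": [55, 55, 55, 65, 65, 65, 55, 55, 55],
    "SponsorName": ["Bob", "Jack", "Mary", "Bill", "Jim", "Jill",
                    "Bob", "Jack", "Mary"],
    "Investment": [600, 600, 600, 1200, 1200, 1200, 1000, 1000, 1000],
})

# Per-row share, exactly as in the answer above.
group_keys = ["CustomerID", "ParentID", "Investment"]
counts = df.groupby(group_keys)["SponsorName"].transform("size")
df["Avg Investment"] = df["Investment"] / counts

# Assumed summary: total share and number of distinct customers per sponsor.
summary = (
    df.groupby("SponsorName")
      .agg(
          total_share=("Avg Investment", "sum"),
          customer_count=("CustomerID", "nunique"),
      )
      .reset_index()
)
print(summary)
```

If every row should be counted instead of distinct customers, swapping "nunique" for "count" would do that; which one is wanted depends on how the report defines a customer.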