vik*_*kky 7 python pandas pandas-groupby
I have a sample DF and am trying to replace the values of a list of columns with an ascending sorted index:
DF:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(7,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Mango","Mango","Mango","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
a b c d1 d2 date
0 2 7 9 Apple Orange 2002-01-01
1 6 0 9 Mango lemon 2002-01-01
2 8 0 0 Apple lemon 2002-01-01
3 4 4 4 Mango Orange 2002-01-01
4 5 0 8 Mango lemon 2002-02-01
5 6 1 6 Mango Orange 2002-02-01
6 7 2 7 Apple lemon 2002-02-01
Step 1:
Group the DF by the "date" column. For example, the group for "2002-01-01":
a b c d1 d2 date
0 2 7 9 Apple Orange 2002-01-01
1 6 0 9 Mango lemon 2002-01-01
2 8 0 0 Apple lemon 2002-01-01
3 4 4 4 Mango Orange 2002-01-01
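Step 2:
Within the group, replace the values of the columns ["d1","d2"] with their position (not the DF index) in the ascending sort of the mean of c.
For example, in the group above, mean(c, d1="Apple") = (9+0)/2 = 4.5 and mean(c, d1="Mango") = (9+4)/2 = 6.5, so the ascending sorted index is Apple: 0 and Mango: 1. The values of column d1 are therefore replaced as follows: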
a b c d1 d2 date
0 2 7 9 0 Orange 2002-01-01
1 6 0 9 1 lemon 2002-01-01
2 8 0 0 0 lemon 2002-01-01
3 4 4 4 1 Orange 2002-01-01
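This should be applied to the entire df. I have a brute-force approach that iterates over the groups and over each row; any suggestions for a more pandas-based solution would help with efficiency.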
You can use pivot_table and groupby.rank to create the ranks, then use map to assign the values back:
df1 = df.pivot_table('c', ['date','d1']).groupby(level=0).rank(method='dense')-1
df['d1'] = df[['date','d1']].agg(tuple, axis=1).map(df1.c).astype('int')
Out[255]:
a b c d1 d2 date
0 2 7 9 0 Orange 2002-01-01
1 6 0 9 1 lemon 2002-01-01
2 8 0 0 0 lemon 2002-01-01
3 4 4 4 1 Orange 2002-01-01
4 5 0 8 0 lemon 2002-02-01
5 6 1 6 0 Orange 2002-02-01
6 7 2 7 0 lemon 2002-02-01
Note: in the 2002-02-01 group, Mango and Apple have the same mean of c (7), so both are ranked 0.
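The answer above handles d1 only, while the question asks for both d1 and d2. A possible variant, sketched here under stated assumptions, swaps pivot_table and map for groupby.transform so the same step can be repeated per column. The helper name rank_by_group_mean is hypothetical, and the frame is hard-coded from the table printed in the question (the original code draws random integers), so the result is reproducible.

```python
import pandas as pd

# Fixed copy of the sample frame printed in the question
# (assumption: the original code fills a, b, c with random integers).
df = pd.DataFrame({
    "a": [2, 6, 8, 4, 5, 6, 7],
    "b": [7, 0, 0, 4, 0, 1, 2],
    "c": [9, 9, 0, 4, 8, 6, 7],
    "d1": ["Apple", "Mango", "Apple", "Mango", "Mango", "Mango", "Apple"],
    "d2": ["Orange", "lemon", "lemon", "Orange", "lemon", "Orange", "lemon"],
    "date": pd.to_datetime(["2002-01-01"] * 4 + ["2002-02-01"] * 3),
})


def rank_by_group_mean(frame, col, value="c", by="date"):
    """Hypothetical helper: 0-based dense rank of mean(value) per (by, col)."""
    # Mean of `value` for every (by, col) pair, broadcast back onto each row.
    means = frame.groupby([by, col])[value].transform("mean")
    # Dense-rank those means within each `by` group; ties share a rank.
    return means.groupby(frame[by]).rank(method="dense").sub(1).astype(int)


# Compute every rank from the original labels before overwriting any column.
ranks = {col: rank_by_group_mean(df, col) for col in ["d1", "d2"]}
out = df.assign(**ranks)
print(out)
```

With this data the d1 column matches the output shown in the answer, and d2 is ranked the same way. As with the pivot_table version, ties share a rank because method='dense' is used.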