python pandas pandas-groupby
I have a large dataset in the following format:
id, socialmedia
1, facebook
2, facebook
3, google
4, google
5, google
6, twitter
7, google
8, twitter
9, snapchat
10, twitter
11, facebook
I want to group by socialmedia, assign a groupId column, and then ungroup (expand) back to the individual records:
id, socialmedia, groupId
1, facebook, 1
2, facebook, 1
3, google, 2
4, google, 2
5, google, 2
6, twitter, 3
7, google, 2
8, twitter, 3
9, snapchat, 4
10, twitter, 3
11, facebook, 1
I tried the following, but ended up with the error "'DataFrameGroupBy' object does not support item assignment":
x['grpId'] = x.groupby('socialmedia')['socialmedia'].rank(method='dense').astype(int)
By using ngroup:
df['grpId']=df.groupby('socialmedia').ngroup().add(1)
df
Out[354]:
id socialmedia grpId
0 1 facebook 1
1 2 facebook 1
2 3 google 2
3 4 google 2
4 5 google 2
5 6 twitter 4
6 7 google 2
7 8 twitter 4
8 9 snapchat 3
9 10 twitter 4
10 11 facebook 1
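Note that ngroup numbers the groups in sorted key order by default, which is why snapchat gets 3 and twitter gets 4 in the output above, while the expected output in the question numbers groups by first appearance. A minimal sketch, assuming the column is named `socialmedia` with no leading space, that passes `sort=False` to `groupby` so the numbering follows order of appearance:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": range(1, 12),
    "socialmedia": ["facebook", "facebook", "google", "google", "google",
                    "twitter", "google", "twitter", "snapchat", "twitter",
                    "facebook"],
})

# sort=False keeps the groups in order of first appearance
# instead of sorted key order, so twitter -> 3 and snapchat -> 4
df["grpId"] = df.groupby("socialmedia", sort=False).ngroup().add(1)
print(df)
```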
Or pd.factorize and 'category':
df['grpId']=pd.factorize(df['socialmedia'])[0]+1
df
Out[358]:
id socialmedia grpId
0 1 facebook 1
1 2 facebook 1
2 3 google 2
3 4 google 2
4 5 google 2
5 6 twitter 3
6 7 google 2
7 8 twitter 3
8 9 snapchat 4
9 10 twitter 3
10 11 facebook 1
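pd.factorize returns two things: the integer codes and the unique labels they refer to. The sketch below, on a small made-up Series (the `None` entry is only there for illustration), shows both return values, the `sort=True` option, and the fact that missing values get code -1, which becomes 0 after adding 1:

```python
import pandas as pd

# Hypothetical sample with one missing value
s = pd.Series(["facebook", "facebook", "google", "twitter", "snapchat", None])

# Default: codes follow order of first appearance
codes, uniques = pd.factorize(s)
print(codes + 1)      # the missing value has code -1, so it shows up as 0 here
print(list(uniques))  # labels in order of first appearance

# sort=True: codes follow the sorted order of the labels
sorted_codes, sorted_uniques = pd.factorize(s, sort=True)
print(sorted_codes + 1)
print(list(sorted_uniques))
```

If the column can contain missing values, it may be worth deciding up front whether a group id of 0 is acceptable for them.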
df['grpId']=df['socialmedia'].astype('category').cat.codes.add(1)
df
Out[356]:
id socialmedia grpId
0 1 facebook 1
1 2 facebook 1
2 3 google 2
3 4 google 2
4 5 google 2
5 6 twitter 4
6 7 google 2
7 8 twitter 4
8 9 snapchat 3
9 10 twitter 4
10 11 facebook 1
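The category codes above follow the sorted order of the labels. If order of first appearance is wanted instead, one option (a sketch, not the only way) is to build the categorical with an explicit category list taken from `unique()`:

```python
import pandas as pd

s = pd.Series(["facebook", "google", "twitter", "google", "snapchat"])

# Default categories are sorted, so codes are alphabetical
alphabetical = s.astype("category").cat.codes.add(1)

# Explicit categories in order of first appearance
cat = pd.Categorical(s, categories=s.unique())
by_appearance = pd.Series(cat.codes, index=s.index).add(1)

print(alphabetical.tolist())
print(by_appearance.tolist())
```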
You can use the sklearn.preprocessing.LabelEncoder method:
In [79]: from sklearn.preprocessing import LabelEncoder
In [80]: le = LabelEncoder()
In [81]: df['groupId'] = le.fit_transform(df['socialmedia'])+1
In [82]: df
Out[82]:
id socialmedia groupId
0 1 facebook 1
1 2 facebook 1
2 3 google 2
3 4 google 2
4 5 google 2
5 6 twitter 4
6 7 google 2
7 8 twitter 4
8 9 snapchat 3
9 10 twitter 4
10 11 facebook 1
We can also create a dictionary and map it:
import pandas as pd
df = pd.DataFrame(dict(id=range(1,5),social=["Facebook","Twitter","Facebook","Google"]))
d = dict((k,v) for v,k in enumerate(df['social'].unique(),1))
df['groupid'] = df['social'].map(d)
print(df)
Returns
id social groupid
0 1 Facebook 1
1 2 Twitter 2
2 3 Facebook 1
3 4 Google 3
Or as a one-liner like this:
df['groupid'] = df['social'].map({k:v for v,k in enumerate(df['social'].unique(),1)})
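To make the dictionary approach easier to follow, here is the same idea with the intermediate mapping printed. The variable name `mapping` is just for illustration. Any label missing from the dictionary would map to NaN, which cannot happen here because the dictionary is built from the column itself:

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 5),
    "social": ["Facebook", "Twitter", "Facebook", "Google"],
})

# unique() preserves order of first appearance; start numbering at 1
mapping = {label: i for i, label in enumerate(df["social"].unique(), start=1)}
print(mapping)

# Replace each label by its group number
df["groupid"] = df["social"].map(mapping)
print(df)
```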
Timings:
%timeit df['grpId']=df.groupby('social').ngroup().add(1)
%timeit df['grpId']=pd.factorize(df['social'])[0]+1
%timeit df['grpId']=df['social'].astype('category').cat.codes.add(1)
%timeit df['groupid'] = df['social'].map(dict((k,v) for v,k in enumerate(df['social'].unique(),1)))
Returns
100 loops, best of 3: 1.5 ms per loop <- Wen1
1000 loops, best of 3: 493 µs per loop <- Wen2
1000 loops, best of 3: 990 µs per loop <- Wen3
1000 loops, best of 3: 802 µs per loop <- Antonvbr
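The timings above appear to come from the small example frame (the column is named `social`), so the ranking may differ on larger data. A rough sketch for rerunning the comparison outside IPython with the standard `timeit` module follows. The frame `big`, its size, and the random labels are assumptions made up for the benchmark:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 100,000 rows drawn from four labels (assumed, not from the question)
rng = np.random.default_rng(0)
labels = np.array(["facebook", "google", "twitter", "snapchat"])
big = pd.DataFrame({"social": rng.choice(labels, size=100_000)})

# The four approaches from the answers, wrapped as zero-argument callables
candidates = {
    "ngroup": lambda: big.groupby("social").ngroup().add(1),
    "factorize": lambda: pd.factorize(big["social"])[0] + 1,
    "category": lambda: big["social"].astype("category").cat.codes.add(1),
    "dict map": lambda: big["social"].map(
        {k: v for v, k in enumerate(big["social"].unique(), 1)}
    ),
}

# Best of 3 repeats, 5 calls each, reported as seconds per call
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name:10s} {seconds * 1e3:8.2f} ms per call")
```

Results depend on the pandas version, the dtype of the column, and the number of distinct labels, so it is worth measuring on data shaped like your own.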