I have a pandas DataFrame like this:
>>> import pandas as pd
>>> data = {
...     'hotel_code': [1, 1, 1, 1, 1],
...     'feed': [1, 1, 1, 1, 2],
...     'price_euro': [100, 200, 250, 120, 130],
...     'client_nationality': ['fr', 'us', 'ru,de', 'gb', 'cn,us,br,il,fr,gb,de,ie,pk,pl']
... }
>>> df = pd.DataFrame(data)
>>> df
   hotel_code  feed  price_euro             client_nationality
0           1     1         100                             fr
1           1     1         200                             us
2           1     1         250                          ru,de
3           1     1         120                             gb
4           1     2         130  cn,us,br,il,fr,gb,de,ie,pk,pl
This is the expected output:
>>> import numpy as np
>>> data = {
...     'hotel_code': [1, 1],
...     'feed': [1, 2],
...     'cluster1': ['fr', 'cn,us,br,il,fr,gb,de,ie,pk,pl'],
...     'cluster2': ['us', np.nan],
...     'cluster3': ['ru,de', np.nan],
...     'cluster4': ['gb', np.nan],
... }
>>> df = pd.DataFrame(data)
>>> df
   hotel_code  feed                       cluster1 cluster2 cluster3 cluster4
0           1     1                             fr       us    ru,de       gb
1           1     2  cn,us,br,il,fr,gb,de,ie,pk,pl      NaN      NaN      NaN
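I want to create the cluster columns for each unique (hotel_code, feed) pair, but I can't figure out how. The number of clusters is variable. Any ideas? Thanks in advance.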
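Use GroupBy.cumcount to create a per-group counter, build a MultiIndex from hotel_code, feed and the counter Series, then reshape with Series.unstack. Finally, rename the columns and call DataFrame.reset_index to turn the MultiIndex levels back into columns: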
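# Number the rows within each (hotel_code, feed) group: 0, 1, 2, ...
g = df.groupby(["hotel_code", "feed"]).cumcount()
# Index by the group keys plus the counter, then pivot the counter into columns.
df1 = (df.set_index(["hotel_code", "feed", g])['client_nationality']
         .unstack()
         .rename(columns=lambda x: f'cluster_{x+1}')
         .reset_index())
print(df1)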
   hotel_code  feed                      cluster_1 cluster_2 cluster_3  \
0           1     1                             fr        us     ru,de
1           1     2  cn,us,br,il,fr,gb,de,ie,pk,pl       NaN       NaN

  cluster_4
0        gb
1       NaN
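To see why the counter matters, it may help to look at the intermediate values. The following is a minimal self-contained sketch of the same approach, rebuilding the sample data from the question; the names counter and wide are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "hotel_code": [1, 1, 1, 1, 1],
    "feed": [1, 1, 1, 1, 2],
    "price_euro": [100, 200, 250, 120, 130],
    "client_nationality": ["fr", "us", "ru,de", "gb",
                           "cn,us,br,il,fr,gb,de,ie,pk,pl"],
})

# Position of each row inside its (hotel_code, feed) group.
counter = df.groupby(["hotel_code", "feed"]).cumcount()
print(counter.tolist())  # [0, 1, 2, 3, 0]

# With the counter as the innermost index level, every index entry is
# unique, so unstack() can pivot that level into one column per position.
wide = (
    df.set_index(["hotel_code", "feed", counter])["client_nationality"]
      .unstack()
      .rename(columns=lambda i: f"cluster_{i + 1}")
      .reset_index()
)
print(wide.columns.tolist())
```

Because the number of columns follows the largest group, groups with fewer rows are padded with NaN, which is how the variable number of clusters is handled.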