Ale*_*ana 14 python dataframe python-3.x pandas
I have a DataFrame df with two columns: Script (containing text) and Speaker
Script Speaker
aze Speaker 1
art Speaker 2
ghb Speaker 3
jka Speaker 1
tyc Speaker 1
avv Speaker 2
bhj Speaker 1
I have the following list: L = ['a','b','c']
Using the following code,
df = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
        .str.join('|')
        .str.get_dummies()
        .groupby(level=0).sum())  # total per speaker; equivalent to .sum(level=0) on pandas < 2.0
print(df)
I get this DataFrame df2:
Speaker a b c
Speaker 1 2 1 1
Speaker 2 2 0 0
Speaker 3 0 1 0
What line can I add to my code so that each row of df2 is expressed as a percentage of everything that speaker said, giving the following DataFrame df3:
Speaker a b c
Speaker 1 50% 25% 25%
Speaker 2 100% 0 0
Speaker 3 0 100% 0
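You can divide each row by its row sum (the sum along axis 1), then convert to strings and append %:
out = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
         .str.join('|')
         .str.get_dummies()
         .groupby(level=0).sum())  # total per speaker; equivalent to .sum(level=0) on pandas < 2.0

# divide every row by its own total, then format the result as percentage strings
out.div(out.sum(axis=1), axis=0).mul(100).astype(int).astype(str).add('%')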
a b c
Speaker
Speaker 1 50% 25% 25%
Speaker 2 100% 0% 0%
Speaker 3 0% 100% 0%
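For anyone who wants to run this end to end, here is a minimal self-contained sketch of the approach above. The names `counts`, `pct` and `df3` are just illustrative, and the rounding step is an assumption on my part (the original truncates with `astype(int)`). It also assumes every speaker has at least one match; a speaker with none would produce NaN on division and the integer cast would fail.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Script": ["aze", "art", "ghb", "jka", "tyc", "avv", "bhj"],
    "Speaker": ["Speaker 1", "Speaker 2", "Speaker 3", "Speaker 1",
                "Speaker 1", "Speaker 2", "Speaker 1"],
})
L = ["a", "b", "c"]

# Indicator columns per row, then summed per speaker
counts = (
    df.set_index("Speaker")["Script"]
      .str.findall("|".join(L))
      .str.join("|")
      .str.get_dummies()
      .groupby(level=0).sum()
)

# Divide each row by its own total (axis=0 aligns the row sums on the index)
pct = counts.div(counts.sum(axis=1), axis=0).mul(100)

# Assumption: round before casting so that e.g. 33.33 becomes "33%"
df3 = pct.round().astype(int).astype(str) + "%"
print(df3)
```

Using `div(..., axis=0)` keeps the alignment explicit, so the result does not depend on the frame happening to be square.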
Starting from your original DataFrame, if you want percentages rather than the grouped sum of dummies, you can rewrite the whole script as follows:
import pandas as pd

m = df.set_index('Speaker')['Script'].str.findall('|'.join(L))  # list of matches for each row
m = m.explode().reset_index()  # one row per match, as a DataFrame with Speaker and Script columns
final = pd.crosstab(m['Speaker'], m['Script'], normalize='index').mul(100)  # row-wise percentages
Script a b c
Speaker
Speaker 1 50.0 25.0 25.0
Speaker 2 100.0 0.0 0.0
Speaker 3 0.0 100.0 0.0
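The same steps as a runnable sketch with the question's sample data filled in. The variable name `matches` is only illustrative; everything else follows the snippet above.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Script": ["aze", "art", "ghb", "jka", "tyc", "avv", "bhj"],
    "Speaker": ["Speaker 1", "Speaker 2", "Speaker 3", "Speaker 1",
                "Speaker 1", "Speaker 2", "Speaker 1"],
})
L = ["a", "b", "c"]

# One list of matched letters per row, indexed by speaker
matches = df.set_index("Speaker")["Script"].str.findall("|".join(L))

# explode() needs pandas 0.25+; it yields one row per matched letter
m = matches.explode().reset_index()

# normalize='index' divides each row by its row total
final = pd.crosstab(m["Speaker"], m["Script"], normalize="index").mul(100)
print(final)
```

If you want the `%` suffix here as well, something like `final.round().astype(int).astype(str) + '%'` should do it.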
If you don't want percentages, use:
pd.crosstab(m['Speaker'],m['Script'])
Script a b c
Speaker
Speaker 1 2 1 1
Speaker 2 2 0 0
Speaker 3 0 1 0
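One caveat worth checking on your own data: the two approaches agree on the sample above, but as far as I can tell they can differ when a letter occurs more than once in a single Script value. `str.get_dummies` only records presence per row, while `explode` plus `crosstab` counts every match. A small sketch with a made-up row (`'aab'` is hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical row where 'a' appears twice
demo = pd.DataFrame({"Script": ["aab"], "Speaker": ["Speaker 1"]})
L = ["a", "b", "c"]

matches = demo.set_index("Speaker")["Script"].str.findall("|".join(L))

# Dummies approach: 'a|a|b' becomes indicator columns, so 'a' counts once
presence = matches.str.join("|").str.get_dummies().groupby(level=0).sum()

# Crosstab approach: each match is its own row, so 'a' counts twice
m = matches.explode().reset_index()
occurrences = pd.crosstab(m["Speaker"], m["Script"])

print(presence)
print(occurrences)
```

Which one is right depends on whether you want "rows containing the letter" or "total occurrences of the letter".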
Note: this requires pandas 0.25+, the version that introduced Series.explode.