joh*_*doe 6 python dataframe pandas
I have a dataframe df1 like this:
id text c1
1 Hello world how are you people 1
2 Hello people I am fine people 1
3 Good Morning people -1
4 Good Evening -1
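For anyone who wants to reproduce this, here is a minimal sketch that builds the sample table. The values are copied from the rows above; the variable is named df (an assumption, not something the question states) so that it matches the name used in the answer code.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})
print(df)
```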
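What I want is a df2 that lists every word from df1 exactly once, together with its count (total occurrences),
and I want to sum the c1 column into a new column in df2 (adding a row's c1 only if the word appears in that row).
Expected output: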
Word Totalcount Points
hello 2 2
world 1 1
how 1 1
are 1 1
you 1 1
people 3 1
I 1 1
am 1 1
fine 1 1
Good 2 -2
Morning 1 -1
Evening 1 -1
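Note that "people" occurs four times in the text, yet the expected Totalcount is 3. That suggests each word is counted once per row and the row's c1 is added once per row. This reading is inferred from the expected output, not stated explicitly. A plain-Python sketch of that rule, with the sample rows hard-coded:

```python
from collections import Counter

# (text, c1) pairs copied from the sample table.
rows = [
    ("Hello world how are you people", 1),
    ("Hello people I am fine people", 1),
    ("Good Morning people", -1),
    ("Good Evening", -1),
]

total_count = Counter()
points = Counter()
for text, c1 in rows:
    # set() keeps each word once per row, so a word repeated within
    # a row is counted once and the row's c1 is added once.
    for word in set(text.split()):
        total_count[word] += 1
        points[word] += c1

print(total_count["people"], points["people"])  # 3 1
print(total_count["Good"], points["Good"])      # 2 -2
```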
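First extract the column with DataFrame.pop, split it into words with Series.str.split, reshape the result to a Series with DataFrame.stack, and attach it back to the original with DataFrame.join. Then remove duplicates with DataFrame.drop_duplicates and aggregate with GroupBy.agg, using size for the counts and sum for the points: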
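# One word per row, indexed by the original row label.
# Note: pop removes the 'text' column from df in place.
s = (df.pop('text')
       .str.split(expand=True)
       .stack()
       .reset_index(level=1, drop=True)
       .rename('text'))

# Join the words back, keep each (id, word) pair once, then aggregate c1.
df1 = (df.join(s)
         .reset_index(drop=True)
         .drop_duplicates(['id', 'text'])
         .groupby('text', sort=False)['c1']
         .agg([('Totalcount', 'size'), ('Points', 'sum')])
         .reset_index()
         .rename(columns={'text': 'Word'}))
print(df1)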
Word Totalcount Points
0 Hello 2 2
1 world 1 1
2 how 1 1
3 are 1 1
4 you 1 1
5 people 3 1
6 I 1 1
7 am 1 1
8 fine 1 1
9 Good 2 -2
10 Morning 1 -1
11 Evening 1 -1
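As an alternative to pop/stack/join, the same reshaping can be sketched with DataFrame.explode and named aggregation. This is not the approach from the answer above; it assumes pandas 0.25 or later (where both features exist), and the variable name out is chosen here for illustration only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "Hello world how are you people",
        "Hello people I am fine people",
        "Good Morning people",
        "Good Evening",
    ],
    "c1": [1, 1, -1, -1],
})

out = (
    df.assign(Word=df["text"].str.split())   # list of words per row
      .explode("Word")                       # one row per word
      .drop_duplicates(["id", "Word"])       # count a word once per row
      .groupby("Word", sort=False)           # keep first-appearance order
      .agg(Totalcount=("c1", "size"), Points=("c1", "sum"))
      .reset_index()
)
print(out)
```

Unlike the pop-based version, this leaves the original df untouched, since assign returns a new frame.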
Edit:
For better performance, use chain.from_iterable together with numpy.repeat:
from itertools import chain
import pandas as pd

# Start from the original df: the first snippet's pop removed 'text'.
splitted = [x.split() for x in df['text']]
lens = [len(x) for x in splitted]

# Flatten the words and repeat id/c1 once per word of each row.
df = pd.DataFrame({
    'Word': list(chain.from_iterable(splitted)),
    'id': df['id'].values.repeat(lens),
    'c1': df['c1'].values.repeat(lens)
})

df1 = (df.drop_duplicates(['id', 'Word'])
         .groupby('Word', sort=False)['c1']
         .agg([('Totalcount', 'size'), ('Points', 'sum')])
         .reset_index())
print(df1)
Word Totalcount Points
0 Hello 2 2
1 world 1 1
2 how 1 1
3 are 1 1
4 you 1 1
5 people 3 1
6 I 1 1
7 am 1 1
8 fine 1 1
9 Good 2 -2
10 Morning 1 -1
11 Evening 1 -1