Conditional word frequency count in Pandas

Question

Tao*_*Han 6 python string nlp dataframe pandas

I have a dataframe that looks like this:

import pandas as pd

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

I want to count the words in the speech column, but only those that appear in a predefined list. For example, the list is:

wordlist = ['much', 'good','right']

I want to generate a new column showing how many times these three words occur in each row. So my expected output is:

  speaker                                             speech  words
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1

I tried:

df['total'] = 0
for word in df['speech'].str.split():
    if word in wordlist: 
        df['total'] += 1

But after I run it, the total column is always zero. What is wrong with my code?

Answer 1

CDJ*_*DJB 5

You can use the following vectorized approach:

import pandas as pd

data = {'speaker':['Adam','Ben','Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)

wordlist = ['much', 'good','right']

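# Match any listed word as a whole word: \b(?:much|good|right)\b
# (joining with r'\b|\b' would leave the first and last words without a boundary on one side)
df['total'] = df['speech'].str.count(r'\b(?:' + '|'.join(wordlist) + r')\b')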

This gives:

>>> df
  speaker                                             speech  total
0    Adam            Thank you very much and good afternoon.      2
1     Ben  Let me clarify that because I want to make sur...      1
2   Clair              By now you should have some good rest      1
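
As for why the loop in the question leaves total at zero: iterating over df['speech'].str.split() yields one list of tokens per row, so word is a whole list and word in wordlist is never true. Even if it were, df['total'] += 1 would increment every row at once rather than only the current one. Below is a minimal sketch of a per-row, token-based count that stays close to the original attempt. The helper name count_listed is hypothetical, and stripping surrounding punctuation from each token is an assumption about how words should be normalised.

```python
import string

import pandas as pd

data = {'speaker': ['Adam', 'Ben', 'Clair'],
        'speech': ['Thank you very much and good afternoon.',
                   'Let me clarify that because I want to make sure we have got everything right',
                   'By now you should have some good rest']}
df = pd.DataFrame(data)
wordlist = ['much', 'good', 'right']

# Each element yielded by .str.split() is the list of tokens for one row,
# not a single word, so testing it against wordlist never matches.
first_row_tokens = df['speech'].str.split().iloc[0]
print(type(first_row_tokens))        # a Python list
print(first_row_tokens in wordlist)  # False

wanted = set(wordlist)  # set membership tests are cheap

def count_listed(text):
    """Hypothetical helper: count the tokens of `text` that appear in `wanted`."""
    # Assumption: strip surrounding punctuation so a token like 'right.' still counts.
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return sum(tok in wanted for tok in tokens)

# Apply the helper row by row and store one count per row.
df['total'] = df['speech'].apply(count_listed)
print(df['total'].tolist())  # [2, 1, 1]
```

This comparison is case-sensitive, so 'Good' would not be counted. Lowercasing both the tokens and the list is one way to relax that if needed.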

Archived:	5 years, 9 months ago
Views:	803
Last activity:	5 years, 3 months ago