Pythonic way to count how many times words from a list/set occur in a DataFrame column

Question


v_c*_*r12 3 python count dataframe pandas find-occurrences

Given a list/set of labels

labels = {'rectangle', 'square', 'triangle', 'cube'}


and a DataFrame df,

df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])


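I want to know how many times each word in the labels set occurs in the text column of the DataFrame, and to create a new column containing the top X (perhaps 2 or 3) most frequent words. If two words occur the same number of times, they can both appear, in a list or a string.

Output: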

pd.DataFrame({'text' : ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], 'best_labels' : [{'rectangle' : 2, 'square' : 1, 'cube' : 1}, {'triangle' : 1, 'circle' : 1}, np.nan]})

df['best_labels'] = some_function(df.text)


Answer 1

And*_*ely 5

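from collections import Counter

import numpy as np
import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'], columns=['text'])

# Count the whitespace-separated words in each row and keep only those in `labels`;
# an empty dict is falsy, so `or np.nan` yields NaN for rows with no matching label.
df['best_labels'] = df.text.apply(lambda x: {k: v for k, v in Counter(x.split()).items() if k in labels} or np.nan)
print(df)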


Prints:

                                    text                               best_labels
0  rectangle rectangle in my square cube  {'rectangle': 2, 'square': 1, 'cube': 1}
1               triangle circle not here                           {'triangle': 1}
2                           nothing here                                       NaN

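The answer above keeps every matching label, whereas the question also asks for only the top X labels, with ties allowed. A minimal sketch of one way to add that, assuming ties at the cutoff count should all be kept and that n is a positive integer; the helper name `top_labels` and its parameter `n` are hypothetical and not part of the original answer:

```python
from collections import Counter

import numpy as np
import pandas as pd

labels = {'rectangle', 'square', 'triangle', 'cube'}
df = pd.DataFrame(
    ['rectangle rectangle in my square cube', 'triangle circle not here', 'nothing here'],
    columns=['text'],
)


def top_labels(text, labels, n=2):
    """Hypothetical helper: {label: count} for the n highest counts, ties kept; NaN if no match."""
    # Count only the whitespace-separated words that are in the label set.
    counts = Counter(word for word in text.split() if word in labels)
    if not counts:
        return np.nan
    # most_common() sorts by descending count; the n-th entry sets the cutoff,
    # and every label whose count reaches the cutoff is kept (so ties survive).
    ranked = counts.most_common()
    cutoff = ranked[min(n, len(ranked)) - 1][1]
    return {label: count for label, count in ranked if count >= cutoff}


df['best_labels'] = df['text'].apply(lambda t: top_labels(t, labels, n=2))
print(df)
```

With n=2, the first row still keeps all three labels because 'square' and 'cube' tie at the cutoff count of 1; with n=1 only 'rectangle' would remain. Like the original answer, this matches exact, case-sensitive tokens split on whitespace.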

Archived:	5 years, 6 months ago
Views:	89
Last recorded:	5 years, 6 months ago