Pandas DataFrame - how to remove words repeated across rows in a column

Question

I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'category':[0,1,2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})
df.head()

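I would like to get the following result, where the words that appear in every row are removed from each row.

Expected result (for the example above):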

category     text
0            is some for the first
1            second has
2            third is the

With the following code, I tried to combine the text of all rows into a single string:

final_list = []
for index, rows in df.iterrows():
    # Get the text of the current row
    my_list = rows.text
    # Append it to the final list
    final_list.append(my_list)
# Print the list
print(final_list)
text = ''

for i in range(len(final_list)):
    text += final_list[i] + ', '

print(text)

The idea from this question (pandas dataframe - how to find words that repeat in each row) did not help me get the expected result.

arr = [set(x.split()) for x in text.split(',')]
mutual_words = set.intersection(*arr)
result = [list(x.difference(mutual_words)) for x in arr]
result = sum(result, [])
final_text = (", ").join(result)
print(final_text)
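
A possible reason the snippet above returns every word: the string built earlier ends with a trailing ', ', so splitting on ',' leaves a whitespace-only last chunk, which becomes an empty set and makes the intersection empty. The following is only a minimal sketch of that diagnosis; the names `rows`, `arr_raw` and `arr_clean` are illustrative and not part of the original code.

```python
# Text built the same way as in the loop above: every row is followed by ', '
rows = ['this is some text for the first row',
        'second row has this text',
        'third row this is the text']
text = ''.join(row + ', ' for row in rows)

# Splitting on ',' leaves a whitespace-only last chunk, which becomes an empty set
arr_raw = [set(chunk.split()) for chunk in text.split(',')]
print(arr_raw[-1])
print(set.intersection(*arr_raw))

# Skipping empty chunks lets the intersection find the shared words
arr_clean = [set(chunk.split()) for chunk in text.split(',') if chunk.strip()]
mutual_words = set.intersection(*arr_clean)
print(sorted(mutual_words))
```

Even with this change, flattening the per-row results into one list loses the row structure, so the expected per-row output still needs a row-wise approach.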

Does anyone know how to get this result?

Answer 1

Shu*_*rma 3

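You can use Series.str.split to split the text column on whitespace, then use reduce to get the intersection of the words found in all rows, and finally use str.replace with a regular expression to remove the common words: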

from functools import reduce

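# Words present in every row: intersect the per-row word lists as sets
w = reduce(lambda x, y: set(x) & set(y), df['text'].str.split())
# Remove those words; regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['text'] = df['text'].str.replace(rf"(\s*)(?:{'|'.join(w)})\s*", r'\1', regex=True).str.strip()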

   category                    text
0         0   is some for the first
1         1              second has
2         2            third is the
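
As an alternative to the regular expression, the same result can probably be reached by filtering whole words. This is a minimal sketch rather than the answer's method: it assumes every value in `text` is a non-null string, and the names `tokens` and `common` are illustrative. Because whole tokens are compared, a common word such as `row` would not be stripped out of a longer word such as `rows`, and no regex escaping is needed.

```python
import pandas as pd

df = pd.DataFrame({'category': [0, 1, 2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})

# Split each row into a list of words
tokens = df['text'].str.split()

# Words that occur in every row
common = set.intersection(*map(set, tokens))

# Keep, in their original order, only the words that are not common to all rows
df['text'] = tokens.apply(
    lambda words: ' '.join(word for word in words if word not in common)
)

print(df)
```

For the example data, `common` contains `this`, `text` and `row`, and the remaining words keep their original order within each row.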
