Pandas DataFrame - how to remove the repeated words in a column

mar*_*ark 5 python dataframe pandas

I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'category':[0,1,2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})
df.head()

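I would like to get the following result (each row without the words that are repeated in every row):

Expected result (for the example above):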

category     text
0            is some for the first
1            second has
2            third is the

With the following code, I tried to convert all of the data in the rows to a single string:

final_list =[]
for index, rows in df.iterrows():
    # Create list for the current row
    my_list =rows.text
    # append the list to the final list
    final_list.append(my_list)
# Print the list
print(final_list)
text=''

for i in range(len(final_list)):
    text+=final_list[i]+', '

print(text)

The idea from this question ("pandas dataframe - how to find words that repeat in each row") did not help me get the expected result.

arr = [set(x.split()) for x in text.split(',')]
mutual_words = set.intersection(*arr)
result = [list(x.difference(mutual_words)) for x in arr]
result = sum(result, [])
final_text = (", ").join(result)
print(final_text)

Does anyone know how to get it?

Shu*_*rma 3

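You can use Series.str.split to split the text column on whitespace, then use reduce to get the intersection of the words found in all rows, and finally use str.replace to remove the common words: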

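from functools import reduce

# Words that occur in every row: intersect the per-row word lists
w = reduce(lambda x, y: set(x) & set(y), df['text'].str.split())
# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
df['text'] = df['text'].str.replace(rf"(\s*)(?:{'|'.join(w)})\s*", r'\1', regex=True).str.strip()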
   category                    text
0         0   is some for the first
1         1              second has
2         2            third is the 
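
As a possible alternative, the same idea can be sketched without a regular expression by filtering each row word by word. The pattern above has no word boundaries, so a common word such as "row" would also be removed from inside a longer word like "arrow"; comparing whole words avoids that. The helper name `drop_common` and the variable `common` are assumptions made for this illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'category': [0, 1, 2],
                   'text': ['this is some text for the first row',
                            'second row has this text',
                            'third row this is the text']})

# Words present in every row: intersect the per-row word sets.
common = set.intersection(*(set(t.split()) for t in df['text']))


def drop_common(text, common_words):
    """Hypothetical helper: keep the words of `text` that are not in
    `common_words`, preserving their original order."""
    return ' '.join(word for word in text.split() if word not in common_words)


# Rebuild each row from the words that are not shared by all rows.
df['text'] = df['text'].apply(lambda t: drop_common(t, common))
print(df)
```

For the sample data the common words are `this`, `text`, and `row`, so the rows become `is some for the first`, `second has`, and `third is the`, which matches the expected result in the question.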