Python DataFrame: remove duplicate words within the same cell of a column

Question

Pin*_*ts0 3 python string dataframe pandas

Shown below is the data column I have, along with a second column containing the de-duplicated result I want.

Honestly, I don't even know how to start doing this in Python. I have read several posts about doing it in R, but none for Python.

Answer 1

If you only want to eliminate consecutive duplicates, this is enough:

df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
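
As a minimal, self-contained sketch of the replacement above: the first four rows are taken from the output table, and the last row ('Dog Cat Dog') is a hypothetical addition to show that this pattern leaves non-consecutive duplicates alone. `regex=True` is passed explicitly because pandas 2.0 and later treat the pattern as a literal string by default.

```python
import pandas as pd

# Rows from the output table above, plus one hypothetical row
# ('Dog Cat Dog') whose duplicate is NOT consecutive.
df = pd.DataFrame({'Current': ['Racoon Dog', 'Cat Cat', 'Dog Dog Dog Dog',
                               'Rat Fox Chicken', 'Dog Cat Dog']})

# Collapse each run of the same word into a single copy.
df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)

print(df['Desired'].tolist())
```

The 'Dog Cat Dog' row comes back unchanged, since the two occurrences of 'Dog' are not adjacent.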

Details

\b        # word boundary
(\w+)     # 1st capture group: a single word
(
\s+       # 1 or more whitespace characters
\1        # backreference to the 1st group
)+        # one or more repeats
\b        # word boundary
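
The annotated pattern can also be tried outside pandas. The sketch below compiles it with `re.VERBOSE`, which lets the comments live inside the pattern itself. The names `CONSECUTIVE_DUPES` and `collapse_consecutive` are made up for this illustration.

```python
import re

# Same pattern as above; re.VERBOSE ignores the whitespace and comments.
CONSECUTIVE_DUPES = re.compile(r"""
    \b          # word boundary
    (\w+)       # group 1: a single word
    (
        \s+     # one or more whitespace characters
        \1      # the same word again (backreference to group 1)
    )+          # one or more repeats
    \b          # word boundary
""", re.VERBOSE)


def collapse_consecutive(text):
    """Replace each run of a repeated word with a single copy (hypothetical helper)."""
    return CONSECUTIVE_DUPES.sub(r"\1", text)


print(collapse_consecutive("Dog Dog Dog Dog"))  # Dog
print(collapse_consecutive("Cat Catalog"))      # Cat Catalog
```

The trailing `\b` is what keeps 'Cat Catalog' intact: the backreference matches 'Cat' inside 'Catalog', but no word boundary follows it, so the match fails.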

(Regex taken from here.)

To remove non-consecutive duplicates as well, I suggest a solution based on the OrderedDict data structure:

from collections import OrderedDict

df['Desired'] = (df['Current'].str.split()
                              .apply(lambda x: OrderedDict.fromkeys(x).keys())
                              .str.join(' '))
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
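
The sample rows above contain only consecutive duplicates, so they do not show what the order-preserving approach adds. The sketch below is a plain-Python variant of the same idea, not the answer's exact code: it swaps in the built-in `dict`, which preserves insertion order on Python 3.7 and later, for `OrderedDict`. The helper name `dedupe_words` and the 'Dog Cat Dog' input are assumptions made for illustration.

```python
def dedupe_words(text):
    """Drop repeated words, keeping the first occurrence of each (hypothetical helper)."""
    # dict.fromkeys keeps one key per distinct word, in first-seen order.
    return ' '.join(dict.fromkeys(text.split()))


print(dedupe_words('Dog Cat Dog'))      # Dog Cat
print(dedupe_words('Dog Dog Dog Dog'))  # Dog
```

Assuming the column has no missing values, the helper could be applied with `df['Current'].map(dedupe_words)`. The comparison is case-sensitive, so 'Dog' and 'dog' count as different words.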

Asked:	8 years, 2 months ago
Viewed:	1893 times
Last active:	6 years, 10 months ago