python - How do I concatenate consecutive rows in pandas without losing the record order?

sun*_*ad1 0 python concatenation dataframe pandas

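I am trying to concatenate rows based on a categorical variable whenever consecutive rows have the same category. Here is my data, for example: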

   SNo  user    Text
0   1   Sam     Hello
1   1   John    Hi
2   1   Sam     How are you?
3   1   John    I am good
4   1   John    How about you?
5   1   John    How is it going?
6   1   Sam     Its going good
7   1   Sam     Thanks
8   2   Mary    Howdy?
9   2   Jake    Hey!!
10  2   Jake    What a surprise
11  2   Mary    Good to see you here :)
12  2   Jake    Ha ha. Hectic life
13  2   Mary    I know right..
14  2   Mary    How's Amy doing?
15  2   Mary    How are the kids?
16  2   Jake    All is good! :)

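Here, whenever the same user appears in consecutive rows, I want to concatenate the Text values of those rows into a single row, and keep doing so until the user changes. A sample output is given below: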

SNo user    Text
1   Sam     Hello
1   John    Hi
1   Sam     How are you?
1   John    I am good-How about you?-How is it going?
1   Sam     Its going good-Thanks
2   Mary    Howdy?
2   Jake    Hey!!-What a surprise
2   Mary    Good to see you here :)
2   Jake    Ha ha. Hectic life
2   Mary    I know right..-How's Amy doing?-How are the kids?
2   Jake    All is good! :)

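I tried df.groupby() followed by .agg() to do the concatenation, but could not apply the above condition. As a result, the output combines all of a user's messages within a chat, not just the consecutive ones.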

# Incorrect: merges every message a user sent within an SNo, not only consecutive ones
df = sample_data.groupby(["SNo", "user"]).agg({'Text': '-'.join}).reset_index()
df

Also, I am trying to avoid for loops like the plague and would prefer a vectorized solution.


Sample data:

import pandas as pd

data_dict = {
    'SNo': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'user': ['Sam', 'John', 'Sam', 'John', 'John', 'John', 'Sam', 'Sam', 'Mary', 'Jake', 'Jake', 'Mary', 'Jake', 'Mary', 'Mary', 'Mary', 'Jake'],
    'Text': ['Hello', 'Hi', 'How are you?', 'I am good', 'How about you?', 'How is it going?', 'Its going good', 'Thanks', 'Howdy?', 'Hey!!', 'What a surprise', 'Good to see you here :)', 'Ha ha. Hectic life', 'I know right..', "How's Amy doing?", 'How are the kids?', 'All is good! :)'],
}

sample_data = pd.DataFrame(data_dict)

Qua*_*ang 5

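You want to compare user with its shift and take the cumsum of the changes, which gives every run of consecutive identical users its own label. Then you can group by those labels: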

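# df is the input frame (sample_data in the question).
# A new block starts wherever user differs from the previous row.
blocks = df['user'].ne(df['user'].shift()).cumsum()
(df.groupby(['SNo', blocks])
  .agg({'user':'first','Text': '-'.join})
  .reset_index('user', drop=True)  # drop the block-id level, which is named 'user'
)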

Output:

     user                                               Text
SNo                                                         
1     Sam                                              Hello
1    John                                                 Hi
1     Sam                                       How are you?
1    John          I am good-How about you?-How is it going?
1     Sam                              Its going good-Thanks
2    Mary                                             Howdy?
2    Jake                              Hey!!-What a surprise
2    Mary                            Good to see you here :)
2    Jake                                 Ha ha. Hectic life
2    Mary  I know right..-How's Amy doing?-How are the kids?
2    Jake                                    All is good! :)
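
To see why the shift/cumsum trick works, it can help to print the intermediate values. The following is a minimal sketch on a small made-up frame: the rows and the names changed, steps and result are illustrative and do not come from the question. It also renames the block ids to "block" and uses named aggregation, which is a stylistic assumption and not the exact code above.

```python
import pandas as pd

# Small hypothetical frame (not the question's full data).
df = pd.DataFrame({
    "SNo":  [1, 1, 1, 1, 2, 2],
    "user": ["Sam", "John", "John", "Sam", "Sam", "Jake"],
    "Text": ["Hello", "Hi", "How are you?", "Good", "Howdy?", "Hey!!"],
})

# True wherever the user differs from the row above. The first row is
# always True because shift() leaves a missing value there.
changed = df["user"].ne(df["user"].shift())

# The running total of those flags gives each run of identical users one id.
blocks = changed.cumsum().rename("block")

# Show the flags and block ids next to the original rows.
steps = df.assign(changed=changed, block=blocks)
print(steps)

# Grouping by SNo as well keeps a run from spanning two conversations:
# rows 3 and 4 are both Sam but belong to different SNo values.
result = (
    df.groupby(["SNo", blocks], sort=False)
      .agg(user=("user", "first"), Text=("Text", "-".join))
      .reset_index(level="block", drop=True)  # discard the helper block id
      .reset_index()                          # turn SNo back into a column
)
print(result)
```

Because the block ids only ever increase from top to bottom, the grouped rows come out in the original record order, which is what the question asks for.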