How to split long strings in a pandas column by punctuation

Question

I have a df that looks like this:

words                                              col_a   col_b  
I guess, because I have thought over that. Um,       1       0 
That? yeah.                                          1       1
I don't always think you're up to something.         0       1

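I want to split df.words into separate rows wherever a punctuation mark (.,?!:;) occurs. However, each new row should keep the col_a and col_b values of the original row. For example, the df above should end up like this: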

words                                              col_a   col_b  
I guess,                                             1       0
because I have thought over that.                    1       0
Um,                                                  1       0 
That?                                                1       1
yeah.                                                1       1
I don't always think you're up to something.         0       1

Answer 1

yat*_*atu 5

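One approach is to use str.findall with the pattern (.*?[.,?!:;]), which non-greedily matches each of these punctuation marks together with the characters preceding it, and then explode the resulting lists: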

(df.assign(words=df.words.str.findall(r'(.*?[.,?!:;])'))
   .explode('words')
   .reset_index(drop=True))

                                          words  col_a  col_b
0                                      I guess,      1      0
1             because I have thought over that.      1      0
2                                           Um,      1      0
3                                         That?      1      1
4                                         yeah.      1      1
5  I don't always think you're up to something.      0      1
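
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample in the question, and the extra `str.strip()` step is an addition rather than part of the original answer: the pattern keeps the space that follows each punctuation mark at the start of the next fragment, so stripping is assumed to be wanted.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "words": [
        "I guess, because I have thought over that. Um,",
        "That? yeah.",
        "I don't always think you're up to something.",
    ],
    "col_a": [1, 1, 0],
    "col_b": [0, 1, 1],
})

# Non-greedy match: any characters up to and including one punctuation mark
pattern = r"(.*?[.,?!:;])"

result = (
    df.assign(words=df["words"].str.findall(pattern))  # list of fragments per row
      .explode("words")                                # one row per fragment, other columns repeated
      .reset_index(drop=True)
      # Added step (not in the original answer): drop the leading space on each fragment
      .assign(words=lambda d: d["words"].str.strip())
)

print(result)
```

Note that this pattern only keeps text that ends in one of the listed punctuation marks. Anything after the last mark in a row is dropped, and a row with no punctuation at all becomes NaN after explode.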

I was going to use "split", but this works (-: Sorry, I meant to say this is better. (3 upvotes)
