Ath*_*R T 4 python regex dataframe pandas
I have a Pandas DataFrame that looks like this:
ID Col.A
28654 This is a dark chocolate which is sweet
39876 Sky is blue 1234 Sky is cloudy 3423
88776 Stars can be seen in the dark sky
35491 Schools are closed 4568 but shops are open
I am trying to split Col.A at the word dark or at any run of digits. The result I want is shown below.
ID     Col.A                     Col.B
28654  This is a                 dark chocolate which is sweet
39876  Sky is blue               1234 Sky is cloudy 3423
88776  Stars can be seen in the  dark sky
35491  Schools are closed        4568 but shops are open
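I tried putting the rows that contain the word dark into one DataFrame and the rows that contain digits into another, then splitting each accordingly. After that I can concatenate the resulting DataFrames to get the expected result. The code is as follows: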
import pandas as pd

df = pd.DataFrame({'ID': [28654, 39876, 88776, 35491],
                   'Col.A': ['This is a dark chocolate which is sweet',
                             'Sky is blue 1234 Sky is cloudy 3423',
                             'Stars can be seen in the dark sky',
                             'Schools are closed 4568 but shops are open']})
# Rows that contain the word "dark"
df1 = df[df['Col.A'].str.contains(' dark ')]
# Remaining rows (anti-join against df1)
df2 = df.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
# Split each group on its own separator
df1 = df1['Col.A'].str.split(' dark ', expand=True)
df2 = df2['Col.A'].str.split(r'\d+', expand=True)
pd.concat([df1, df2], axis=0)
The result I get differs from what I expected, namely:
   0                         1
0  This is a                 chocolate which is sweet
2  Stars can be seen in the  sky
1  Sky is blue               Sky is cloudy
3  Schools are closed        but shops are open
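The digits and the word dark are missing from the result.
So how can I fix this and get the result without losing the word and the digits I split on?
Is there a way to "slice before the expected word or digits" without removing them?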
Series.str.split
s = df['Col.A'].str.split(r'\s+(?=\b(?:dark|\d+)\b)', n=1, expand=True)
df[['ID']].join(s.set_axis(['Col.A', 'Col.B'], axis=1))
   ID     Col.A                     Col.B
0  28654  This is a                 dark chocolate which is sweet
1  39876  Sky is blue               1234 Sky is cloudy 3423
2  88776  Stars can be seen in the  dark sky
3  35491  Schools are closed        4568 but shops are open
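Regex details:
\s+ : matches any whitespace character one or more times
(?=\b(?:dark|\d+)\b) : positive lookahead
    \b : word boundary, to prevent partial matches
    (?:dark|\d+) : non-capturing group
        dark : first alternative, matches the characters dark literally
        \d+ : second alternative, matches any digit one or more times
    \b : word boundary, to prevent partial matches
See the online regex demo.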
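To see why the lookahead keeps the word or number in the output, here is a minimal sketch that applies the same pattern with Python's standard re module instead of pandas. The names PATTERN, samples, parts and no_split, and the extra test string "Fear of darkness", are illustrative assumptions and not part of the original answer.

```python
import re

# Same pattern as in the answer: whitespace that is followed (lookahead) by
# the whole word "dark" or by a run of digits. The lookahead consumes
# nothing, so the word or number stays at the start of the second piece.
PATTERN = re.compile(r'\s+(?=\b(?:dark|\d+)\b)')

samples = [
    'This is a dark chocolate which is sweet',
    'Sky is blue 1234 Sky is cloudy 3423',
    'Stars can be seen in the dark sky',
    'Schools are closed 4568 but shops are open',
]

# maxsplit=1 mirrors n=1 in Series.str.split: split only at the first match.
parts = [PATTERN.split(text, maxsplit=1) for text in samples]
for left, right in parts:
    print(repr(left), '|', repr(right))

# The word boundaries prevent a split inside a longer word such as "darkness".
no_split = PATTERN.split('Fear of darkness', maxsplit=1)
print(no_split)
```

Only the whitespace is matched and discarded; everything inside the lookahead is merely checked, which is what lets the split happen "before" the word or number.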
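With the samples you have shown, please try the following, which uses the str.extract function of Pandas. Simple explanation: pass str.extract a regex whose first capture group is a non-greedy match and whose second capture group starts at the digits or the string dark and runs to the end of the line, then save the two groups into the Col.A and Col.B columns.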
df[["Col.A","Col.B"]] = df['Col.A'].str.extract(r'(.*?)((?:dark|\d+).*)', expand=True)
df
The output for the sample data is as follows:
   ID     Col.A                     Col.B
0  28654  This is a                 dark chocolate which is sweet
1  39876  Sky is blue               1234 Sky is cloudy 3423
2  88776  Stars can be seen in the  dark sky
3  35491  Schools are closed        4568 but shops are open
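One detail worth checking with the str.extract approach: the lazy first group appears to keep the space that preceded the word or number, which is invisible in the printed table. The sketch below is self-contained and uses the same regex; the names parts and result and the rstrip step are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [28654, 39876, 88776, 35491],
    'Col.A': [
        'This is a dark chocolate which is sweet',
        'Sky is blue 1234 Sky is cloudy 3423',
        'Stars can be seen in the dark sky',
        'Schools are closed 4568 but shops are open',
    ],
})

# The regex from the answer: group 1 is a lazy prefix, group 2 starts at the
# first "dark" or digit run and extends to the end of the string.
parts = df['Col.A'].str.extract(r'(.*?)((?:dark|\d+).*)', expand=True)

# Group 1 still ends with the space that preceded the word or number,
# so strip trailing whitespace before storing it.
result = pd.DataFrame({
    'ID': df['ID'],
    'Col.A': parts[0].str.rstrip(),
    'Col.B': parts[1],
})
print(result)
```

Rows that contain neither dark nor digits do not match the pattern, so str.extract returns NaN in both columns for them. Note also that this pattern has no word boundaries, so it would match dark inside a longer word as well.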