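I am trying to split a string around a specific phrase that may or may not begin with a particular word, and I am struggling to find the right syntax.
Here is the current version of the code: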
import re
from pprint import pprint
text = """Here is a list: Bob talked to Caleb, and Caleb talked to Derek, and Derek talked to Eric, and Eric talked to Fred, and Fred talked to Greg, and Greg talked to Henry, and Henry talked to Isaac, and Isaac talked to Jesse, and Jesse talked to Ken."""
pprint(re.split(r"(a?n?d? ?\w+ talked to)",text))
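In this example I want to split on "Bob talked to" or "and Caleb talked to", so the "and " should be part of the match if it is present and left out if it is not.
This code produces an almost correct result: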
['Here is a list:',
' Bob talked to',
' Caleb, ',
'and Caleb talked to',
' Derek, ',
'and Derek talked to',
' Eric, ',
'and Eric talked to',
' Fred, ',
'and Fred talked to',
' Greg, ',
'and Greg talked to',
' Henry, ',
'and Henry talked to',
' Isaac, ',
'and Isaac talked to',
' Jesse, ',
'and Jesse talked to',
' Ken.']
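The only small error is the space in front of "Bob", which is captured because of the " ?" in the regular expression. I also do not want to make each letter optional with "a?n?d? ?". I would rather use "(and )?".
Unfortunately, that does not give the result I hoped for: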
pprint(re.split(r"((and )?\w+ talked to)",text))
gives me:
['Here is a list: ',
'Bob talked to',
None,
' Caleb, ',
'and Caleb talked to',
'and ',
' Derek, ',
'and Derek talked to',
'and ',
' Eric, ',
'and Eric talked to',
'and ',
' Fred, ',
'and Fred talked to',
'and ',
' Greg, ',
'and Greg talked to',
'and ',
' Henry, ',
'and Henry talked to',
'and ',
' Isaac, ',
'and Isaac talked to',
'and ',
' Jesse, ',
'and Jesse talked to',
'and ',
' Ken.']
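Here the optional "and " is returned as a second, separate item after every match (or None when it is absent). I could perhaps work with this, but it would be better if each match came back as a single unit.
Another option might be: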
pprint(re.split(r"([and ]?\w+ talked to)",text))
which gives:
['Here is a list:',
' Bob talked to',
' Caleb, and',
' Caleb talked to',
' Derek, and',
' Derek talked to',
' Eric, and',
' Eric talked to',
' Fred, and',
' Fred talked to',
' Greg, and',
' Greg talked to',
' Henry, and',
' Henry talked to',
' Isaac, and',
' Isaac talked to',
' Jesse, and',
' Jesse talked to',
' Ken.']
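In this case the "and" is not included in the match even when it is available, because "[and ]" is a character class that matches only a single character. So how can I make "and " optional as a single unit? In other words, "and " should be either entirely in or entirely out, never partially.
I think this is what you want: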
((?:and )?\w+ talked to)
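Here (?:and ) is a non-capturing group: it takes part in the match, but it is not captured, so re.split does not return it as a separate item.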
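As a minimal sketch of how this pattern behaves with re.split, the example below uses a shortened version of the sample text from the question. The names `pattern`, `parts`, `delimiters`, and `payloads` are only illustrative and do not appear in the original code.

```python
import re
from pprint import pprint

# Shortened version of the sample text from the question.
text = ("Here is a list: Bob talked to Caleb, and Caleb talked to Derek, "
        "and Derek talked to Eric.")

# The outer group is capturing, so re.split keeps each delimiter in the result.
# (?:and )? is non-capturing, so it adds no extra list item of its own.
pattern = re.compile(r"((?:and )?\w+ talked to)")

parts = pattern.split(text)
pprint(parts)

# With exactly one capturing group, delimiters sit at the odd indices and the
# text following each delimiter sits at the next even index.
delimiters = parts[1::2]
payloads = parts[2::2]
pprint(list(zip(delimiters, payloads)))
```

Because there is only one capturing group, the result alternates strictly between plain text and delimiter, which makes it easy to pair each "X talked to" with the text that follows it.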