Splitting words with a regular expression in Python

Question

I am designing a regular expression that splits a given text into all of its actual words.

Example input:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"

Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]

I came up with this regular expression:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

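After splitting with it in Python, the result contains None items and spaces.

How do I get rid of the None items? And why are the spaces not matched?

Edit:
Splitting on whitespace yields items such as ["there."].
Splitting on non-letters yields items such as ["John", "s"].
Splitting on non-letters other than ' yields items such as ["'Where", "you'"].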
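The three naive approaches listed in the edit can be reproduced with a short sketch. The variable names below are illustrative only and do not come from the question.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# 1) Split on whitespace: punctuation stays glued to the words ("there.").
by_whitespace = s.split()

# 2) Split on runs of non-letters: the apostrophe in "John's" becomes a delimiter.
by_non_letters = re.split(r"[^a-zA-Z]+", s)

# 3) Split on runs of non-letters except the apostrophe: quote marks stay attached.
keep_apostrophes = re.split(r"[^a-zA-Z']+", s)

print(by_whitespace)
print(by_non_letters)
print(keep_apostrophes)
```

Each variant fixes one problem and introduces another, which is why the delimiter pattern has to treat an apostrophe differently depending on its neighbours.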

Answer 1

Fal*_*gel 23

You can use string functions instead of a regular expression:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

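However, in your example you do not want to remove the apostrophe in John's, but you do want to remove it in you!!'. Plain character replacement fails at that point, and you need a finely tuned regular expression.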
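For comparison, here is a minimal sketch of a different string-only technique that is not part of the answer above: split on whitespace first, then strip punctuation from both ends of each token with str.strip and string.punctuation. It handles this particular sentence, but it is only an assumption that this behaviour is acceptable in general, because it would also strip a possessive trailing apostrophe such as the one in Moss'.

```python
import string

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Strip leading and trailing punctuation from each whitespace-separated token.
# Apostrophes inside a token ("John's", "wasn't") are left untouched.
words = [token.strip(string.punctuation) for token in s.split()]

# Drop tokens that consisted of punctuation only.
words = [w for w in words if w]
print(words)
```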

Edit: a simple regular expression may solve your problem:

(\w[\w']*)

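It captures every run that starts with a word character (\w: a letter, digit, or underscore) and keeps going for as long as the next character is an apostrophe or a word character.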

(\w[\w']*\w)

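This second regex is for a very specific situation. The first regex can capture words like you' with the trailing apostrophe attached. The second one avoids that and captures an apostrophe only when it is inside the word (not at the beginning or the end). The trade-off is that the second regex cannot capture the apostrophe in Moss' mom. You have to decide whether to capture trailing apostrophes in names that end with s and denote possession.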
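The trade-off between the two patterns can be seen in a small sketch. The sample sentence and the variable names are made up for illustration.

```python
import re

first = re.compile(r"\w[\w']*")      # may keep a trailing apostrophe
second = re.compile(r"\w[\w']*\w")   # apostrophes only inside a word

sample = "Moss' mom said: 'Where are you'"

first_words = first.findall(sample)
second_words = second.findall(sample)

print(first_words)   # keeps "Moss'" but also yields "you'"
print(second_words)  # yields "you" but reduces "Moss'" to "Moss"
```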

Example:

import re

rgx = re.compile(r"(\w[\w']*\w)")  # raw string; apostrophes allowed only inside a word
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

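Update 2: I found a bug in my regex! Because it requires at least two word characters, it cannot capture a single letter followed by an apostrophe, such as A'. Here is the fixed regex: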

(\w[\w']*\w|\w)

import re

rgx = re.compile(r"(\w[\w']*\w|\w)")  # the second alternative matches a single word character
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
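One caveat: \w also matches digits and underscores, so the pattern above treats 42 as a word. If only letters should count, as in the question's own character classes, a letters-only variant is one possible sketch. The pattern and the names LETTERS_ONLY and WORD_CHARS are assumptions for illustration, not part of the answer.

```python
import re

# Letters, optionally followed by groups of (apostrophe + letters).
# An apostrophe is kept only when letters appear on both sides of it.
LETTERS_ONLY = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")
WORD_CHARS = re.compile(r"\w[\w']*\w|\w")

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
words = LETTERS_ONLY.findall(s)
print(words)

# The two patterns differ once digits are involved.
mixed = "room 42 is John's"
print(WORD_CHARS.findall(mixed))    # includes '42'
print(LETTERS_ONLY.findall(mixed))  # skips '42'
```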

Answer 2

Mar*_*ers 7

Your regular expression has too many capturing groups; make them non-capturing:

(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']

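This returns only one empty element: the trailing '' that appears because the string ends with a delimiter match.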
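To see where the None items in the question come from, the following sketch runs both versions of the pattern on the same string. re.split() adds the text of every capturing group to its result, and a group that did not take part in a match is reported as None. Filtering out empty strings at the end is an extra step added here, not part of the answer.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
non_capturing = r"(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)"

# Four capturing groups: each delimiter contributes four extra items,
# most of them None, plus the matched delimiter text (for example " ").
with_groups = re.split(capturing, s)
print(with_groups[:6])

# Non-capturing groups leave only the pieces between delimiters.
# Dropping empty strings removes the trailing '' left by the final "!!'".
words = [w for w in re.split(non_capturing, s) if w]
print(words)
```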

Archived:	13 years, 2 months ago
Viewed:	16561 times
Last activity:	12 years, 7 months ago