Tags: python, regex, string
There are several questions about using regular expressions to strip non-alphanumeric characters from a string. What I want to do is different: remove every character, letters included, that comes after the first character that is not a letter or a single space (digits and double spaces both count as such a character).
For example:
    My string is #not very beautiful
should become
    My string is
or
    Are you 9 years old?
should become
    Are you
and
    this is the last  example
should become
    this is the last
How can I do this?
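How about splitting on `[^A-Za-z ]|  ` (note the two spaces after the `|`) and taking the first element? You can strip any leftover whitespace afterwards:
    import re
    # the pattern ends with TWO spaces: the second alternative matches a double space
    re.split("[^A-Za-z ]|  ", "My string is #not very beautiful")[0].strip()
    # 'My string is'
    re.split("[^A-Za-z ]|  ", "this is the last  example")[0].strip()
    # 'this is the last'
    re.split("[^A-Za-z ]|  ", "Are you 9 years old?")[0].strip()
    # 'Are you'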
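`[^A-Za-z ]|  ` contains two alternatives: the first matches a single character that is neither a letter nor a space; the second matches a double space. Split on whichever of the two occurs first, and the first element of the result is the part you are looking for.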
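The same idea can be wrapped in a small helper. This is only a sketch: the names `CUTOFF` and `truncate_at_first_invalid` are made up for illustration, and passing `maxsplit=1` is an optional addition that stops the split after the first match, since only the first element is used.

```python
import re

# Hypothetical name; the pattern is the one discussed above:
# a character that is neither a letter nor a space, OR a double space.
CUTOFF = re.compile(r"[^A-Za-z ]|  ")

def truncate_at_first_invalid(text):
    """Return the part of `text` before the first cut-off match, stripped."""
    # maxsplit=1: everything after the first match is discarded anyway
    head = CUTOFF.split(text, maxsplit=1)[0]
    return head.strip()

for sample in ("My string is #not very beautiful",
               "Are you 9 years old?",
               "this is the last  example"):
    print(repr(truncate_at_first_invalid(sample)))
```

A string with no cut-off character at all comes back unchanged apart from the surrounding whitespace being stripped.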
Create a whitelist and stop as soon as you see something that is not in it:
    import itertools
    import string

    def rstrip(s, whitelist=None):
        if whitelist is None:
            # default whitelist: all letters A-Z and a-z, plus the space
            whitelist = set(string.ascii_letters + ' ')
        # split on a double space and take the first piece (this works even if
        # the string contains no double space)
        # use `itertools.takewhile` to keep the leading characters that are in the whitelist
        # use `join` to join them into one single string
        # note: a single space right before the cut-off character is kept;
        # call .strip() on the result to drop it
        return ''.join(itertools.takewhile(whitelist.__contains__, s.split('  ', 1)[0]))
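As a usage sketch for the whitelist approach, the function is repeated below so the snippet runs on its own. The `allow_digits` whitelist is an illustrative assumption, not part of the answer; it shows how passing a different set changes where the string is cut.

```python
import itertools
import string

def rstrip(s, whitelist=None):
    if whitelist is None:
        # default: ASCII letters plus the space character
        whitelist = set(string.ascii_letters + ' ')
    # cut at the first double space, then keep leading whitelisted characters
    return ''.join(itertools.takewhile(whitelist.__contains__, s.split('  ', 1)[0]))

# Default whitelist: a single space before the cut-off character survives,
# so .strip() is applied to get the output the question asks for.
print(repr(rstrip("My string is #not very beautiful").strip()))
print(repr(rstrip("this is the last  example")))

# Hypothetical custom whitelist that also accepts digits.
allow_digits = set(string.ascii_letters + string.digits + ' ')
print(repr(rstrip("Are you 9 years old?", allow_digits)))
```

With digits allowed, the third string is cut only at the question mark, while the double-space rule still applies regardless of the whitelist.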