I have a string that looks like this:
this is "a test"
I'm trying to write something in Python that splits it on spaces while ignoring spaces inside quotes. The result I'm looking for is:
['this','is','a test']
PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.
Jer*_*rub 367
You want split, from the shlex module.
>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']
This should do exactly what you want.
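One thing to watch for: shlex.split raises ValueError ("No closing quotation") when a quote is left open. The sketch below wraps the call so that such input falls back to a plain whitespace split. The helper name safe_split and the fallback policy are assumptions for illustration, not part of the answer above.

```python
import shlex


def safe_split(text):
    """Hypothetical helper: shell-style split with a plain-split fallback."""
    try:
        return shlex.split(text)
    except ValueError:
        # shlex raises ValueError for unbalanced quotes,
        # e.g. 'this is "a test' (no closing quotation).
        return text.split()


print(safe_split('this is "a test"'))  # ['this', 'is', 'a test']
print(safe_split('this is "a test'))   # ['this', 'is', '"a', 'test']
```

Whether a silent fallback is appropriate depends on the application; re-raising or logging the error may be the better choice for untrusted input.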
All*_*len 55
Have a look at the shlex module, particularly shlex.split.
>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']
小智 37
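I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?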
import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
# pieces == ['this', 'is', '"a test"']  (the quote characters are kept)
Explanation:
[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators
However, shlex probably provides more functionality.
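As a rough illustration of that extra functionality, the sketch below shows three shlex.split behaviours: single quotes, backslash-escaped quotes inside double quotes, and the posix=False mode, which keeps the quote characters. The sample strings are made up for the example.

```python
import shlex

# Single quotes are honoured as well as double quotes.
single = shlex.split("this is 'a test'")
print(single)   # ['this', 'is', 'a test']

# Inside double quotes, a backslash can escape a double quote.
escaped = shlex.split('say "a \\"quoted\\" word"')
print(escaped)  # ['say', 'a "quoted" word']

# posix=False keeps the surrounding quote characters in the tokens.
kept = shlex.split('this is "a test"', posix=False)
print(kept)     # ['this', 'is', '"a test"']
```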
Rya*_*rom 26
Depending on your use case, you may also want to check out the csv module:
import csv

lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)
Output:
['this', 'is', 'a string']
['and', 'more', 'stuff']
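Note that csv.reader only recognises one quote character, the double quote by default. The sketch below shows the effect on single-quoted input and how the quotechar parameter changes it. The wrapper name csv_split_line is an assumption for illustration.

```python
import csv


def csv_split_line(text, quotechar='"'):
    """Hypothetical wrapper: split one line on spaces using csv.reader."""
    return next(csv.reader([text], delimiter=" ", quotechar=quotechar))


print(csv_split_line('this is "a test"'))                 # ['this', 'is', 'a test']
# With the default quotechar, single quotes are ordinary characters.
print(csv_split_line("this is 'a test'"))                 # ['this', 'is', "'a", "test'"]
# Switching quotechar makes single quotes group instead.
print(csv_split_line("this is 'a test'", quotechar="'"))  # ['this', 'is', 'a test']
```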
Dan*_*Dai 14
I used shlex.split to process 70,000,000 lines of squid log, and it was too slow, so I switched to re.
If you have performance problems with shlex, please try this.
import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
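The findall pattern above returns quoted tokens with their quote characters still attached. If the result asked for in the question is needed, a post-processing step can strip them. The following is a minimal sketch; the names TOKEN_RE and line_split_unquoted are assumptions for illustration.

```python
import re

# Same pattern as above, compiled once for reuse.
TOKEN_RE = re.compile(r'[^"\s]\S*|".+?"')


def line_split_unquoted(line):
    """Hypothetical variant that drops the surrounding double quotes."""
    tokens = TOKEN_RE.findall(line)
    return [
        t[1:-1] if len(t) >= 2 and t[0] == t[-1] == '"' else t
        for t in tokens
    ]


print(TOKEN_RE.findall('this is "a test"'))     # ['this', 'is', '"a test"']
print(line_split_unquoted('this is "a test"'))  # ['this', 'is', 'a test']
```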
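Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split on whitespace, then replace \x00 back with a space in each part.
Both versions do the same thing, but splitter is a bit more readable than splitter2.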
import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        # Protect spaces inside a quoted part with a placeholder character.
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    # Restore the protected spaces in each part.
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
Speed test of the different answers:
import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
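The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and it gives slightly unexpected results in some corner cases.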
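I have the following use case: I need a split function that splits input strings such that single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output: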
 input string          | expected output
=====================================================
 'abc def'             | ['abc', 'def']
 "abc \\s def"         | ['abc', '\\s', 'def']
 '"abc def" ghi'       | ['abc def', 'ghi']
 "'abc def' ghi"       | ['abc def', 'ghi']
 '"abc \\" def" ghi'   | ['abc " def', 'ghi']
 "'abc \\' def' ghi"   | ["abc ' def", 'ghi']
 "'abc \\s def' ghi"   | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi'   | ['abc \\s def', 'ghi']
 '"" test'             | ['', 'test']
 "'' test"             | ['', 'test']
 "abc'def"             | ["abc'def"]
 "abc'def'"            | ["abc'def'"]
 "abc'def' ghi"        | ["abc'def'", 'ghi']
 "abc'def'ghi"         | ["abc'def'ghi"]
 'abc"def'             | ['abc"def']
 'abc"def"'            | ['abc"def"']
 'abc"def" ghi'        | ['abc"def"', 'ghi']
 'abc"def"ghi'         | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"    | ["r'AA'", "r'.*_xyz$'"]
 'abc"def ghi"'        | ['abc"def ghi"']
 'abc"def ghi""jkl"'   | ['abc"def ghi""jkl"']
 'a"b c"d"e"f"g h"'    | ['a"b c"d"e"f"g h"']
 'c="ls /" type key'   | ['c="ls /"', 'type', 'key']
 "abc'def ghi'"        | ["abc'def ghi'"]
 "c='ls /' type key"   | ["c='ls /'", 'type', 'key']
I ended up with the following function, which produces the expected output for all of these input strings:
import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]
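It isn't pretty, but it works. The following test application checks the results of the other approaches (shlex and csv for now) and of the custom split implementation: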
#!/bin/python2.7
import csv
import re
import shlex
from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print('[ OK ] %s -> %s' % (s, fn(s)))
        else:
            print('[FAIL] %s -> %s' % (s, fn(s)))
    except Exception as e:
        print('[FAIL] %s -> exception: %s' % (s, e))

def test_case_no_output(fn, s, expected):
    # Used for benchmarking: run the splitter, ignore results and errors.
    try:
        fn(s)
    except Exception:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])
    test_case_fn(fn, 'abc"def ghi"', ['abc"def ghi"'])
    test_case_fn(fn, 'abc"def ghi""jkl"', ['abc"def ghi""jkl"'])
    test_case_fn(fn, 'a"b c"d"e"f"g h"', ['a"b c"d"e"f"g h"'])
    test_case_fn(fn, 'c="ls /" type key', ['c="ls /"', 'type', 'key'])
    test_case_fn(fn, "abc'def ghi'", ["abc'def ghi'"])
    test_case_fn(fn, "c='ls /' type key", ["c='ls /'", 'type', 'key'])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

if __name__ == '__main__':
    print('shlex\n')
    test_split(shlex.split)
    print('')

    print('csv\n')
    test_split(csv_split)
    print('')

    print('re\n')
    test_split(re_split)
    print('')

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print('%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations)))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')
Output:
shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']
[FAIL] abc"def ghi" -> ['abcdef ghi']
[FAIL] abc"def ghi""jkl" -> ['abcdef ghijkl']
[FAIL] a"b c"d"e"f"g h" -> ['ab cdefg h']
[FAIL] c="ls /" type key -> ['c=ls /', 'type', 'key']
[FAIL] abc'def ghi' -> ['abcdef ghi']
[FAIL] c='ls /' type key -> ['c=ls /', 'type', 'key']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]
[FAIL] abc"def ghi" -> ['abc"def', 'ghi"']
[FAIL] abc"def ghi""jkl" -> ['abc"def', 'ghi""jkl"']
[FAIL] a"b c"d"e"f"g h" -> ['a"b', 'c"d"e"f"g', 'h"']
[FAIL] c="ls /" type key -> ['c="ls', '/"', 'type', 'key']
[FAIL] abc'def ghi' -> ["abc'def", "ghi'"]
[FAIL] c='ls /' type key -> ["c='ls", "/'", 'type', 'key']

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]
[ OK ] abc"def ghi" -> ['abc"def ghi"']
[ OK ] abc"def ghi""jkl" -> ['abc"def ghi""jkl"']
[ OK ] a"b c"d"e"f"g h" -> ['a"b c"d"e"f"g h"']
[ OK ] c="ls /" type key -> ['c="ls /"', 'type', 'key']
[ OK ] abc'def ghi' -> ["abc'def ghi'"]
[ OK ] c='ls /' type key -> ["c='ls /'", 'type', 'key']

shlex: 0.335ms per iteration
csv: 0.036ms per iteration
re: 0.068ms per iteration
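So the performance of the re approach is much better than that of shlex, and it can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.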
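A precompiled variant might look like the sketch below. It uses the same pattern as quoted_split, split across three raw-string pieces for readability. The names _TOKEN_RE and quoted_split_compiled are assumptions for illustration, and no timing is claimed here: Python's re module already caches compiled patterns internally, so any gain comes mainly from skipping the cache lookup on each call.

```python
import re

# Same three alternatives as in quoted_split, compiled once at import time:
# tokens containing double-quoted parts, tokens containing single-quoted
# parts, and plain runs of non-whitespace.
_TOKEN_RE = re.compile(
    r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+'
    r"|(?:[^'\s]*'(?:\\.|[^'])*'[^'\s]*)+"
    r'|[^\s]+'
)


def _strip_quotes(token):
    """Drop one pair of matching outer quotes, if present."""
    if token and token[0] in ('"', "'") and token[0] == token[-1]:
        return token[1:-1]
    return token


def quoted_split_compiled(s):
    """Hypothetical precompiled version of quoted_split."""
    return [
        _strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
        for p in _TOKEN_RE.findall(s)
    ]


print(quoted_split_compiled('"abc def" ghi'))      # ['abc def', 'ghi']
print(quoted_split_compiled('"abc \\" def" ghi'))  # ['abc " def', 'ghi']
```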
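It seems that, for performance reasons, re is the faster choice. Here is my solution, which uses a non-greedy (least greedy) operator and preserves the outer quotes: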
re.findall(r'(?:".*?"|\S)+', s)
Result:
['this', 'is', '"a test"']
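It keeps constructs like aaa"bla blub"bbb together, because these tokens are not separated by spaces. If the string contains escaped characters, you can match like this: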
>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i)
...
She
said
"He said, \"My name is Mark.\""
Please note that this also matches the empty string "" by means of the \S part of the pattern.
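The two behaviours described above can be checked with a small sketch. The pattern is the escape-aware one from this answer, written as a raw string; the name PATTERN and the sample inputs are assumptions for illustration.

```python
import re

# Escape-aware pattern from the answer, written as a raw string.
PATTERN = re.compile(r'(?:".*?[^\\]"|\S)+')

# A quoted part glued to unquoted text stays in one token.
glued = PATTERN.findall('x aaa"bla blub"bbb y')
print(glued)  # ['x', 'aaa"bla blub"bbb', 'y']

# An empty pair of quotes is picked up by the \S alternative,
# one quote character at a time, and kept as a token.
empty = PATTERN.findall('a "" b')
print(empty)  # ['a', '""', 'b']
```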