Split a string by spaces in Python, preserving quoted substrings

Ada*_*rce 248 python regex

I have a string like the following:

this is "a test"

I'm trying to write something in Python that splits it on spaces while ignoring spaces inside quotes. The result I'm looking for is:

['this','is','a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.

Jer*_*rub 367

You want split, from the shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

  • Use `posix=False` to preserve the quotes. `shlex.split('this is "a test"', posix=False)` returns `['this', 'is', '"a test"']` (10 upvotes)
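As a minimal sketch of the difference this comment describes (the variable names are only illustrative), the two modes can be compared side by side:

```python
import shlex

text = 'this is "a test"'

# Default POSIX mode: the quotes group the words and are then removed.
posix_tokens = shlex.split(text)

# Non-POSIX mode: the quotes still group the words but stay in the token.
raw_tokens = shlex.split(text, posix=False)

print(posix_tokens)  # ['this', 'is', 'a test']
print(raw_tokens)    # ['this', 'is', '"a test"']
```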

All*_*len 55

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']


小智 37

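I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?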

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.

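  • Why the triple backslashes? Wouldn't a single backslash do the same? (3 upvotes)
  • +1 I'm using this because it is much faster than shlex. (2 upvotes)
  • You should use raw strings when working with regular expressions. (2 upvotes)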
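To illustrate the points raised in these comments, here is a small sketch comparing the escaped pattern from the answer with a raw-string spelling. The names `escaped` and `raw` are introduced here for illustration. Note that this approach keeps the quotes in the resulting tokens.

```python
import re

test = 'this is "a test"'

# Escaped form from the answer: \\\" is a backslash followed by a quote,
# so the regex engine sees \" which simply matches a literal quote.
escaped = "( |\\\".*?\\\"|'.*?')"

# Raw-string form: no backslashes are needed for the quotes at all.
raw = r"""( |".*?"|'.*?')"""

pieces_escaped = [p for p in re.split(escaped, test) if p.strip()]
pieces_raw = [p for p in re.split(raw, test) if p.strip()]

print(pieces_escaped)  # ['this', 'is', '"a test"']
print(pieces_raw)      # ['this', 'is', '"a test"']
```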

Rya*_*rom 26

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']

  • Useful when shlex strips some characters you need (2 upvotes)
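A short sketch of two csv behaviours worth knowing before relying on this approach. The sample strings are made up for illustration: by default only double quotes are recognised, and every single delimiter separates a field.

```python
import csv

# Default dialect: only double quotes are treated as quote characters.
double = next(csv.reader(['this is "a string"'], delimiter=" "))

# quotechar switches the recognised quote character to a single quote.
single = next(csv.reader(["this is 'a string'"], delimiter=" ", quotechar="'"))

# Every delimiter counts, so two consecutive spaces yield an empty field.
doubled_space = next(csv.reader(["a  b"], delimiter=" "))

print(double)         # ['this', 'is', 'a string']
print(single)         # ['this', 'is', 'a string']
print(doubled_space)  # ['a', '', 'b']
```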

Dan*_*Dai 14

I used shlex.split to process 70,000,000 lines of squid log, and it was too slow, so I switched to re.

Try this if you have performance problems with shlex.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
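Note that this pattern keeps the double quotes in the matched tokens. The following is a sketch of a variant, not part of the original answer, that compiles the pattern once and optionally strips the quotes. The name `TOKEN_RE` and the `strip_quotes` parameter are assumptions added for illustration.

```python
import re

# Compile once so the pattern does not have to be resolved for every line.
TOKEN_RE = re.compile(r'[^"\s]\S*|".+?"')

def line_split(line, strip_quotes=False):
    tokens = TOKEN_RE.findall(line)
    if strip_quotes:
        # Drop the surrounding double quotes that the pattern keeps.
        tokens = [
            t[1:-1] if len(t) >= 2 and t[0] == '"' and t[-1] == '"' else t
            for t in tokens
        ]
    return tokens

print(line_split('this is "a test"'))                     # ['this', 'is', '"a test"']
print(line_split('this is "a test"', strip_quotes=True))  # ['this', 'is', 'a test']
```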


eli*_*ner 8

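Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split on spaces, then replace the \x00 back with spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.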

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))


har*_*777 7

Speed test of the different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
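The measurements above use IPython's %timeit magic. For anyone who wants to repeat them in a plain script, here is a rough sketch using the standard timeit module. The `candidates` dictionary and the call count are choices made here, the first pattern is written as an equivalent raw string, and the absolute numbers will differ between machines and Python versions.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# One zero-argument callable per approach being compared.
candidates = {
    "re.split": lambda: [p for p in re.split(r"""( |".*?"|'.*?')""", line) if p.strip()],
    "re.findall": lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    "csv.reader": lambda: list(csv.reader([line], delimiter=" ")),
    "shlex.split": lambda: shlex.split(line),
}

number = 10000  # calls per measurement; an arbitrary choice
timings = {}
for name, fn in candidates.items():
    # timeit returns total seconds for `number` calls; convert to µs per call.
    timings[name] = timeit.timeit(fn, number=number) / number * 1e6

for name, usec in timings.items():
    print("%-12s %.2f us per call" % (name, usec))
```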


Ton*_*vel 6

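The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and it gives slightly unexpected results in some corner cases.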
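A minimal sketch of the kind of corner case meant here, using default (POSIX mode) shlex.split. The variable name `unbalanced_error` is only illustrative.

```python
import shlex

# A backslash outside quotes is treated as an escape character and consumed.
print(shlex.split(r"abc \s def"))   # ['abc', 's', 'def']

# Quotes in the middle of a word are removed and the pieces are joined.
print(shlex.split("abc'def' ghi"))  # ['abcdef', 'ghi']

# An unbalanced quote raises ValueError instead of being kept literally.
unbalanced_error = None
try:
    shlex.split("abc'def")
except ValueError as exc:
    unbalanced_error = str(exc)
print(unbalanced_error)  # No closing quotation
```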

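I have the following use case, where I need a split function that splits input strings such that single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output: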

 input string        | expected output
================================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]
 'abc"def ghi"'      | ['abc"def ghi"']
 'abc"def ghi""jkl"' | ['abc"def ghi""jkl"']
 'a"b c"d"e"f"g h"'  | ['a"b c"d"e"f"g h"']
 'c="ls /" type key' | ['c="ls /"', 'type', 'key']
 "abc'def ghi'"      | ["abc'def ghi'"]
 "c='ls /' type key" | ["c='ls /'", 'type', 'key']

I ended up with the following function, which splits a string so that every input string above produces the expected output:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

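It's not pretty, but it works. The following test application checks the results of the other approaches (shlex and csv for now) and of the custom split implementation: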

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])
    test_case_fn(fn, 'abc"def ghi"', ['abc"def ghi"'])
    test_case_fn(fn, 'abc"def ghi""jkl"', ['abc"def ghi""jkl"'])
    test_case_fn(fn, 'a"b c"d"e"f"g h"', ['a"b c"d"e"f"g h"'])
    test_case_fn(fn, 'c="ls /" type key', ['c="ls /"', 'type', 'key'])
    test_case_fn(fn, "abc'def ghi'", ["abc'def ghi'"])
    test_case_fn(fn, "c='ls /' type key", ["c='ls /'", 'type', 'key'])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']
[FAIL] abc"def ghi" -> ['abcdef ghi']
[FAIL] abc"def ghi""jkl" -> ['abcdef ghijkl']
[FAIL] a"b c"d"e"f"g h" -> ['ab cdefg h']
[FAIL] c="ls /" type key -> ['c=ls /', 'type', 'key']
[FAIL] abc'def ghi' -> ['abcdef ghi']
[FAIL] c='ls /' type key -> ['c=ls /', 'type', 'key']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]
[FAIL] abc"def ghi" -> ['abc"def', 'ghi"']
[FAIL] abc"def ghi""jkl" -> ['abc"def', 'ghi""jkl"']
[FAIL] a"b c"d"e"f"g h" -> ['a"b', 'c"d"e"f"g', 'h"']
[FAIL] c="ls /" type key -> ['c="ls', '/"', 'type', 'key']
[FAIL] abc'def ghi' -> ["abc'def", "ghi'"]
[FAIL] c='ls /' type key -> ["c='ls", "/'", 'type', 'key']

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]
[ OK ] abc"def ghi" -> ['abc"def ghi"']
[ OK ] abc"def ghi""jkl" -> ['abc"def ghi""jkl"']
[ OK ] a"b c"d"e"f"g h" -> ['a"b c"d"e"f"g h"']
[ OK ] c="ls /" type key -> ['c="ls /"', 'type', 'key']
[ OK ] abc'def ghi' -> ["abc'def ghi'"]
[ OK ] c='ls /' type key -> ["c='ls /'", 'type', 'key']

shlex: 0.335ms per iteration
csv: 0.036ms per iteration
re: 0.068ms per iteration

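So the performance is much better than that of shlex, and it can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.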
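A sketch of what the precompiled variant might look like. The names `_TOKEN_RE`, `_strip_quotes` and `quoted_split_compiled` are introduced here and are not from the answer, while the pattern itself is copied from quoted_split. Python's re module already caches recently used patterns, so the actual gain is an assumption that should be measured.

```python
import re

# Hypothetical module-level constant; the pattern is the one from quoted_split,
# split over three lines (double-quoted, single-quoted, plain tokens).
_TOKEN_RE = re.compile(
    r'(?:[^"\s]*"(?:\\.|[^"])*"[^"\s]*)+'
    r'|(?:[^\'\s]*\'(?:\\.|[^\'])*\'[^\'\s]*)+'
    r'|[^\s]+'
)

def _strip_quotes(token):
    # Remove one pair of matching outer quotes, if present.
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split_compiled(s):
    # Same post-processing as quoted_split: strip outer quotes, unescape quotes.
    return [
        _strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
        for p in _TOKEN_RE.findall(s)
    ]

print(quoted_split_compiled('"abc def" ghi'))      # ['abc def', 'ghi']
print(quoted_split_compiled('c="ls /" type key'))  # ['c="ls /"', 'type', 'key']
```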


hoc*_*chl 5

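It seems that re is the faster option when performance matters. Here is my solution, which uses a non-greedy operator and preserves the outer quotes: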

re.findall(r'(?:".*?"|\S)+', s)

Result:

['this', 'is', '"a test"']

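Constructs like aaa"bla blub"bbb are kept together, since these tokens are not separated by spaces. If the string contains escaped characters, you can match them like this: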

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Please note that this also matches the empty string "" by means of the \S part of the pattern.
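A small sketch of both behaviours described in this answer, with the two patterns written as raw strings. The names `simple` and `escape_aware` and the sample inputs are made up for illustration.

```python
import re

# Simple pattern: quoted runs and non-space characters are glued together.
simple = re.compile(r'(?:".*?"|\S)+')

# Escape-aware pattern: the closing quote must not follow a backslash.
escape_aware = re.compile(r'(?:".*?[^\\]"|\S)+')

glued = simple.findall('aaa"bla blub"bbb ccc')
empty_quotes = escape_aware.findall('a "" b')

print(glued)         # ['aaa"bla blub"bbb', 'ccc']
print(empty_quotes)  # ['a', '""', 'b']
```

In the second case the quoted alternative needs at least one character between the quotes, so the two quote characters are each picked up by \S instead.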