Python: split a string on multiple delimiters following a hierarchy

Ank*_*Ank 23 python regex string

I want to split a string only once, using several delimiters ("and", "&" and "-") in that order of priority. Examples:

'121 34 adsfd' -> ['121 34 adsfd']
'dsfsd and adfd' -> ['dsfsd ', ' adfd']
'dsfsd & adfd' -> ['dsfsd ', ' adfd']
'dsfsd - adfd' -> ['dsfsd ', ' adfd']
'dsfsd and adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
'dsfsd and adfd - adsfa' -> ['dsfsd ', ' adfd - adsfa']
'dsfsd - adfd and adsfa' -> ['dsfsd - adfd ', ' adsfa']

I tried the following code to achieve this:

import re
re.split('and|&|-', string, maxsplit=1)

It works for every case except the last one. Because it does not follow the hierarchy, it returns this for the last case:

'dsfsd - adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']

How can I achieve this?
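The behaviour described in the question comes from how alternation is matched: `re.split` splits at the leftmost position where any alternative matches, and the order of the alternatives only matters when several of them match at the same position. A minimal sketch to illustrate this (the variable names are only for illustration):

```python
import re

s = 'dsfsd - adfd and adsfa'

# The regex engine scans left to right and splits at the earliest position
# where any alternative matches, so '-' wins even though 'and' is listed first.
leftmost = re.split('and|&|-', s, maxsplit=1)
print(leftmost)    # ['dsfsd ', ' adfd and adsfa']

# Reordering the alternatives does not change the result.
reordered = re.split('-|&|and', s, maxsplit=1)
print(reordered)   # ['dsfsd ', ' adfd and adsfa']

# The desired result: split on 'and' because it has the highest priority.
wanted = s.split('and', 1)
print(wanted)      # ['dsfsd - adfd ', ' adsfa']
```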

Pra*_*adi 35

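This is impractical with a single regex. You could make it work with negative lookaround assertions, but it gets considerably more complicated with each additional delimiter. Doing it with plain old str.split() and a few lines of code is very simple. All you have to do is check whether splitting on the current delimiter gives you two elements. If it does, that is your answer. If not, move on to the next delimiter: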

def split_new(inp, delims):
    # Try the delimiters in priority order; the first one that splits wins
    for d in delims:
        result = inp.split(d, maxsplit=1)
        if len(result) == 2:
            return result

    return [inp]  # If nothing worked, return the input

To test this:

teststrs = ['121 34 adsfd' , 'dsfsd and adfd', 'dsfsd & adfd' , 'dsfsd - adfd' , 'dsfsd and adfd and adsfa' , 'dsfsd and adfd - adsfa' , 'dsfsd - adfd and adsfa' ]
for t in teststrs:
    print(repr(t), '->', split_new(t, ['and', '&', '-']))

Output:

'121 34 adsfd' -> ['121 34 adsfd']
'dsfsd and adfd' -> ['dsfsd ', ' adfd']
'dsfsd & adfd' -> ['dsfsd ', ' adfd']
'dsfsd - adfd' -> ['dsfsd ', ' adfd']
'dsfsd and adfd and adsfa' -> ['dsfsd ', ' adfd and adsfa']
'dsfsd and adfd - adsfa' -> ['dsfsd ', ' adfd - adsfa']
'dsfsd - adfd and adsfa' -> ['dsfsd - adfd ', ' adsfa']

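  • Simple, readable, and easy to extend with more delimiters. (13 upvotes)
  • This. Much better than the regex in the accepted answer, which will make you hate yourself if you have to modify it a year from now. (8 upvotes)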
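If the caller also needs to know which delimiter was used, the same loop can be written with `str.partition`, which returns the separator along with both halves. This is only a sketch of a variant, and the name `split_with_delim` is hypothetical rather than part of the answer:

```python
def split_with_delim(text, delims):
    """Return (parts, delimiter) for the highest-priority delimiter found."""
    for d in delims:
        head, sep, tail = text.partition(d)
        if sep:  # partition returns an empty separator when d is absent
            return [head, tail], d
    return [text], None  # no delimiter present


print(split_with_delim('dsfsd - adfd and adsfa', ['and', '&', '-']))
# (['dsfsd - adfd ', ' adsfa'], 'and')
```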

And*_*ely 23

Try:

import re

tests = [
    ["121 34 adsfd", ["121 34 adsfd"]],
    ["dsfsd and adfd", ["dsfsd ", " adfd"]],
    ["dsfsd & adfd", ["dsfsd ", " adfd"]],
    ["dsfsd - adfd", ["dsfsd ", " adfd"]],
    ["dsfsd and adfd and adsfa", ["dsfsd ", " adfd and adsfa"]],
    ["dsfsd and adfd - adsfa", ["dsfsd ", " adfd - adsfa"]],
    ["dsfsd - adfd and adsfa", ["dsfsd - adfd ", " adsfa"]],
]

for s, result in tests:
    res = re.split(r"and|&(?!.*and)|-(?!.*and|.*&)", s, maxsplit=1)
    print(res)
    assert res == result

Prints:

['121 34 adsfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd and adsfa']
['dsfsd ', ' adfd - adsfa']
['dsfsd - adfd ', ' adsfa']

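Explanation:

The regex and|&(?!.*and)|-(?!.*and|.*&) uses 3 alternatives.

  1. We always match and, or:
  2. We match & only if there is no and ahead of it (using the negative lookahead (?! )), or:
  3. We match - only if there is no and or & ahead of it.

We use this pattern in re.split with maxsplit=1, so the split happens only at the first match.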
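The three alternatives above follow a regular scheme: each delimiter gets a negative lookahead listing every higher-priority delimiter. A minimal sketch of generating such a pattern from a priority list is shown below. The helper name `build_priority_pattern` is hypothetical, and `re.DOTALL` is an addition so that the lookahead also sees past newlines:

```python
import re


def build_priority_pattern(delims):
    """Build one alternation in which each delimiter matches only if no
    higher-priority delimiter occurs later in the string."""
    alternatives = []
    for i, d in enumerate(delims):
        alt = re.escape(d)
        higher = delims[:i]  # delimiters that outrank d
        if higher:
            ahead = '|'.join(re.escape(h) for h in higher)
            alt += f'(?!.*(?:{ahead}))'
        alternatives.append(alt)
    return re.compile('|'.join(alternatives), re.DOTALL)


pattern = build_priority_pattern(['and', '&', '-'])
print(pattern.split('dsfsd - adfd and adsfa', maxsplit=1))
# ['dsfsd - adfd ', ' adsfa']
```

Only a lookahead is needed because the split happens at the leftmost match: a higher-priority delimiter that occurs earlier in the string is reached first anyway.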

  • The regex used in the loop should be compiled before the loop. Total time would drop by about 25%. (3 upvotes)
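A sketch of what that comment suggests, using the same pattern as the answer. The name `SPLIT_RE` is hypothetical, and the speed-up figure is the commenter's claim, not something measured here:

```python
import re

# Compile once, outside the loop, and reuse the compiled pattern object.
SPLIT_RE = re.compile(r"and|&(?!.*and)|-(?!.*and|.*&)")

results = []
for s in ["dsfsd and adfd - adsfa", "dsfsd - adfd and adsfa"]:
    results.append(SPLIT_RE.split(s, maxsplit=1))

print(results)
# [['dsfsd ', ' adfd - adsfa'], ['dsfsd - adfd ', ' adsfa']]
```

The `re` module keeps an internal cache of recently compiled patterns, so the gain from compiling up front comes mainly from skipping that cache lookup on every call.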

Aja*_*234 5

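You can keep the delimiters in a list ordered by priority. Then you can use re.findall together with re.split: re.findall collects every delimiter present in the string, and the split is done only on the found delimiter that ranks first in the ops list: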

import re
def split_order(s):
   r, ops = re.findall(r'(?<=\s)and(?=\s)|\&|\-', s), ['and', '&', '-']
   m = -1 if not r else min([ops.index(i) for i in r])
   a, *b = re.split('|'.join(l:=[i for i in r if ops.index(i) == m]), s)
   return [s] if not l else ([a] if not b else [a, s[len(a)+len(l[0]):]])


vals = ['121 34 adsfd' , 'dsfsd and adfd', 'dsfsd & adfd' , 'dsfsd - adfd' , 'dsfsd and adfd and adsfa' , 'dsfsd and adfd - adsfa' , 'dsfsd - adfd and adsfa' ]
for i in vals:
   print(split_order(i))

Output:

['121 34 adsfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd and adsfa']
['dsfsd ', ' adfd - adsfa']
['dsfsd - adfd ', ' adsfa']
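One caveat worth keeping in mind: a plain 'and' delimiter also matches inside words such as 'sand', which is what the whitespace lookarounds in the findall pattern guard against. A minimal sketch that combines the priority loop with per-delimiter regexes is shown below, assuming that whole-word matching of 'and' is what is wanted; the name `split_by_priority` and the use of `\b` are illustrative choices and do not come from any of the answers:

```python
import re


def split_by_priority(text, patterns):
    """Split once on the first pattern (in priority order) that matches."""
    for p in patterns:
        parts = re.split(p, text, maxsplit=1)
        if len(parts) == 2:
            return parts
    return [text]


# \band\b matches 'and' only as a whole word, so 'sand' is left alone.
patterns = [r'\band\b', r'&', r'-']

print(split_by_priority('sand - castle', patterns))
# ['sand ', ' castle']
print(split_by_priority('sand - castle and moat', patterns))
# ['sand - castle ', ' moat']
```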