How to split a string on whitespace and keep the offsets and lengths of the words

xor*_*yst 12 python string

I need to split a string into words, and also get the starting and ending offset of each word. So, for example, if the input string is:

input_string = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"

I want to get:

[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
 ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]

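I have some working code that does this using input_string.split and calls to .index, but it is slow. I also tried writing it by manually iterating over the string, but that was slower still. Does anyone have a fast algorithm for this?

Here are my two versions: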

def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))

    return offsets

def manual_iteration(line):
    start = 0
    offsets = []
    word = ''
    for off, char in enumerate(line + ' '):
        if char in ' \t\r\n':
            if off > start:
                offsets.append((word, start, off - 1))
            start = off + 1
            word = ''
        else:
            word += char

    return offsets

Measured with timeit, "using_split" is the fastest, followed by "manual_iteration", and the slowest so far is re.finditer, as suggested in the answers below.
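
A comparison like this can be reproduced with a small timeit harness. What follows is only a sketch: the name `using_finditer`, the repeated test line, and the repetition count are assumptions made for illustration, and the absolute timings will vary with the Python version and the input.

```python
import re
import timeit

def using_split(line):
    offsets = []
    running_offset = 0
    for word in line.split():
        word_offset = line.index(word, running_offset)
        running_offset = word_offset + len(word)
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

def manual_iteration(line):
    start = 0
    offsets = []
    word = ''
    for off, char in enumerate(line + ' '):
        if char in ' \t\r\n':
            if off > start:
                offsets.append((word, start, off - 1))
            start = off + 1
            word = ''
        else:
            word += char
    return offsets

def using_finditer(line):
    # Hypothetical name for the re.finditer variant.
    return [(m.group(0), m.start(), m.end() - 1)
            for m in re.finditer(r'\S+', line)]

base = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
# Assumed workload: the sample line repeated 100 times, separated by spaces.
line = " ".join([base] * 100)

expected = using_split(line)
timings = {}
for func in (using_split, manual_iteration, using_finditer):
    # Only compare speed once the results are known to agree.
    assert func(line) == expected
    timings[func.__name__] = timeit.timeit(lambda: func(line), number=200)

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:18s} {seconds:.4f}s")
```

Checking that all variants return the same list before timing them guards against comparing functions that do different work.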

NPE*_*NPE 20

The following will do it:

import re
s = 'ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE'
ret = [(m.group(0), m.start(), m.end() - 1) for m in re.finditer(r'\S+', s)]
print(ret)

This produces:

[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
 ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
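
Note that m.end() is exclusive, which is why 1 is subtracted to get the inclusive end offset the question asks for. A minimal sketch of what that means when slicing; the variable name `spans` is an assumption introduced here:

```python
import re

s = 'ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE'
ret = [(m.group(0), m.start(), m.end() - 1) for m in re.finditer(r'\S+', s)]

# The end offsets are inclusive, so add 1 to slice the word back out.
for word, start, end in ret:
    assert s[start:end + 1] == word

# If half-open offsets are acceptable, m.span() can be used directly.
spans = [(m.group(0), *m.span()) for m in re.finditer(r'\S+', s)]
print(spans[0])  # ('ONE', 0, 3)
```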

  • Nice, elegant answer. It turned out to be slower, though :( (2 upvotes)

aqu*_*tae 9

The following runs slightly faster, saving about 30%. All I did was bind the functions and methods to local names up front:

def using_split2(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append((word, word_offset, running_offset - 1))
    return offsets
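
The effect of binding a method to a local name can be isolated in a micro-benchmark. This is a rough sketch, not a measurement of using_split2 itself: the function names `append_via_attribute` and `append_via_local` and the loop size are assumptions, and the size of the gain depends on the interpreter version, so it may differ from the roughly 30% reported above.

```python
import timeit

def append_via_attribute(n):
    out = []
    for i in range(n):
        out.append(i)      # method looked up on every iteration
    return out

def append_via_local(n):
    out = []
    append = out.append    # bound method looked up once
    for i in range(n):
        append(i)
    return out

n = 100_000
# Both variants must build the same list before their speed is compared.
assert append_via_attribute(n) == append_via_local(n)

t_attr = timeit.timeit(lambda: append_via_attribute(n), number=20)
t_local = timeit.timeit(lambda: append_via_local(n), number=20)
print(f"attribute lookup: {t_attr:.3f}s, local name: {t_local:.3f}s")
```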


Fre*_*Foo 7

import re

def split_span(s):
    # Yield (word, start, inclusive end) for each run of non-whitespace.
    for match in re.finditer(r"\S+", s):
        span = match.span()
        yield match.group(0), span[0], span[1] - 1
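
Because split_span is a generator, it can be consumed lazily or collected into a list. A short usage sketch, with the sample string taken from the question and the names `first_two` and `all_spans` introduced here for illustration:

```python
import re
from itertools import islice

def split_span(s):
    for match in re.finditer(r"\S+", s):
        start, end = match.span()
        yield match.group(0), start, end - 1

s = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"

# Take only the first two words without scanning the rest of the string.
first_two = list(islice(split_span(s), 2))
print(first_two)  # [('ONE', 0, 2), ('ONE', 5, 7)]

# Or materialize everything when the full list is needed.
all_spans = list(split_span(s))
print(len(all_spans))  # 9
```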