I need to split a string into words and also get the start and end offset of each word. For example, if the input string is:

    input_string = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
I would like to get:

    [('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
     ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
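I have some code that does this using input_string.split and calls to .index, but it is slow. I also tried writing it by iterating over the string manually, but that was slower still. Does anyone have a fast algorithm for this?

Here are my two versions: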

    def using_split(line):
        # Split on whitespace, then locate each word with str.index,
        # starting the search just past the previous word.
        words = line.split()
        offsets = []
        running_offset = 0
        for word in words:
            word_offset = line.index(word, running_offset)
            word_len = len(word)
            running_offset = word_offset + word_len
            # End offset is inclusive, hence the - 1.
            offsets.append((word, word_offset, running_offset - 1))
        return offsets

    def manual_iteration(line):
        start = 0
        offsets = []
        word = ''
        # The appended space acts as a sentinel that flushes the last word.
        for off, char in enumerate(line + ' '):
            if char in ' \t\r\n':
                if off > start:
                    offsets.append((word, start, off - 1))
                start = off + 1
                word = ''
            else:
                word += char
        return offsets

According to timeit, "using_split" is the fastest, followed by "manual_iteration", and by far the slowest is re.finditer, as suggested below.
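
The comparison described above could be reproduced with something like the following sketch. The sample input, the `benchmark` helper, the `using_finditer` name, and the repeat counts are all assumptions made for illustration; absolute timings will vary with the Python version and the shape of the input, so treat the numbers as indicative only.

```python
import re
import timeit

def using_split(line):
    offsets = []
    running_offset = 0
    for word in line.split():
        word_offset = line.index(word, running_offset)
        running_offset = word_offset + len(word)
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

def using_finditer(line):
    # Hypothetical name for the regex-based approach.
    return [(m.group(0), m.start(), m.end() - 1)
            for m in re.finditer(r'\S+', line)]

# Assumed test input: the sample string plus a trailing space, repeated.
sample = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE " * 100

def benchmark(func, line, number=200):
    # Take the best of three runs to reduce noise from other processes.
    return min(timeit.repeat(lambda: func(line), number=number, repeat=3))

timings = {f.__name__: benchmark(f, sample)
           for f in (using_split, using_finditer)}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Before comparing speeds it is worth confirming that both functions return identical results on the same input.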
NPE*_*NPE 20
The following will do it:

    import re

    s = 'ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE'
    # m.end() is exclusive, so subtract 1 to get an inclusive end offset.
    ret = [(m.group(0), m.start(), m.end() - 1) for m in re.finditer(r'\S+', s)]
    print(ret)

This produces:

    [('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
     ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
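
Note that the end offsets in this output are inclusive, whereas Python slices exclude their end index, so recovering a word from its offsets needs `end + 1`. A minimal sketch to check this, assuming the sample string has the spacing implied by the offsets (the helper name `word_offsets` is only for illustration):

```python
import re

def word_offsets(s):
    """Return (word, start, inclusive_end) for each run of non-whitespace."""
    return [(m.group(0), m.start(), m.end() - 1)
            for m in re.finditer(r'\S+', s)]

s = 'ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE'
offsets = word_offsets(s)

for word, start, end in offsets:
    # The end offset is inclusive, so the slice has to stop at end + 1.
    assert s[start:end + 1] == word

print(offsets[0], offsets[-1])
```

If exclusive end offsets are more convenient for the caller, `m.span()` can be used directly without the `- 1` adjustment.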
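
The following runs slightly faster, saving about 30%. All I did was bind the functions to local names up front, so the loop avoids repeated global and attribute lookups: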

    def using_split2(line, _len=len):
        words = line.split()
        index = line.index
        offsets = []
        append = offsets.append
        running_offset = 0
        for word in words:
            word_offset = index(word, running_offset)
            word_len = _len(word)
            running_offset = word_offset + word_len
            append((word, word_offset, running_offset - 1))
        return offsets

    import re

    def split_span(s):
        # Generator version: yields (word, start, inclusive_end) lazily.
        for match in re.finditer(r"\S+", s):
            span = match.span()
            yield match.group(0), span[0], span[1] - 1
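
Because `split_span` is a generator, it yields words lazily, which can help when only the first few words of a long string are needed. The sketch below assumes a precompiled pattern; the names `_WORD` and `first_words` are illustrative and not part of the answer. Whether precompiling is measurably faster depends on the workload, since the `re` module already caches compiled patterns internally.

```python
import re
from itertools import islice

_WORD = re.compile(r"\S+")  # hypothetical module-level pattern

def split_span(s):
    """Yield (word, start, inclusive_end) for each run of non-whitespace."""
    for match in _WORD.finditer(s):
        start, end = match.span()
        yield match.group(0), start, end - 1

def first_words(s, n):
    # islice stops the generator after n items, so the rest of the
    # string is never scanned.
    return list(islice(split_span(s), n))

print(first_words("alpha  beta\tgamma delta", 2))
# prints [('alpha', 0, 4), ('beta', 7, 10)]
```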