Python string search efficiency

sho*_*app 5 python performance

For a very large string (spanning multiple lines), is it faster to use Python's built-in string search, or to split the large string (perhaps on \n) and search each of the smaller strings in a loop?

For example, for a very large string:

for l in get_mother_of_all_strings().split('\n'):
    if 'target' in l:
        return True
return False

or

return 'target' in get_mother_of_all_strings()

Gar*_*Jax 13

Almost certainly the second. Searching in one big string versus many small strings makes little difference by itself: because the lines are shorter you may skip a few characters, but the split operation has its own costs (scanning for \n, creating all the new string objects, building the list), and the loop then runs in Python.
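Purely as an illustration of that overhead (the filler text and sizes below are made up), a minimal sketch that times just the split against just the substring test on the same in-memory string might look like this:

import timeit

# Hypothetical setup: a large multi-line string built in memory,
# used only to isolate the cost of split() from the cost of searching.
setup = "text = 'some filler words here\\n' * 200000"

split_only = timeit.timeit("text.split('\\n')", setup=setup, number=100)
search_only = timeit.timeit("'target' in text", setup=setup, number=100)

print("split() alone:   %.3f s" % split_only)
print("substring check: %.3f s" % search_only)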

The string __contains__ method is implemented in C, so it is significantly faster.
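As a small illustration (the sample string below is made up), the in operator on strings simply dispatches to str.__contains__, so the whole scan happens in C:

s = "a fairly long string with a target hidden somewhere in the middle"

print('target' in s)             # the in operator on strings...
print(s.__contains__('target'))  # ...dispatches to str.__contains__, implemented in C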

Also consider that the second method stops as soon as the first match is found, while the first method splits the entire string before it even starts searching in it.
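A minimal sketch of that difference, with a made-up string where the match sits on the very first line:

text = "the target is on the very first line\n" + "filler line\n" * 1000000

lines = text.split('\n')     # ~1,000,001 line objects are created up front
print('target' in lines[0])  # True, but only after the whole split has run

print('target' in text)      # True, and the scan stops at the first match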

This is quickly demonstrated with a simple benchmark:

import timeit

prepare = """
with open('bible.txt') as fh:
    text = fh.read()
"""

presplit_prepare = """
with open('bible.txt') as fh:
    text = fh.read()
lines = text.split('\\n')
"""

longsearch = """
'hello' in text
"""

splitsearch = """
for line in text.split('\\n'):
    if 'hello' in line:
        break
"""

presplitsearch = """
for line in lines:
    if 'hello' in line:
        break
"""


benchmark = timeit.Timer(longsearch, prepare)
print("IN on big string takes:", benchmark.timeit(1000), "seconds")

benchmark = timeit.Timer(splitsearch, prepare)
print("IN on splitted string takes:", benchmark.timeit(1000), "seconds")

benchmark = timeit.Timer(presplitsearch, presplit_prepare)
print("IN on pre-splitted string takes:", benchmark.timeit(1000), "seconds")

The results are:

IN on big string takes: 4.27126097679 seconds
IN on splitted string takes: 35.9622690678 seconds
IN on pre-splitted string takes: 11.815297842 seconds

The bible.txt file is actually the Bible; I found it here: http://patriot.net/~bmcgin/kjvpage.html (the text version).