I need to search a very large text file for a specific string. It is a build log with about 5000 lines of text. What is the best way to do this? Using a regular expression shouldn't cause any problems, should it? I was going to keep reading blocks of lines and use a simple find.
eum*_*iro 48
If it is a "really big" file, iterate over the lines sequentially and do not read the whole file into memory:
with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            # do_something
            pass
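If you also want to report where in the build log the hit occurred, a small variation of the same loop does it; this is just a sketch with the same placeholder names as above:

with open('largeFile', 'r') as inF:
    for line_no, line in enumerate(inF, 1):   # count lines, starting at 1
        if 'myString' in line:
            print('%d: %s' % (line_no, line.rstrip()))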
Jos*_*shD 15
You could do a simple find:
f = open('file.txt', 'r')
lines = f.read()
answer = lines.find('string')
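One caveat: find() returns a character offset, not a line number. If you need the line number too, you can count the newlines in front of the hit; a small sketch reusing the names above:

if answer != -1:
    line_no = lines.count('\n', 0, answer) + 1   # newlines before the match, 1-based
    print('found at line %d' % line_no)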
If you can get away with it, a simple find will be much faster than a regex.
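A rough way to check that claim on your own data is timeit; this sketch assumes the same file and search string used above:

import timeit

setup = "text = open('file.txt').read()"
print(timeit.timeit("text.find('string')", setup=setup, number=100))
print(timeit.timeit("re.search('string', text)", setup='import re; ' + setup, number=100))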
lau*_*sia 13
The following function works for text files and binary files (it only returns a position as a byte count, though). It has the benefit of finding strings even when they straddle a line or buffer boundary, where a line-wise or buffer-wise search would miss them.
import os

def fnd(fname, s, start=0):
    with open(fname, 'rb') as f:
        fsize = os.path.getsize(fname)
        bsize = 4096            # read in 4 KiB chunks
        buffer = None
        if start > 0:
            f.seek(start)
        overlap = len(s) - 1    # re-read this many bytes so matches spanning chunks are found
        while True:
            if (f.tell() >= overlap and f.tell() < fsize):
                f.seek(f.tell() - overlap)
            buffer = f.read(bsize)
            if buffer:
                pos = buffer.find(s)
                if pos >= 0:
                    # translate the position inside the buffer into a file offset
                    return f.tell() - (len(buffer) - pos)
            else:
                return -1
The idea behind this is to read the file in buffer-sized chunks and, before each read, seek back by len(s) - 1 bytes, so that a string straddling two chunks still ends up inside a single buffer.
I use something like this to find file signatures inside larger ISO9660 files; it is quite fast and does not use much memory. You can also use a larger buffer to speed things up.
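For example, a minimal usage sketch (the ISO file name and the 'CD001' signature below are just placeholders) that walks over every occurrence by restarting the search just past the previous hit:

pos = fnd('image.iso', b'CD001')               # first occurrence, as a byte offset
while pos != -1:
    print('found at byte offset %d' % pos)
    pos = fnd('image.iso', b'CD001', pos + 5)  # resume just past this match (5 = len of the signature)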
I have had a go at putting together a multiprocessing example of file text searching. This is my first time using the multiprocessing module, and I am a Python n00b. Comments are quite welcome. I will have to wait until I am at work to test it on really big files. It should be faster on multi-core systems than single-core searching. Bleagh! How do I stop the processes once the text has been found and reliably report the line number?
import multiprocessing, os, time

NUMBER_OF_PROCESSES = multiprocessing.cpu_count()

def FindText( host, file_name, text):
    file_size = os.stat(file_name ).st_size
    m1 = open(file_name, "r")
    #work out file size to divide up to farm out line counting
    chunk = (file_size / NUMBER_OF_PROCESSES ) + 1
    lines = 0
    line_found_at = -1
    seekStart = chunk * (host)
    seekEnd = chunk * (host+1)
    if seekEnd > file_size:
        seekEnd = file_size
    if host > 0:
        m1.seek( seekStart )
        m1.readline()
    line = m1.readline()
    while len(line) > 0:
        lines += 1
        if text in line:
            #found the line
            line_found_at = lines
            break
        if m1.tell() > seekEnd or len(line) == 0:
            break
        line = m1.readline()
    m1.close()
    return host,lines,line_found_at

# Function run by worker processes
def worker(input, output):
    for host,file_name,text in iter(input.get, 'STOP'):
        output.put(FindText( host,file_name,text ))

def main(file_name,text):
    t_start = time.time()
    # Create queues
    task_queue = multiprocessing.Queue()
    done_queue = multiprocessing.Queue()
    #submit file to open and text to find
    print 'Starting', NUMBER_OF_PROCESSES, 'searching workers'
    for h in range( NUMBER_OF_PROCESSES ):
        t = (h,file_name,text)
        task_queue.put(t)
    #Start worker processes
    for _i in range(NUMBER_OF_PROCESSES):
        multiprocessing.Process(target=worker, args=(task_queue, done_queue)).start()
    # Get and print results
    results = {}
    for _i in range(NUMBER_OF_PROCESSES):
        host,lines,line_found = done_queue.get()
        results[host] = (lines,line_found)
    # Tell child processes to stop
    for _i in range(NUMBER_OF_PROCESSES):
        task_queue.put('STOP')
        # print "Stopping Process #%s" % i
    total_lines = 0
    for h in range(NUMBER_OF_PROCESSES):
        if results[h][1] > -1:
            print text, 'Found at', total_lines + results[h][1], 'in', time.time() - t_start, 'seconds'
            break
        total_lines += results[h][0]

if __name__ == "__main__":
    main( file_name = 'testFile.txt', text = 'IPI1520' )