Sim*_*pom 13 python python-2.7
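I have large log files (from 100 MB to 2 GB) that contain a (single) specific line I need to parse in a Python program. I have to parse around 20,000 files. I know that the line I am looking for is within the last 200 lines of the file, or within the last 15,000 bytes.
Since this is a recurring task, I need it to run as fast as possible. What is the fastest way to get that line?
I considered 4 strategies: read the whole file in Python and apply a regex (method_1); read only the last 15,000 bytes of the file and apply a regex (method_2); call grep on the entire file (method_3); call tail on the last 200 lines and pipe the result to grep (method_4).
Here are the functions I created to test these strategies: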
import os
import re
import subprocess


def method_1(filename):
    """Method 1: read whole file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        txt = f.read()
    match = re.search(regex, txt)
    if match:
        print match.group()


def method_2(filename):
    """Method 2: read part of the file and regex"""
    regex = r'\(TEMPS CP :[ ]*.*S\)'
    with open(filename, 'r') as f:
        # Read at most the last 15000 bytes (less if the file is smaller).
        size = min(15000, os.stat(filename).st_size)
        f.seek(-size, os.SEEK_END)
        txt = f.read(size)
        match = re.search(regex, txt)
        if match:
            print match.group()


def method_3(filename):
    """Method 3: grep the entire file"""
    cmd = 'grep "(TEMPS CP :" {} | head -n 1'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    # [:-1] drops the trailing newline from the command output.
    print process.communicate()[0][:-1]


def method_4(filename):
    """Method 4: tail of the file and grep"""
    cmd = 'tail -n 200 {} | grep "(TEMPS CP :"'.format(filename)
    process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
    print process.communicate()[0][:-1]
I ran these methods on two files ("trace" is 207 MB and "trace_big" is 1.9 GB) and got the following computation times (in seconds):
+----------+-----------+-----------+
|          | trace     | trace_big |
+----------+-----------+-----------+
| method_1 | 2.89E-001 | 2.63      |
| method_2 | 5.71E-004 | 5.01E-004 |
| method_3 | 2.30E-001 | 1.97      |
| method_4 | 4.94E-003 | 5.06E-003 |
+----------+-----------+-----------+
So method_2 appears to be the fastest. But is there another solution I have not thought of?
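For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. It is not the harness used for the numbers above (the post does not show one). The names `time_call` and `count_bytes` are hypothetical, and `count_bytes` is only a stand-in for `method_1` through `method_4`. The code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

import os
import tempfile
import timeit


def time_call(func, filename, repeat=3):
    """Return the best wall-clock time (in seconds) of func(filename)."""
    timer = timeit.Timer(lambda: func(filename))
    # number=1: each measurement opens and scans the file exactly once.
    return min(timer.repeat(repeat=repeat, number=1))


def count_bytes(filename):
    """Hypothetical stand-in for one of the methods being compared."""
    with open(filename, 'rb') as f:
        return len(f.read())


# Small throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'x' * 100000)
    sample = tmp.name

elapsed = time_call(count_bytes, sample)
print('count_bytes: {:.2E} s'.format(elapsed))
os.remove(sample)
```

Keep in mind that repeated runs on the same file are usually served from the operating system's page cache, so timings taken this way reflect a warm cache rather than a first read from disk.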
In addition to the previous methods, Gosha F suggested a fifth method using mmap:
import contextlib
import mmap


def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    # mmap requires the offset to be a multiple of ALLOCATIONGRANULARITY,
    # so round it down to the nearest multiple.
    ag = mmap.ALLOCATIONGRANULARITY
    offset = ag * (offset // ag)
    with open(filename, 'r') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)
        with contextlib.closing(mm) as txt:
            match = regex.search(txt)
            if match:
                print match.group()
I tested it and got the following results:
+----------+-----------+-----------+
|          | trace     | trace_big |
+----------+-----------+-----------+
| method_5 | 2.50E-004 | 2.71E-004 |
+----------+-----------+-----------+
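The offset arithmetic in method_5 is easy to get wrong, so here is a small sketch of it in isolation. The helper name `aligned_window` is hypothetical. It assumes the goal is to map a region that starts on an allocation-granularity boundary and still covers the last `tail_bytes` of the file.

```python
import mmap


def aligned_window(file_size, tail_bytes=15000,
                   granularity=mmap.ALLOCATIONGRANULARITY):
    """Return (offset, skip) for mapping the tail of a file.

    offset: mapping start, rounded down to a multiple of granularity.
    skip:   number of bytes between offset and the start of the wanted tail.
    """
    wanted = max(0, file_size - tail_bytes)
    offset = (wanted // granularity) * granularity
    return offset, wanted - offset


# Example with an explicit 4096-byte granularity for readability.
print(aligned_window(20000, 15000, 4096))  # → (4096, 904)
```

Because the offset is rounded down, the mapping can be up to `granularity - 1` bytes longer than the requested tail. If that matters, `skip` can be passed as the `pos` argument of a compiled pattern's `search` method so that only the requested tail is scanned.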
You may also consider using memory mapping (the mmap module), like this:
import contextlib
import mmap
import os
import re


def method_5(filename):
    """Method 5: use memory mapping and regex"""
    regex = re.compile(r'\(TEMPS CP :[ ]*.*S\)')
    offset = max(0, os.stat(filename).st_size - 15000)
    # The offset passed to mmap must be a multiple of ALLOCATIONGRANULARITY.
    offset -= offset % mmap.ALLOCATIONGRANULARITY
    with open(filename, 'r') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY, offset=offset)) as txt:
            match = regex.search(txt)
            if match:
                print match.group()
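For comparison with the mmap version, here is a sketch of the plain "seek to the tail and search" approach (the idea behind method_2) written so that it runs on both Python 2.7 and Python 3. The function name `find_in_tail` and the module-level `TEMPS_CP` pattern are hypothetical. The file is opened in binary mode because Python 3 does not allow end-relative seeks with a nonzero offset on text-mode files, and the pattern is compiled once rather than on every call.

```python
from __future__ import print_function

import os
import re
import tempfile

# Compiled once at import time and reused for every file.
TEMPS_CP = re.compile(br'\(TEMPS CP :[ ]*.*S\)')


def find_in_tail(filename, tail_bytes=15000, pattern=TEMPS_CP):
    """Search only the last tail_bytes of filename.

    Returns the matched bytes, or None if the pattern is not found.
    """
    with open(filename, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Clamp to the start of the file when it is shorter than tail_bytes.
        f.seek(max(0, size - tail_bytes))
        match = pattern.search(f.read())
    return match.group() if match else None


# Throwaway sample file with the target line near the end.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'filler line\n' * 5000)
    tmp.write(b' (TEMPS CP :   12.5S)\n')
    tmp.write(b'last line\n')
    sample = tmp.name

result = find_in_tail(sample)
print(result)
os.remove(sample)
```

Returning the match instead of printing it makes the function easier to call in a loop over the 20,000 files and to test.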
Some side notes: