Most efficient way to search the last x lines of a file in python

Question

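I have a file and I don't know how big it's going to be (it could be quite large, but the size will vary greatly). I want to search the last 10 lines or so to see if any of them match a string. I need to do this as quickly and efficiently as possible and was wondering if there's anything better than: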

s = "foo"
last_bit = fileObj.readlines()[-10:]
for line in last_bit:
    # readlines() keeps the trailing newline, so strip it before comparing
    if line.rstrip("\n") == s:
        print("FOUND")

Answer 1

Dar*_*con 34

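Here's an answer like MizardX's, but without its apparent problem of taking quadratic time in the worst case from rescanning the working string repeatedly for newlines as chunks are added.

Compared to the activestate solution (which also seems to be quadratic), this doesn't blow up given an empty file, and it does one seek per block read instead of two.

Compared to spawning 'tail', this is self-contained. (But 'tail' is best if you have it.)

Compared to grabbing a few kB off the end and hoping it's enough, this works for any line length.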

import os

def reversed_lines(file):
    "Generate the lines of file in reverse order."
    part = ''
    for block in reversed_blocks(file):
        for c in reversed(block):
            if c == '\n' and part:
                yield part[::-1]
                part = ''
            part += c
    if part: yield part[::-1]

def reversed_blocks(file, blocksize=4096):
    "Generate blocks of file's contents in reverse order."
    file.seek(0, os.SEEK_END)
    here = file.tell()
    while 0 < here:
        delta = min(blocksize, here)
        here -= delta
        file.seek(here, os.SEEK_SET)
        yield file.read(delta)

To use it as requested:

from itertools import islice

def check_last_10_lines(file, key):
    for line in islice(reversed_lines(file), 10):
        if line.rstrip('\n') == key:
            print('FOUND')
            break

Edit: changed map() to itertools.imap() in head(). Edit 2: simplified reversed_blocks(). Edit 3: avoid rescanning the tail for newlines. Edit 4: rewrote reversed_lines() because str.splitlines() ignores a final '\n', as BrianB noticed (thanks).

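Note that in very old Python versions, the string concatenation in a loop here will take quadratic time. CPython from at least the last few years avoids this problem automatically.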
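
The generator above compares single characters against '\n', which fits Python 2 strings. As a rough sketch only, the same block-by-block idea could look like this in Python 3, assuming the file is opened in binary mode and lines end with b"\n". The name last_lines_contain and the blocksize parameter on reversed_lines are illustrative choices, not part of the answer; lines are yielded without their terminator.

```python
import os
from itertools import islice

def reversed_blocks(f, blocksize=4096):
    """Yield blocks of a binary file's contents, last block first."""
    f.seek(0, os.SEEK_END)
    here = f.tell()
    while here > 0:
        delta = min(blocksize, here)
        here -= delta
        f.seek(here, os.SEEK_SET)
        yield f.read(delta)

def reversed_lines(f, blocksize=4096):
    """Yield the lines of a binary file as bytes (no terminator), last line first."""
    fragments = []   # pieces of the line being assembled, rightmost piece first
    at_eof = True    # True until the text after the file's final newline is handled
    for block in reversed_blocks(f, blocksize):
        parts = block.split(b"\n")
        # Every part except the first is preceded by a newline in this block,
        # so it completes the line that is currently being assembled.
        for piece in reversed(parts[1:]):
            fragments.append(piece)
            line = b"".join(reversed(fragments))
            fragments = []
            if line or not at_eof:   # skip the empty tail after a final newline
                yield line
            at_eof = False
        fragments.append(parts[0])   # may continue into the previous block
    if fragments:
        yield b"".join(reversed(fragments))

def last_lines_contain(path, key, n=10):
    """Return True if one of the last n lines of the file equals key (bytes)."""
    with open(path, "rb") as f:
        return any(line == key for line in islice(reversed_lines(f), n))
```

Each block is split once and each line is joined once, so the work should stay roughly linear in the number of bytes actually read.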

Answer 2

Pab*_*loG 33

# Tail
from __future__ import with_statement

find_str = "FIREFOX"                    # String to find
fname = "g:/autoIt/ActiveWin.log_2"     # File to check

with open(fname, "r") as f:
    f.seek (0, 2)           # Seek @ EOF
    fsize = f.tell()        # Get Size
    f.seek (max (fsize-1024, 0), 0) # Set pos @ last n chars
    lines = f.readlines()       # Read to end

lines = lines[-10:]    # Get last 10 lines

# This prints True if any line is exactly find_str + "\n"
print(find_str + "\n" in lines)

# If you're searching for a substring
for line in lines:
    if find_str in line:
        print(True)
        break

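This fails if the file has very long lines. The code assumes the last 10 lines fall within the last 1k of data. It should check that there are at least 11 lines, or keep searching backward until that condition is true. (13 upvotes)
lines[:-10] drops the last 10 lines. What you want is lines[-10:]. (2 upvotes)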
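
One way to act on the first comment, as a minimal sketch rather than a definitive fix: keep stepping backward in fixed-size chunks until the window holds more than n newlines (or the start of the file is reached), so the first of the last n lines is known to be complete. The function name tail_lines and the chunk parameter are hypothetical, and the file is read in binary mode, so lines come back as bytes.

```python
import os

def tail_lines(path, n=10, chunk=1024):
    """Return the last n lines of the file at path as a list of bytes."""
    if n <= 0:
        return []
    blocks = []      # chunks read so far, last chunk of the file first
    newlines = 0     # newline count across all chunks read so far
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        # n + 1 newlines guarantee that the earliest wanted line is not cut off.
        while pos > 0 and newlines <= n:
            step = min(chunk, pos)
            pos -= step
            f.seek(pos)
            block = f.read(step)
            newlines += block.count(b"\n")
            blocks.append(block)
    data = b"".join(reversed(blocks))
    return data.splitlines()[-n:]
```

A call such as `b"FIREFOX" in tail_lines(fname)` would then play the role of the exact-match check above.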

Answer 3

Myr*_*rys 8

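If you are running Python on a POSIX system, you can use 'tail -10' to retrieve the last few lines. This may be faster than writing your own Python code to get the last 10 lines. Rather than opening the file directly, open a pipe from the command 'tail -10 filename'. If you are certain of the log output, though (for example, you know that there are never any very long lines that are hundreds or thousands of characters long), then using one of the "read the last 2KB" approaches listed would be fine.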
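
A minimal sketch of the pipe idea, assuming a POSIX `tail` is on the PATH and using the `-n 10` spelling of the option instead of `-10`. The function name tail_contains is hypothetical, and the comparison is done on bytes to sidestep encoding questions.

```python
import subprocess

def tail_contains(path, key, n=10):
    """Return True if one of the last n lines of path equals key (bytes)."""
    # check=True raises CalledProcessError if tail fails, e.g. for a missing file.
    result = subprocess.run(
        ["tail", "-n", str(n), "--", path],
        stdout=subprocess.PIPE,
        check=True,
    )
    return key in result.stdout.splitlines()
```

This trades portability for simplicity: it will not work where `tail` is unavailable, such as a stock Windows install.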

Answer 4

Rya*_*rom 7

I think reading the last 2 KB or so of the file should make sure you get 10 lines, and shouldn't be too much of a resource hog.

file_handle = open("somefile")
file_handle.seek(0, 2)  # jump to the end first; tell() on a freshly opened file returns 0
file_size = file_handle.tell()
file_handle.seek(max(file_size - 2*1024, 0))

# this will get rid of trailing newlines, unlike readlines()
last_10 = file_handle.read().splitlines()[-10:]

assert len(last_10) == 10, "Only read %d lines" % len(last_10)

Answer 5

mha*_*wke 5

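Here is a version using mmap that seems pretty efficient. The big plus is that mmap will automatically handle the file-to-memory paging requirements for you.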

import os
from mmap import mmap

def lastn(filename, n):
    # open the file and mmap it
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(f.name))

    nlcount = 0
    i = m.size() - 1 
    if m[i] == '\n': n += 1
    while nlcount < n and i > 0:
        if m[i] == '\n': nlcount += 1
        i -= 1
    if i > 0: i += 2

    return m[i:].splitlines()

target = "target string"
print([l for l in lastn('somefile', 10) if l == target])
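
The function above indexes the map one character at a time and compares against '\n', which fits Python 2. As a hedged sketch of the same mmap idea for Python 3, the version below maps the file read-only and lets mmap.rfind locate the newlines. The name last_n_lines is hypothetical, an empty file is special-cased because it cannot be mapped, and lines are returned as bytes.

```python
import mmap

def last_n_lines(filename, n):
    """Return the last n lines of filename as a list of bytes objects."""
    with open(filename, "rb") as f:
        f.seek(0, 2)
        if n <= 0 or f.tell() == 0:   # nothing to return; empty files cannot be mapped
            return []
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            end = len(m)
            if m[end - 1:end] == b"\n":   # ignore the final line terminator
                end -= 1
            start = end
            # Step back one newline per wanted line; stop early at the file start.
            for _ in range(n):
                start = m.rfind(b"\n", 0, start)
                if start == -1:
                    break
            return m[start + 1:].splitlines()
```

Read-only access means the file does not need to be writable, unlike a map opened through 'r+'.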

Archived:	17 years, 3 months ago
Views:	36068
Last activity:	7 years, 6 months ago