Read only the end of a huge text file in Python

Question

Possible duplicates:
Get the last n lines of a file with Python, similar to tail
Read a file in reverse order using Python

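I have a file of about 15GB. It is a log file whose output I am supposed to analyze. I have already written a basic parser for a similar but much smaller file with just a few lines of logging, so parsing the strings is not the problem. The problem is the huge file and the amount of redundant data it contains.

Basically, I am trying to write a Python script that I can tell, for example, "give me the last 5000 lines of the file". That again is basic argument handling and so on, nothing special; I can do that.

But how do I define or tell the file reader to read ONLY the number of lines I specify from the end of the file? I am trying to skip the huuuuuuge amount of lines at the beginning of the file, since I am not interested in those and, to be honest, reading about 15GB of lines from a txt file takes too long. Is there a way to, err... start reading from the end of the file? Does that even make sense?

It all boils down to this: reading a 15GB file line by line takes too long. So I want to skip the data at the beginning that is redundant (at least for me) and read only the number of lines I want from the END of the file.

The obvious answer is to manually copy the last N lines from the file into another file, but is there a way to semi-automagically read just the last N lines of the file with Python?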

Answer 1

use*_*095 16

Farm it out to Unix:

import os
os.popen('tail -n 1000 filepath').read()

If you need access to stderr (and some other features), use subprocess.Popen instead of os.popen.
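
A minimal sketch of the subprocess variant mentioned above, assuming Python 3.7+ and a Unix-like system with `tail` on the PATH. It uses `subprocess.run`, which is a convenience wrapper around `Popen`; the function name `tail_with_subprocess` is made up for illustration.

```python
import subprocess


def tail_with_subprocess(path, n):
    """Return the last n lines of path by delegating to the Unix tail command."""
    result = subprocess.run(
        ["tail", "-n", str(n), "--", path],  # "--" guards against paths starting with "-"
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode the output to str using the locale encoding
        check=True,           # raise CalledProcessError if tail exits non-zero
    )
    return result.stdout.splitlines()
```

If `tail` fails (for example because the file does not exist), the `CalledProcessError` that is raised carries tail's message in its `stderr` attribute.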

Answer 2

Mar*_*ers 13

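You need to seek to the end of the file, then read some chunks in blocks from the end, counting lines, until you have found enough newlines to read your n lines.

Basically, you are re-implementing a simple form of tail.

Here is some lightly tested code that does just that: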

import collections
import errno
import io
import os

def lastlines(hugefile, n, bsize=2048):
    # get newlines type; text mode uses universal newlines by default
    with open(hugefile, 'r', errors='replace') as hfile:
        if not hfile.readline():
            return  # empty, no point
        sep = hfile.newlines  # After reading a line, python gives us this
    assert isinstance(sep, str), 'multiple newline types found, aborting'
    sep = sep.encode('ascii')  # the counting below works on bytes

    # find a suitable seek position in binary mode
    with open(hugefile, 'rb') as hfile:
        hfile.seek(0, os.SEEK_END)
        linecount = 0
        pos = 0

        while linecount <= n + 1:
            # read at least n lines + 1 more; we need to skip a partial line later on
            try:
                hfile.seek(-bsize, os.SEEK_CUR)           # go backwards
                linecount += hfile.read(bsize).count(sep) # count newlines
                hfile.seek(-bsize, os.SEEK_CUR)           # go back again
            except OSError as e:
                if e.errno == errno.EINVAL:
                    # Attempted to seek past the start, can't go further
                    bsize = hfile.tell()
                    hfile.seek(0, os.SEEK_SET)
                    pos = 0
                    linecount += hfile.read(bsize).count(sep)
                    break
                raise  # Some other I/O exception, re-raise
            pos = hfile.tell()

        # Decode from that byte offset onwards; wrapping the binary file avoids
        # seeking a text-mode file to an offset that tell() did not return
        hfile.seek(pos, os.SEEK_SET)
        text = io.TextIOWrapper(hfile, errors='replace')

        # We've located n lines *or more*; keep only the last n, which also
        # drops the partial first line and copes with a missing final newline
        for line in collections.deque(text, maxlen=n):
            yield line

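
For comparison, here is a more compact sketch of the same seek-from-the-end idea. It is not the answer's code: the name `tail_lines`, the `block_size` default, and the assumption of a UTF-8 file with `\n` or `\r\n` line endings are mine.

```python
import os


def tail_lines(path, n, block_size=4096, encoding="utf-8"):
    """Return the last n lines of the file at path as a list of strings."""
    if n <= 0:
        return []
    chunks = []
    newlines = 0
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        # Walk backwards block by block until more than n newlines have been
        # seen (one extra, so the first kept line is complete) or the start
        # of the file is reached.
        while pos > 0 and newlines <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            chunk = f.read(step)
            newlines += chunk.count(b"\n")
            chunks.append(chunk)
    # Blocks were collected back to front, so reverse them before joining.
    data = b"".join(reversed(chunks))
    # bytes.splitlines only splits on \n, \r and \r\n; keep the line endings.
    last = data.splitlines(keepends=True)[-n:]
    return [line.decode(encoding, errors="replace") for line in last]
```

Only the blocks needed for the last n lines are read, so the cost depends on n and the line length rather than on the 15GB in front of them. A file that uses bare `\r` line endings would make this sketch read everything, since it only counts `\n`.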

Answer 3

Mik*_*ike -2

The preferred approach at this point is to let Unix's tail do the job and to modify the Python script so that it accepts its input on standard input.

tail -n 1000 hugefile.txt | python magic.py

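It's nothing sexy, but at least it gets the job done. I found that the big file was too much of a burden to handle, at least for my Python skills, so it was much easier to add a pinch of *nix magic to cut the file down to size. tail was new to me. I learned something and figured out another way to use the terminal to my advantage again. Thank you, everyone.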
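
A rough sketch of what the Python side of that pipeline could look like. The real contents of magic.py are not shown in this thread, so `count_errors` and its "ERROR" check are purely hypothetical stand-ins for the actual log analysis.

```python
import sys


def count_errors(lines):
    """Hypothetical analysis step: count the lines that contain 'ERROR'."""
    return sum(1 for line in lines if "ERROR" in line)


def main(stream=None):
    # sys.stdin is iterated lazily, one line at a time, so the script only
    # ever sees the lines that tail pipes into it.
    if stream is None:
        stream = sys.stdin
    print(count_errors(stream))
```

To use this as the script in the pipeline, add a call to `main()` at the bottom of the file. Because the script never opens the log itself, the 15GB file is only touched by tail.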
