Read a file in reverse order using Python

Nim*_*mmy 112 python reverse file

How can I read a file in reverse order using Python? I want to read a file from the last line to the first.

小智 133

A correct, efficient answer written as a generator.

import os

def reverse_readline(filename, buf_size=8192):
    """A generator that returns the lines of a file in reverse order"""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # The first line of the buffer is probably not a complete line so
            # we'll save it and append it to the last line of the next buffer
            # we read
            if segment is not None:
                # If the previous chunk starts right from the beginning of line
                # do not concat the segment to the last line of new chunk.
                # Instead, yield the segment first 
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment

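  • After the edit, this works perfectly in Python 3.5. The best answer to this question. (6 upvotes)
  • This doesn't work for *text* files in Python >= 3.2, because for some reason seeks relative to the end of the file are no longer supported. It can be fixed by saving the file size returned by `fh.seek(0, os.SEEK_END)` and changing `fh.seek(-offset, os.SEEK_END)` to `fh.seek(file_size - offset)`. (4 upvotes)
  • Note that this may not work as expected for text files. Getting chunks correctly in reverse order only works for binary files. The problem is that for text files with a multi-byte encoding (such as `utf8`), `seek()` and `read()` refer to different sizes. That is probably also why a non-zero first argument to `seek()` relative to `os.SEEK_END` is not supported. (4 upvotes)
  • Reverted [this change](http://stackoverflow.com/review/suggested-edits/10428792) for Python 2, where `fh.seek()` returns `None`. (3 upvotes)
  • Simple: `'aöaö'.encode()` is `b'a\xc3\xb6a\xc3\xb6'`. If you save this to disk and then read it in text mode, `seek(2)` moves by two bytes, so `seek(2); read(1)` results in the error `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte`, but if you do `seek(0); read(2); read(1)`, you get the `'a'` you expect. That is, `seek()` is never encoding-aware, whereas `read()` is if you open the file in text mode. Now if you have `'aöaö' * 1000000`, your blocks will not be aligned properly. (2 upvotes)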
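
The mismatch described in these comments can be reproduced in a few lines. This is only a minimal sketch, assuming a UTF-8 encoded file; the helper name `demo_seek_vs_read` and the temporary file are illustrative and not part of the answer above.

```python
import os
import tempfile


def demo_seek_vs_read(text="aöaö"):
    """Show that seek() counts bytes while text-mode read() counts characters."""
    fd, path = tempfile.mkstemp()
    try:
        # 'ö' takes two bytes in UTF-8, so 4 characters become 6 bytes on disk.
        with os.fdopen(fd, "wb") as fh:
            fh.write(text.encode("utf-8"))
        with open(path, encoding="utf-8") as fh:
            # Sequential reads decode whole characters.
            first_two = fh.read(2)
            # Byte offset 2 is the second byte of 'ö' (c3 b6), not a character start.
            fh.seek(2)
            try:
                fh.read(1)
                decode_failed = False
            except UnicodeDecodeError:
                decode_failed = True
        return len(text), os.path.getsize(path), first_two, decode_failed
    finally:
        os.remove(path)
```

The function returns the character count, the byte count, the result of a sequential `read(2)`, and whether reading after `seek(2)` raised `UnicodeDecodeError`. Opening the file in binary mode and decoding complete lines afterwards avoids the problem.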

Mat*_*ner 69

for line in reversed(open("filename").readlines()):
    print line.rstrip()

In Python 3:

for line in reversed(list(open("filename"))):
    print(line.rstrip())

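  • Alas, this doesn't work if you can't fit the whole file in memory. (173 upvotes)
  • Also, while the posted code does answer the question, we should be careful to close the files we open. The `with` statement makes that fairly painless. (7 upvotes)
  • @MichaelDavidWatson: You can read a file in reverse without loading it into memory, but it is nontrivial and needs a lot of buffering to avoid wasting a large number of system calls. It will also perform very badly (although better than reading the whole file into memory if the file exceeds available memory). (3 upvotes)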
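
Following the comment about closing files, here is a sketch of the same idea wrapped in a `with` block. The function name `read_lines_reversed` and the `encoding` default are assumptions for illustration. It still loads every line into memory, so it only suits files that fit in RAM.

```python
def read_lines_reversed(path, encoding="utf-8"):
    """Return the file's lines, last line first, with trailing newlines stripped."""
    # The with-block closes the file even if reading raises an exception.
    with open(path, encoding=encoding) as fh:
        lines = fh.readlines()  # loads the whole file into memory
    return [line.rstrip("\n") for line in reversed(lines)]
```

A call such as `for line in read_lines_reversed("filename"): print(line)` then behaves like the Python 3 snippet above, without leaving the file handle open.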

Ber*_*pac 20

How about something like this:

import os


def readlines_reverse(filename):
    with open(filename) as qfile:
        qfile.seek(0, os.SEEK_END)
        position = qfile.tell()
        line = ''
        while position >= 0:
            qfile.seek(position)
            next_char = qfile.read(1)
            if next_char == "\n":
                yield line[::-1]
                line = ''
            else:
                line += next_char
            position -= 1
        yield line[::-1]


if __name__ == '__main__':
    for qline in readlines_reverse(raw_input()):
        print qline

Since the file is read character by character in reverse order, this works even on very large files, as long as individual lines fit in memory.


use*_*751 18

You can also use the Python module file_read_backwards.

After installing it via pip install file_read_backwards (v1.2.1), you can read the entire file backwards (line by line) efficiently like this:

#!/usr/bin/env python2.7

from file_read_backwards import FileReadBackwards

with FileReadBackwards("/path/to/file", encoding="utf-8") as frb:
    for l in frb:
         print l

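It supports the "utf-8", "latin-1", and "ascii" encodings.

Python 3 is also supported. Further documentation can be found at http://file-read-backwards.readthedocs.io/en/latest/readme.html

  • This works for multi-byte encodings such as UTF-8. The seek/read solutions do not: seek() counts in bytes, read() counts in characters. (2 upvotes)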
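
For comparison, here is a standard-library sketch of how a seek/read approach can stay safe with UTF-8: read in binary mode, split on the newline byte, and decode only complete lines. This is not how file_read_backwards is implemented; the function name `reverse_lines_binary` is hypothetical, and the sketch assumes `\n`-terminated lines and an encoding in which the byte 0x0A never appears inside a multi-byte character (true for UTF-8, Latin-1, and ASCII).

```python
import os


def reverse_lines_binary(path, buf_size=8192, encoding="utf-8"):
    """Yield the lines of a file from last to first, without the trailing newline."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        position = fh.tell()
        if position == 0:
            return  # empty file: nothing to yield
        tail = b""  # start of a line whose beginning has not been read yet
        at_end = True
        while position > 0:
            read_size = min(buf_size, position)
            position -= read_size
            fh.seek(position)
            chunk = fh.read(read_size) + tail
            lines = chunk.split(b"\n")
            if at_end and chunk.endswith(b"\n"):
                lines.pop()  # drop the empty piece after the file's final newline
            at_end = False
            # lines[0] may be cut off at the chunk boundary: carry it over.
            tail = lines[0]
            for line in reversed(lines[1:]):
                yield line.decode(encoding)
        # Whatever is left is the first line of the file.
        yield tail.decode(encoding)
```

Because every yielded piece starts right after a newline byte, decoding never begins in the middle of a character, even when `buf_size` is as small as 1.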

Aza*_*kov 13

The accepted answer won't work for large files that don't fit in memory (which is not a rare case).

As others have noted, @srohde's answer looks good, but it has the following issues:

  • opening the file looks redundant when we could pass a file object and leave it to the user to decide which encoding it should be read in,
  • even if we refactor it to accept a file object, it won't work for all encodings: we can pick a file with utf-8 encoding and non-ASCII contents like

й

pass buf_size equal to 1, and we will get

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: invalid start byte

of course the text may be larger, but buf_size may be chosen so that it leads to an obfuscated error like the one above,

  • we can't specify a custom line separator,
  • we can't choose to keep the line separator.

So, considering all these concerns, I've written separate functions:

  • one that works with byte streams,
  • a second one that works with text streams, delegates its underlying byte stream to the first one, and decodes the resulting lines.

First of all, let's define the following utility functions:

ceil_division for division with ceiling (in contrast with standard // division with floor; more info can be found in this thread)

def ceil_division(left_number, right_number):
    """
    Divides given numbers with ceiling.
    """
    return -(-left_number // right_number)

split for splitting a string by the given separator from the right end, with the ability to keep it:

def split(string, separator, keep_separator):
    """
    Splits given string by given separator.
    """
    parts = string.split(separator)
    if keep_separator:
        *parts, last_part = parts
        parts = [part + separator for part in parts]
        if last_part:
            return parts + [last_part]
    return parts

read_batch_from_end to read a batch from the right end of a binary stream

def read_batch_from_end(byte_stream, size, end_position):
    """
    Reads batch from the end of given byte stream.
    """
    if end_position > size:
        offset = end_position - size
    else:
        offset = 0
        size = end_position
    byte_stream.seek(offset)
    return byte_stream.read(size)

After that we can define a function for reading a byte stream in reverse order like

import functools
import itertools
import os
from operator import methodcaller, sub


def reverse_binary_stream(byte_stream, batch_size=None,
                          lines_separator=None,
                          keep_lines_separator=True):
    if lines_separator is None:
        lines_separator = (b'\r', b'\n', b'\r\n')
        lines_splitter = methodcaller(str.splitlines.__name__,
                                      keep_lines_separator)
    else:
        lines_splitter = functools.partial(split,
                                           separator=lines_separator,
                                           keep_separator=keep_lines_separator)
    stream_size = byte_stream.seek(0, os.SEEK_END)
    if batch_size is None:
        batch_size = stream_size or 1
    batches_count = ceil_division(stream_size, batch_size)
    remaining_bytes_indicator = itertools.islice(
            itertools.accumulate(itertools.chain([stream_size],
                                                 itertools.repeat(batch_size)),
                                 sub),
            batches_count)
    try:
        remaining_bytes_count = next(remaining_bytes_indicator)
    except StopIteration:
        return

    def read_batch(position):
        result = read_batch_from_end(byte_stream,
                                     size=batch_size,
                                     end_position=position)
        while result.startswith(lines_separator):
            try:
                position = next(remaining_bytes_indicator)
            except StopIteration:
                break
            result = (read_batch_from_end(byte_stream,
                                          size=batch_size,
                                          end_position=position)
                      + result)
        return result

    batch = read_batch(remaining_bytes_count)
    segment, *lines = lines_splitter(batch)
    yield from lines[::-1]
    for remaining_bytes_count in remaining_bytes_indicator:
        batch = read_batch(remaining_bytes_count)
        lines = lines_splitter(batch)
        if batch.endswith(lines_separator):
            yield segment
        else:
            lines[-1] += segment
        segment, *lines = lines
        yield from lines[::-1]
    yield segment

and finally a function for reversing a text file can be defined like:

import codecs


def reverse_file(file, batch_size=None,
                 lines_separator=None,
                 keep_lines_separator=True):
    encoding = file.encoding
    if lines_separator is not None:
        lines_separator = lines_separator.encode(encoding)
    yield from map(functools.partial(codecs.decode,
                                     encoding=encoding),
                   reverse_binary_stream(
                           file.buffer,
                           batch_size=batch_size,
                           lines_separator=lines_separator,
                           keep_lines_separator=keep_lines_separator))

Tests

Preparations

I generated 4 files using the fsutil command:

  1. empty.txt with no contents, size 0MB
  2. tiny.txt with a size of 1MB
  3. small.txt with a size of 10MB
  4. large.txt with a size of 50MB

I also refactored @srohde's solution to work with a file object instead of a file path.

Test script

from timeit import Timer

repeats_count = 7
number = 1
create_setup = ('from collections import deque\n'
                'from __main__ import reverse_file, reverse_readline\n'
                'file = open("{}")').format
srohde_solution = ('with file:\n'
                   '    deque(reverse_readline(file,\n'
                   '                           buf_size=8192),'
                   '          maxlen=0)')
azat_ibrakov_solution = ('with file:\n'
                         '    deque(reverse_file(file,\n'
                         '                       lines_separator="\\n",\n'
                         '                       keep_lines_separator=False,\n'
                         '                       batch_size=8192), maxlen=0)')
print('reversing empty file by "srohde"',
      min(Timer(srohde_solution,
                create_setup('empty.txt')).repeat(repeats_count, number)))
print('reversing empty file by "Azat Ibrakov"',
      min(Timer(azat_ibrakov_solution,
                create_setup('empty.txt')).repeat(repeats_count, number)))
print('reversing tiny file (1MB) by "srohde"',
      min(Timer(srohde_solution,
                create_setup('tiny.txt')).repeat(repeats_count, number)))
print('reversing tiny file (1MB) by "Azat Ibrakov"',
      min(Timer(azat_ibrakov_solution,
                create_setup('tiny.txt')).repeat(repeats_count, number)))
print('reversing small file (10MB) by "srohde"',
      min(Timer(srohde_solution,
                create_setup('small.txt')).repeat(repeats_count, number)))
print('reversing small file (10MB) by "Azat Ibrakov"',
      min(Timer(azat_ibrakov_solution,
                create_setup('small.txt')).repeat(repeats_count, number)))
print('reversing large file (50MB) by "srohde"',
      min(Timer(srohde_solution,
                create_setup('large.txt')).repeat(repeats_count, number)))
print('reversing large file (50MB) by "Azat Ibrakov"',
      min(Timer(azat_ibrakov_solution,
                create_setup('large.txt')).repeat(repeats_count, number)))

Note: I've used the collections.deque class to exhaust the generator.

Outputs

For PyPy 3.5 on Windows 10:

reversing empty file by "srohde" 8.31e-05
reversing empty file by "Azat Ibrakov" 0.00016090000000000028
reversing tiny file (1MB) by "srohde" 0.160081
reversing tiny file (1MB) by "Azat Ibrakov" 0.09594989999999998
reversing small file (10MB) by "srohde" 8.8891863
reversing small file (10MB) by "Azat Ibrakov" 5.323388100000001
reversing large file (50MB) by "srohde" 186.5338368
reversing large file (50MB) by "Azat Ibrakov" 99.07450229999998

For CPython 3.5 on Windows 10:

reversing empty file by "srohde" 3.600000000000001e-05
reversing empty file by "Azat Ibrakov" 4.519999999999958e-05
reversing tiny file (1MB) by "srohde" 0.01965560000000001
reversing tiny file (1MB) by "Azat Ibrakov" 0.019207699999999994
reversing small file (10MB) by "srohde" 3.1341862999999996
reversing small file (10MB) by "Azat Ibrakov" 3.0872588000000007
reversing large file (50MB) by "srohde" 82.01206720000002
reversing large file (50MB) by "Azat Ibrakov" 82.16775059999998

So, as we can see, it performs like the original solution, but it is more general and free of the disadvantages listed above.


Advertisement

I've added this to version 0.3.0 of the lz package (requires Python 3.5+), which has many well-tested functional/iteration utilities.

It supports all standard encodings (maybe except utf-7, since it is hard for me to define a strategy for generating strings encodable with it).
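
The countdown of batch end positions that `reverse_binary_stream` builds with `itertools.accumulate` and `sub` can be hard to picture, so here is a small sketch that exposes it on its own. `batch_end_positions` is a hypothetical helper written for illustration; it mirrors the construction of `remaining_bytes_indicator` but returns a list for easy inspection.

```python
import itertools
from operator import sub


def ceil_division(left_number, right_number):
    """Divide with rounding up, using floor division on negated input."""
    return -(-left_number // right_number)


def batch_end_positions(stream_size, batch_size):
    """List the end offset of every batch, starting from the end of the stream."""
    batches_count = ceil_division(stream_size, batch_size)
    # accumulate(..., sub) yields stream_size, stream_size - batch_size, and so on.
    countdown = itertools.accumulate(
        itertools.chain([stream_size], itertools.repeat(batch_size)), sub)
    # islice stops the otherwise infinite countdown after the last needed batch.
    return list(itertools.islice(countdown, batches_count))
```

For a 10-byte stream read in 4-byte batches this gives the end positions 10, 6, and 2, which `read_batch_from_end` turns into reads of bytes 6 to 10, 2 to 6, and 0 to 2. An empty stream gives no positions at all, which is why the first `next()` call in `reverse_binary_stream` is guarded against `StopIteration`.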


gho*_*g74 8

for line in reversed(open("file").readlines()):
    print line.rstrip()

If you are on Linux, you can use the tac command.

$ tac file

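You can find 2 recipes on ActiveState here and here.

  • I wonder whether reversed() consumes the whole sequence before iterating. The docs say a `__reversed__()` method is needed, but Python 2.5 doesn't complain about a custom class that lacks one. (2 upvotes)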
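
On the question in this comment: `reversed()` accepts an object that either defines `__reversed__()` or supports the sequence protocol (`__len__()` plus `__getitem__()` with integer indices), which is why a custom class without `__reversed__()` can still work. The sketch below illustrates this with two made-up names, `Countdown` and `can_reverse`. A plain iterator, such as a file object, satisfies neither condition, so the answers above call `readlines()` or `list()` first, and that call is what loads the whole file into memory.

```python
class Countdown:
    """A minimal sequence: defines only __len__ and __getitem__."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index


def can_reverse(obj):
    """Return True if reversed() accepts obj, False if it raises TypeError."""
    try:
        reversed(obj)
    except TypeError:
        return False
    return True
```

`reversed(Countdown(3))` walks the indices from `len - 1` down to 0 lazily, while `can_reverse(iter([1, 2, 3]))` is false because an iterator has no length to count down from.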

Ign*_*ams 8

import os
import re
import sys

def filerev(somefile, buffer=0x20000):
  somefile.seek(0, os.SEEK_END)
  size = somefile.tell()
  lines = ['']
  rem = size % buffer
  pos = max(0, (size // buffer - 1) * buffer)
  while pos >= 0:
    somefile.seek(pos, os.SEEK_SET)
    data = somefile.read(rem + buffer) + lines[0]
    rem = 0
    lines = re.findall('[^\n]*\n?', data)
    ix = len(lines) - 2
    while ix > 0:
      yield lines[ix]
      ix -= 1
    pos -= buffer
  else:
    yield lines[0]

with open(sys.argv[1], 'r') as f:
  for line in filerev(f):
    sys.stdout.write(line)

  • The question didn't mention performance, so I can't nitpick the performance disaster that is regular expressions :P (3 upvotes)
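
To see what the pattern in `filerev` produces, and why its inner loop starts at `len(lines) - 2`, here is a small sketch. `split_keeping_newlines` is a hypothetical wrapper around the same regular expression, not part of the answer above.

```python
import re

# Same pattern as in filerev: a run of non-newline characters plus an optional newline.
LINE_PATTERN = re.compile('[^\n]*\n?')


def split_keeping_newlines(data):
    """Return the pieces that filerev's pattern produces for a chunk of text."""
    return LINE_PATTERN.findall(data)
```

The pattern can match the empty string, so `findall` always appends one empty match at the end of the data. That is the element `filerev` skips by starting at `len(lines) - 2`, while `lines[0]` is held back because it may continue into the previous chunk. Where the regular expression is a concern, `str.splitlines(keepends=True)` returns similar pieces without the trailing empty string, although it also splits on separators other than `\n`.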