Using a Python generator to process large text files

Question

use*_*260 3 python generator chunks large-files

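I'm new to generators and have read up on them a bit, but I need some help processing large text files in chunks. I know this topic has been covered before, but the example code comes with very little explanation, and it is hard to modify code without understanding what is going on.

My problem is fairly simple: I have a series of large text files containing human genome sequencing data in the following format: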

chr22   1   0
chr22   2   0
chr22   3   1
chr22   4   1
chr22   5   1
chr22   6   2

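The files are between 1 GB and ~20 GB in size, which is too large to read into RAM. So I would like to read the lines in chunks/bins of, say, 10000 lines at a time, so that I can perform calculations on the final column in bins of that size.

Based on this link, I wrote the following: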

def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    bin_size=5000
    start=0
    end=start+bin_size

    # Read a block from the file: data
    while True:
        data = file_object.readlines(end) 
        if not data:
            break
        start=start+bin_size
        end=end+bin_size
        yield data


def process_file(path):

    try:
        # Open a connection to the file
        with open(path) as file_handler:
            # Create a generator object for the file: gen_file
            for block in read_large_file(file_handler):
                print(block)
                # process block

    except (IOError, OSError):
        print("Error opening / processing file")    
    return    

if __name__ == '__main__':
    path='C:/path_to/input.txt'
    process_file(path)

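In 'process_block' I expected the returned 'block' object to be a list 10000 elements long, but it isn't. The first list has 843 elements and the second has 2394.

I want to get back 'N' lines in each block, but I'm very confused about what is happening here.

The solution here seems like it could do the trick, but again I don't understand how to modify it to read N lines at a time.

This one here also looks like a really good solution, but again there isn't enough background explanation for me to understand it well enough to modify the code.

Any help would be really appreciated.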

Answer 1

paw*_*moy 10

Instead of playing with offsets in the file, try building and yielding lists of 10000 elements from a loop:

def read_large_file(file_handler, block_size=10000):
    block = []
    for line in file_handler:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []

    # don't forget to yield the last block
    if block:
        yield block

# `path` is the input file path, as defined in the question's code
with open(path) as file_handler:
    for block in read_large_file(file_handler):
        print(block)
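
As for why the original version returned blocks of 843 and 2394 lines: the argument to `readlines()` is a size hint measured in bytes/characters, not a line count, and the question's code also increases `end` on every pass, so each call asks for a larger amount. The sketch below illustrates this on a small in-memory file; the `sample` data and the hint values 50 and 100 are made up for the demonstration, and the exact number of lines returned can differ between Python versions and file types.

```python
import io

# Hypothetical stand-in for the real file: 100 lines of 11 characters each.
sample = io.StringIO("".join(f"chr22\t{i:02d}\t0\n" for i in range(100)))

# The argument is a size hint in characters, not a number of lines,
# so asking for "50" returns far fewer than 50 lines.
first = sample.readlines(50)
second = sample.readlines(100)

# Print the line count and total character count of each block.
print(len(first), sum(len(line) for line in first))
print(len(second), sum(len(line) for line in second))
```

Counting lines explicitly, as in the loop above, avoids depending on this behaviour.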

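To connect this to the calculation on the last column mentioned in the question, here is a minimal sketch. It is a variant that uses `itertools.islice` rather than the loop above, and it makes several assumptions: the columns are whitespace-separated, the last column is an integer, and the per-block calculation is a mean. The names `read_in_blocks` and `last_column_mean` are hypothetical, and the six sample rows are the ones shown in the question.

```python
import io
from itertools import islice


def read_in_blocks(file_handler, block_size=10000):
    """Yield lists of at most block_size lines from an open file."""
    while True:
        # islice pulls up to block_size lines without reading the whole file
        block = list(islice(file_handler, block_size))
        if not block:
            return
        yield block


def last_column_mean(block):
    """Mean of the last whitespace-separated field (assumed to be an integer)."""
    values = [int(line.split()[-1]) for line in block if line.strip()]
    return sum(values) / len(values) if values else None


# Small in-memory example using the rows from the question, 4 lines per block.
sample = io.StringIO(
    "chr22\t1\t0\nchr22\t2\t0\nchr22\t3\t1\n"
    "chr22\t4\t1\nchr22\t5\t1\nchr22\t6\t2\n"
)
means = [last_column_mean(block) for block in read_in_blocks(sample, block_size=4)]
print(means)  # [0.5, 1.5]
```

With a real file, `sample` would be replaced by the handle from `with open(path) as file_handler:`, and only one block of lines is held in memory at a time.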

Archived:	7 years, 10 months ago
Views:	4763
Last activity:	6 years, 5 months ago