How do I quickly get the last line of a huge csv file (48M lines)?

Question

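I have a csv file that keeps growing until it reaches roughly 48M lines.

Before appending a new line to it, I need to read the last line.

I tried the code below, but it is too slow and I need a faster alternative: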

def return_last_line(filepath):    
    with open(filepath,'r') as file:        
        for x in file:
            pass
        return x        
return_last_line('lala.csv')

Answer 1

Ser*_*nho 8

Here is my take on it in Python: I created a function that lets you choose how many of the last lines you want, because the last lines may be empty.

def get_last_line(file, how_many_last_lines = 1):

    # open your file using with: safety first, kids!
    with open(file, 'r') as file:

        # find the position of the end of the file: end of the file stream
        end_of_file = file.seek(0,2)
        
        # set your stream at the end: seek the final position of the file
        file.seek(end_of_file)             
        
        # trace back each character of your file in a loop
        n = 0
        for num in range(end_of_file+1):            
            file.seek(end_of_file - num)    
           
            # save the last characters of your file as a string: last_line
            last_line = file.read()
           
            # count how many '\n' you have in your string: 
            # if you have 1, you are in the last line; if you have 2, you have the two last lines
            if last_line.count('\n') == how_many_last_lines: 
                return last_line
get_last_line('lala.csv', 2)

This lala.csv has 48 million lines, as in your example. It took me 0 seconds to get the last line.

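This is actually incorrect. The '\n' count is one too few for Unix text files. A line is *terminated* by \n, so a text file ends with '\n', and by default your `get_last_line` returns only the *line terminator* of the last line, not the last line itself. (3 upvotes)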
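
One way to address the point raised in this comment is to strip the trailing line terminator before looking for the previous '\n'. The following is only a minimal sketch, not the answerer's code: the name `read_last_line` and the parameters `block_size` and `encoding` are assumptions made for illustration. It opens the file in binary mode and reads fixed-size blocks backwards from the end, so it never scans the whole file.

```python
import os


def read_last_line(path, block_size=4096, encoding="utf-8"):
    """Return the last non-empty line of `path`, without its line terminator.

    Hypothetical helper: reads the file backwards in blocks of `block_size` bytes.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            # Ignore trailing terminators so the final "\n" is not
            # mistaken for the start of the last line.
            stripped = tail.rstrip(b"\r\n")
            newline_at = stripped.rfind(b"\n")
            if newline_at != -1:
                return stripped[newline_at + 1:].decode(encoding)
        # No earlier newline found: the file holds at most one line.
        return tail.rstrip(b"\r\n").decode(encoding)
```

Splitting on the byte b"\n" before decoding keeps the sketch safe for UTF-8 input, because that byte never appears inside a multi-byte UTF-8 sequence.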

Answer 2

Ant*_*ala 7

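Here is code for finding the last line of a file using mmap. It should work on Unixen and their derivatives as well as on Windows (I have only tested it on Linux, please tell me if it works on Windows too ;), i.e. pretty much everywhere it matters. Since it uses memory-mapped I/O, it can be expected to perform quite well.

It expects that you can map the entire file into the processor's address space - that should be fine everywhere for a 50M file, but for a 5G file you would need a 64-bit processor or some extra slicing.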

import mmap


def iterate_lines_backwards(filename):
    with open(filename, "rb") as f:
        # memory-map the file, size 0 means whole file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = len(mm)

            while start > 0:
                start, prev = mm.rfind(b"\n", 0, start), start
                chunk = mm[start + 1:prev + 1]
                # if the last character in the file was a '\n',
                # technically the empty string after that is not a line.
                if chunk:
                    yield chunk.decode()


def get_last_nonempty_line(filename):
    for line in iterate_lines_backwards(filename):
        if stripped := line.rstrip("\r\n"):
            return stripped


print(get_last_nonempty_line("datafile.csv"))

As a bonus, there is a generator, iterate_lines_backwards, that efficiently iterates over the lines of a file in reverse, for any number of lines:

print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
    print(l, end="")
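
The "extra slicing" mentioned above for files that do not fit into the address space could look roughly like the sketch below. It is an illustration under assumptions, not part of the original answer: the name `last_line_from_tail` and the `window` parameter are hypothetical. It maps only the tail of the file by passing an `offset` to `mmap.mmap`, which has to be a multiple of `mmap.ALLOCATIONGRANULARITY`.

```python
import mmap
import os


def last_line_from_tail(filename, window=1 << 20):
    """Map roughly the last `window` bytes and return the last non-empty line.

    Hypothetical helper; returns None for an empty file.
    """
    size = os.path.getsize(filename)
    if size == 0:
        return None  # an empty file cannot be memory-mapped
    # The mmap offset must be a multiple of the allocation granularity,
    # so round the start of the window down to such a boundary.
    gran = mmap.ALLOCATIONGRANULARITY
    offset = max(0, size - window) // gran * gran
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), size - offset,
                       access=mmap.ACCESS_READ, offset=offset) as mm:
            end = len(mm)
            # Skip trailing line terminators.
            while end > 0 and mm[end - 1] in b"\r\n":
                end -= 1
            start = mm.rfind(b"\n", 0, end)
            if start == -1 and offset > 0:
                # Assumption: the caller retries with a larger window.
                raise ValueError("last line is longer than the mapped window")
            return mm[start + 1:end].decode()
```

Because of the rounding, the mapped region can be somewhat larger than `window`, but it stays bounded regardless of the total file size.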

Archived:	4 years, 11 months ago
Views:	3292
Last recorded:	4 years, 11 months ago