How can I use os.posix_fadvise to prevent file caching on Linux?

Question

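I have a script that usually runs against an entire block device. If every block it reads gets cached, it will evict data that other applications are using. To prevent that, I added support for mmap(2) and posix_fadvise(2), with the following logic:

The function that signals that a block is no longer needed: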

def advise_dont_need(fd, offset, length):
    """
    Announce that data in a particular location is no longer needed.

    Arguments:
    - fd (int): File descriptor.
    - offset (int): Beginning of the unneeded data.
    - length (int): Length of the unneeded data.
    """
    # TODO: macOS support
    if hasattr(os, "posix_fadvise"):
        # posix_fadvise(2) states that "If the application requires that data
        # be considered for discarding, then offset and len must be
        # page-aligned." When this code aligns the offset and length, the
        # advised area is widened under the presumption it is better to discard
        # more memory than needed than to leak it which could cause resource
        # issues.

        # If the offset is unaligned, extend it toward 0 to align it and adjust
        # the length to compensate for the change.
        aligned_offset = offset - offset % PAGE_SIZE
        length += offset - aligned_offset
        offset = aligned_offset

        # If the length is unaligned, widen it to align it.
        length -= length % -PAGE_SIZE

        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)

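The alignment arithmetic in the function above can be checked in isolation. The following is a minimal sketch that pulls it into a standalone helper; the name align_range and the default of mmap.PAGESIZE are assumptions for illustration, since the snippet does not show how PAGE_SIZE is defined.

```python
import mmap


def align_range(offset, length, page_size=mmap.PAGESIZE):
    """Widen (offset, length) so that both ends fall on page boundaries.

    Hypothetical helper mirroring the arithmetic in advise_dont_need;
    page_size stands in for the PAGE_SIZE constant of the original code.
    """
    # Move the start down to the previous page boundary and grow the
    # length by the same amount so the end of the range does not move.
    aligned_offset = offset - offset % page_size
    length += offset - aligned_offset

    # In Python, x % -n lies in (-n, 0], so subtracting it rounds x up
    # to the next multiple of n (and leaves exact multiples unchanged).
    length -= length % -page_size
    return aligned_offset, length


if __name__ == "__main__":
    print(align_range(5000, 100, 4096))  # (4096, 4096)
```

One detail worth keeping in mind when reading the logs further down: according to posix_fadvise(2), a length of 0 means "until the end of the file", so a call with length=0 covers everything from that offset onward rather than nothing.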

The logic that reads the file:

            with open(path, "rb", buffering=0) as file, \
              ProgressBar("Reading file") as progress, timer() as read_loop:
                size = file_size(file)

                if mmap_file:
                    # At the time of this writing, mmap.mmap in CPython uses
                    # st_size to determine the size of a file which will not
                    # work with every file type which is why file size
                    # autodetection (size=0) cannot be used here.
                    fd = file.fileno()
                    view = mmap.mmap(fd, size, prot=mmap.PROT_READ)

                try:
                    while writer.error is None and hash_queue.error is None:
                        # Skip offsets that are already in the block map.
                        if offset in blocks:
                            while offset in blocks:
                                if mmap_file:
                                    advise_dont_need(fd, offset, block_size)

                                offset += block_size

                            if not mmap_file:
                                file.seek(offset)

                        if mmap_file:
                            block = view[offset:offset + block_size]
                            advise_dont_need(fd, offset, len(block))
                        else:
                            block = file.read(block_size)

                        if not block:
                            break

                        bytes_read += len(block)

                        while hash_queue.error is None:
                            try:
                                hash_queue.put((offset, block), timeout=0.1)
                                offset += len(block)
                                progress.update(offset / size)
                                break
                            except queue.Full:
                                pass
                finally:
                    if mmap_file:
                        view.close()


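When I run the script and watch the output of free -h, I can see buffer cache usage increase despite this logic. Is my logic wrong, or is this a consequence of posix_fadvise(2) being advisory rather than mandatory?

Here are some logs showing the length and offset values near the end of a run with block_size set to 1048576: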

offset=107296587776; length=1048576
offset=107297636352; length=1048576
offset=107298684928; length=1048576
offset=107299733504; length=1048576
offset=107300782080; length=1048576
offset=107301830656; length=1048576
offset=107302879232; length=1048576
offset=107303927808; length=1048576
offset=107304976384; length=0


Answer 1

Tec*_*eks 1

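The claim that your script will cause application data to be evicted is not entirely accurate, and this is not quite how posix_fadvise is meant to be used. The Linux buffer cache and page cache work in a somewhat more complicated way.

First, terminology:

Buffer cache - used for raw block device access, usually outside of a filesystem. The unit is a block. A good way to test it is dd if=/dev/... (on a block device) of=/dev/null. Running this several times under time(1) should show noticeably shorter times from the second run onward.
Page cache - used for filesystem-based access, traditionally in units of whole pages and indexed by inode, so only one copy is kept per file. A good way to test it is cp or cat, or really any access to a large file. Again, a few runs under time(1) should show shorter times and increased page cache usage (but no more than once for the same file).

Linux will try to maximize its use of both caches. The usual way to view usage is through free(1):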

   [localhost ~]$ free
              total        used        free      shared  buff/cache   available
Mem:        3995408      633820     2241896        5820     1119692     3106196
Swap:       2138108      422408     1715700


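Here the buffer cache is accounted for separately and is not counted as "used", because "used" refers to memory held by processes. If processes or applications actually need memory, they take priority and the buffers/cache are purged. You can test this with a simple program that calls malloc and memset, and watch the cache size shrink (down to a minimum of a few megabytes). Other versions of free used to show a "-/+ buffers/cache" line, which made this clearer.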
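The figures that free(1) prints come from /proc/meminfo, so they can also be sampled from a script while a read is in progress. This is a minimal sketch, assuming a Linux system; the function name read_meminfo is made up for illustration.

```python
import os


def read_meminfo(path="/proc/meminfo"):
    """Parse a meminfo-style file into a dict of integer values.

    Most entries are reported in kB; a few (such as HugePages_Total)
    are plain counts. The unit suffix is ignored here.
    """
    info = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            fields = rest.split()
            if fields:
                info[key.strip()] = int(fields[0])
    return info


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    meminfo = read_meminfo()
    # "Buffers" is the block-device side and "Cached" the file side of
    # what free(1) lumps together as buff/cache.
    for key in ("MemFree", "Buffers", "Cached", "MemAvailable"):
        print(key, meminfo.get(key), "kB")
```

Sampling Buffers and Cached separately before and after a run makes it easier to tell which of the two caches described above is actually growing.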

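Application memory usage consists of anonymous memory (the sum of malloc(3) and the like) and file-mapped memory (mmap(2) with MAP_FILE). The latter, however, counts as file cache rather than application memory. As long as such file-mapped memory is clean (read-only or unmodified), it can be evicted safely. The former (anonymous memory), if it has to be evicted, can only go to swap, because it has no backing file.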

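The posix_fadvise(2) call you are using is indeed only advice. If there is enough free memory, your advice will have no effect: you say you don't need the data, but you did in fact read those offsets, so Linux will cache the file data. There is enough memory to hold it and you might end up using it again, so why not cache it? This does not evict any anonymous memory and does not create significant memory pressure, and if your data is later found in the cache it saves orders of magnitude of time (keeping it spares an I/O to disk or flash, which is on the order of 1000 or more times slower).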

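Another way to look at it: posix_fadvise with DONTNEED is typically meant for a huge file of which you will only access certain parts, so you tell the system not to cache the ranges you will not be using. Once you do use them, the advice becomes irrelevant.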
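To make the pattern under discussion concrete, here is a minimal sketch of a plain read loop that issues POSIX_FADV_DONTNEED once each chunk has been consumed. The function name read_with_dontneed and the 1 MiB default chunk size are illustrative and not taken from the question. Whether the kernel actually drops the pages remains its own decision.

```python
import os


def read_with_dontneed(path, chunk_size=1 << 20):
    """Yield (offset, data) chunks, hinting after each one is consumed
    that its range is no longer needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            data = os.pread(fd, chunk_size, offset)
            if not data:
                break
            yield offset, data
            # Execution resumes here only after the caller asks for the
            # next chunk, i.e. once it is done with this one.
            if hasattr(os, "posix_fadvise"):
                try:
                    os.posix_fadvise(fd, offset, len(data),
                                     os.POSIX_FADV_DONTNEED)
                except OSError:
                    # The call is only a hint; a failure is ignored here.
                    pass
            offset += len(data)
    finally:
        os.close(fd)
```

With a chunk size that is a multiple of the page size, every range except possibly the last one is already page-aligned, so no widening is needed in this sketch.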

By the way, you can also use madvise(2) directly on the mmap(2)ed region, with MADV_DONTNEED and so on.
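In Python this is reachable through mmap.madvise, available since Python 3.8 on platforms that provide madvise(2). The following is a minimal sketch for a regular file; the function name read_blocks_mmap is hypothetical. It assumes block_size is a multiple of mmap.PAGESIZE so that each start offset is page-aligned, and it uses st_size, which the question notes does not work for every file type.

```python
import mmap
import os


def read_blocks_mmap(path, block_size=1 << 20):
    """Yield (offset, bytes) blocks from a read-only mapping and tell
    the kernel after each block that its pages are no longer needed."""
    with open(path, "rb", buffering=0) as file:
        size = os.fstat(file.fileno()).st_size
        if size == 0:
            return  # mmap cannot map an empty file
        with mmap.mmap(file.fileno(), size, prot=mmap.PROT_READ) as view:
            can_advise = (hasattr(view, "madvise")
                          and hasattr(mmap, "MADV_DONTNEED"))
            for offset in range(0, size, block_size):
                # Slicing copies the data out of the mapping, so the
                # block stays valid after the pages are released.
                block = view[offset:offset + block_size]
                yield offset, block
                if can_advise:
                    view.madvise(mmap.MADV_DONTNEED, offset, len(block))
```

Because the slice is a copy, the hint can be issued as soon as the consumer asks for the next block without invalidating data already handed out.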

Archived:	4 years, 7 months ago
Views:	152
Last activity:	4 years, 7 months ago