Does Python have an 8 KiB file I/O cache?

Question

use*_*424 1 python windows file-io caching python-3.x

I am investigating file I/O performance in Python 3.6.0. Consider this script, which contains 3 tests:

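#!python3

import random, string

# 1,000,000 random lowercase ASCII characters, encoded to bytes
strs = ''.join(random.choice(string.ascii_lowercase) for i in range(1000000))
strb = bytes(strs, 'latin-1')

# 'w+b' opens the file for buffered binary reading and writing
inf = open('bench.txt', 'w+b')
inf.write(strb)

# Test 1: read one byte less than 8 KiB, three times
for t in range(3):
    inf.seek(0)
    inf.read(8191)

# Test 2: read exactly 8 KiB, three times
for t in range(3):
    inf.seek(0)
    inf.read(8192)

# Test 3: read one byte more than 8 KiB, three times
for t in range(3):
    inf.seek(0)
    inf.read(8193)

inf.close()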


Procmon shows the following operations (the lines starting with # are my comments):

  # Initial write
Offset: 0, Length: 1.000.000
  # The 3 8191-long reads only produce one syscall due to caching:
Offset: 0, Length: 8.192
  # However, if the read length is exactly 8192, python doesn't take advantage:
Offset: 0, Length: 8.192
Offset: 0, Length: 8.192
Offset: 0, Length: 8.192
  # Due to caching, the first syscall of the first read of the last loop is missing.
Offset: 8.192, Length: 8.192
Offset: 0, Length: 8.192
Offset: 8.192, Length: 8.192
Offset: 0, Length: 8.192
Offset: 8.192, Length: 8.192
 # Afterwards, 2 syscalls per read are produced on the 8193-long reads.


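First, it is clear that Python reads the file in chunks that are multiples of 8 KiB.

I suspect that Python implements a cache buffer that stores the last 8 KiB block it read, and that if you read within the same 8 KiB range several times in a row, it simply returns the data from that buffer, trimmed to the requested length.

Can someone confirm that Python actually implements this mechanism?

If so, this means that Python cannot detect changes made to that block by an external application unless you somehow invalidate the cache manually. Is that correct? Is there perhaps a way to disable this mechanism?

Alternatively, why do reads of exactly 8192 bytes not benefit from the cache?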

Answer 1

Mar*_*ers 6

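Yes, the default buffer size is 8 KiB. See io.DEFAULT_BUFFER_SIZE:

io.DEFAULT_BUFFER_SIZE
An int containing the default buffer size used by the module's buffered I/O classes. open() uses the file's blksize (as obtained by os.stat()) if possible.

and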

>>> import io
>>> io.DEFAULT_BUFFER_SIZE
8192


and the module source code:

#define DEFAULT_BUFFER_SIZE (8 * 1024)  /* bytes */


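If you make changes to the file through the BufferedIOBase interface or wrapper, the buffer is updated automatically (opening a file in binary mode produces a BufferedIOBase subclass, one of BufferedReader, BufferedWriter, or BufferedRandom).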
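To see which wrapper a given binary mode produces, a minimal sketch along these lines should do. The scratch file and the `wrappers` dictionary are made up for this illustration; only `open()`, the `io` classes, and the `.raw` attribute come from the standard library.

```python
import io
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp()
os.close(fd)

wrappers = {}
raw_types = {}
try:
    for mode in ("rb", "wb", "w+b"):
        with open(path, mode) as f:
            # The object returned by open() in a buffered binary mode
            wrappers[mode] = type(f)
            # Each buffered wrapper keeps the unbuffered file object in .raw
            raw_types[mode] = type(f.raw)
finally:
    os.remove(path)

for mode in wrappers:
    print(mode, wrappers[mode].__name__, raw_types[mode].__name__)
```

The question's script opens the file with 'w+b', so it is working with a BufferedRandom object.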

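In your second case, your seek() calls flushed that buffer, because you sought outside the 'current' block range (the current position was 8192, the first byte of the second buffered block, and you sought back to 0, which is the first byte of the first buffered block). See the source code of BufferedIOBase.seek() for more details.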
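If Procmon is not at hand, the number of requests that reach the raw layer can be approximated from inside Python by counting readinto() calls on the raw file object. This is only a sketch: `CountingFileIO` and `count_raw_reads` are hypothetical helpers, the buffer size is passed explicitly as 8192 to match the question, and the exact counts depend on the interpreter implementation and version, so none are claimed here.

```python
import io
import os
import tempfile


class CountingFileIO(io.FileIO):
    """Raw file object that counts readinto() calls (hypothetical helper)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.raw_reads = 0

    def readinto(self, buffer):
        # Called by the buffered layer when it needs data from the file.
        self.raw_reads += 1
        return super().readinto(buffer)


def count_raw_reads(path, size, repeats=3):
    """Repeat seek(0) + read(size) and return the number of raw reads."""
    raw = CountingFileIO(path, "r")
    with io.BufferedReader(raw, buffer_size=8192) as f:
        for _ in range(repeats):
            f.seek(0)
            f.read(size)
    return raw.raw_reads


# Hypothetical scratch file filled with 1,000,000 bytes, as in the question.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(os.urandom(1000000))
    counts = {size: count_raw_reads(path, size) for size in (8191, 8192, 8193)}
finally:
    os.remove(path)

for size, n in counts.items():
    print("read(%d) x3 -> %d raw read(s)" % (size, n))
```

Each size is measured on a freshly opened file, so the numbers are not affected by state left over from a previous loop, unlike the original script where all three tests share one file object.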

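If you need to edit the underlying file from another process, using seek() is a good way to ensure the buffer is dropped before you try to read again, or you can bypass the buffer and go to the underlying RawIOBase implementation via the BufferedIOBase.raw attribute.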
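Another way to take the buffer out of the picture is to open the file with buffering=0, which is allowed only in binary mode and makes open() return the raw object directly. A minimal sketch, assuming a hypothetical scratch file and using a second file handle in the same process as a stand-in for the external application:

```python
import io
import os
import tempfile

# Hypothetical scratch file with some initial content.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(b"old data")

    # buffering=0 returns the raw, unbuffered file object.
    with open(path, "rb", buffering=0) as raw:
        is_raw = isinstance(raw, io.RawIOBase)
        before = raw.read(8)

        # Stand-in for "another process": a second handle rewrites the file.
        with open(path, "r+b") as other:
            other.write(b"new data")

        # No Python-level buffer exists, so this read goes back to the OS.
        raw.seek(0)
        after = raw.read(8)
finally:
    os.remove(path)

print(is_raw, before, after)
```

The trade-off is that every read() then becomes a system call, so this is best reserved for files that really are modified externally.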
