Python: loading a 2GB text file into memory

pck*_*ben 6 python memory text-files

In Python 2.7, I want to load all the data from a 2.5GB text file into memory so I can process it faster:

>>> f = open('dump.xml','r')
>>> dump = f.read()

I get the following error:

Python(62813) malloc: *** mmap(size=140521659486208) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError

Why does Python try to allocate 140521659486208 bytes of memory for 2563749237 bytes of data? How can I fix the code so that it loads all the bytes?

I have about 3GB of RAM free. The file is a Wiktionary XML dump.

jos*_*inm 11

If you use mmap, you can map the whole file into memory at once.

import mmap

with open('dump.xml', 'rb') as f:
  # Size 0 will read the ENTIRE file into memory!
  m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # file is opened read-only

  # Proceed with your code here -- note the file is already in memory,
  # so "readline" here will be as fast as it gets
  data = m.readline()
  while data:
    # Do stuff
    data = m.readline()

  • @pckben That's because the file is opened read-only while mmap tries to map it read-write: add `prot=mmap.PROT_READ` to your `mmap.mmap` call and you'll be fine. (2 upvotes)
  • mmap is a memory map of the file. Accessing memory in the mapped region accesses the file. Whether the OS pre-buffers the whole file or only caches pages as they are accessed is part of its configuration ;-) (2 upvotes)
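To illustrate what the answer and comments describe, here is a minimal self-contained sketch: it memory-maps a small stand-in file (a hypothetical `dump.xml` written to a temp directory, since the real 2.5GB dump isn't available) and shows that the mapped object supports `readline`, byte slicing, and `find` like an in-memory bytes object. Note it uses `access=mmap.ACCESS_READ`, which is portable across platforms, instead of the Unix-only `prot=mmap.PROT_READ` from the answer.

```python
import mmap
import os
import tempfile

# Hypothetical tiny stand-in for dump.xml so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), 'dump.xml')
with open(path, 'wb') as f:
    f.write(b'<page>first</page>\n<page>second</page>\n')

with open(path, 'rb') as f:
    # access=mmap.ACCESS_READ is the portable spelling of a read-only mapping.
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(m.readline())       # first line, as bytes: b'<page>first</page>\n'
    print(m[0:6])             # slicing works like bytes: b'<page>'
    print(m.find(b'second'))  # byte offset of a substring: 25
    m.close()
```

Because the mapping is backed by the page cache rather than a single huge `malloc`, this avoids the allocation failure from the question even for files larger than free RAM.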