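I need to update the last line of several files that are each larger than 2 GB. They are made up of lines of text and are too big to read with readlines(). Currently, looping through the file line by line works fine. However, I am wondering whether there is a compiled library that can do this more efficiently. Thanks!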
myfile = open("large.XML")
for line in myfile:
    do_something()
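If this really is line-based (so a true XML parser isn't the best solution), mmap can help here.
mmap the file, then call .rfind(b'\n') on the resulting object (possibly with adjustments to handle a file that ends with a newline, when you really want the non-empty line before it rather than the empty "line" after it). You can then slice out the last line alone. If you need to modify the file in place, you can resize the file to trim (or add) the number of bytes corresponding to the difference between the line you sliced out and the new line, then write the new line back. This avoids reading or writing any more of the file than you need.
Example code (please comment if I've made a mistake):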
import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
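To see why the search stops at len(mm) - 1 and why 1 is added to the result, here is a small sketch that applies the same rfind call to plain bytes objects instead of an mmap. The helper name last_line_start is made up for illustration and is not part of the code above.

```python
def last_line_start(data: bytes) -> int:
    """Return the offset where the last line of data begins (hypothetical helper)."""
    # Stop the search one byte short of the end so a single trailing newline
    # is ignored; rfind returns -1 when nothing is found, so + 1 gives offset 0.
    return data.rfind(b'\n', 0, len(data) - 1) + 1


# Same content with and without a trailing newline, plus one-line inputs
for sample in (b'a\nb\nlast\n', b'a\nb\nlast', b'only line\n', b'only line'):
    start = last_line_start(sample)
    print(start, sample[start:].rstrip(b'\r\n'))
```

Note that input ending in two or more newlines would give an empty last line with this logic, so extra trailing newlines would need separate handling if they can occur.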
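Apparently, on some systems without mremap (e.g. OS X), mm.resize won't work. To support those systems, you would probably split the with statement (so the mmap closes before the file object) and use file-object-based seeks, writes, and truncates to fix up the file. The following example also includes the adjustment I mentioned earlier for Python 3.1 and older, using contextlib.closing, for completeness: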
import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # The mmap is closed at this point; fix up the file through the file object
    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
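For anyone who wants to try the seek/write/truncate variant without a multi-gigabyte file, here is a rough, self-contained sketch that wraps it in a function and runs it on a small temporary file. The names replace_last_line and transform are assumptions for illustration. It also maps the file with ACCESS_READ rather than ACCESS_WRITE, since the map is only used to locate and copy the old line, and it assumes Python 3.2+ and a non-empty file (mmap refuses to map an empty one).

```python
import mmap
import os
import tempfile


def replace_last_line(path, transform):
    """Rewrite the last line of a non-empty file in place (illustrative helper).

    transform is a hypothetical callable that takes the old line as str
    and returns the new line as str.
    """
    with open(path, 'r+b') as f:
        # Read-only map: used only to find and copy the last line
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = mm.rfind(b'\n', 0, len(mm) - 1) + 1
            old = mm[start:].rstrip(b'\r\n')
        new = transform(old.decode('utf-8')).encode('utf-8')
        # The map is closed here, so the file can be rewritten and truncated
        f.seek(start)
        f.write(new)
        f.truncate()


# Small demonstration on a temporary file
with tempfile.NamedTemporaryFile('wb', suffix='.xml', delete=False) as tmp:
    tmp.write(b'<a>\n<b/>\n</a>\n')
replace_last_line(tmp.name, lambda s: s.upper())
with open(tmp.name, 'rb') as f:
    result = f.read()
os.remove(tmp.name)
print(result)  # b'<a>\n<b/>\n</A>'
```

As in the examples above, the trailing newline is not written back; append b'\n' to the new line if the file should keep ending with one.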
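The advantage of mmap over any other approach is that using rfind lets Python find the newline quickly at the C layer (in CPython). Explicit seeks and reads on a file object could match the "only read a page or so" behavior, but you would have to hand-implement the search for the newline.

Caveat: This approach will not work if you are on a 32-bit system and the file is too large to map into memory (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped). On most 32-bit systems, even in a freshly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases you might have as much as 3-3.5 GB of user virtual address space (though you will lose some of the contiguous space to the heap, stack, executable mappings, etc.). mmap doesn't require much physical RAM, but it does need contiguous address space. One of the huge benefits of a 64-bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle on a 32-bit OS without added complexity. Most modern computers are 64-bit at this point, but it's definitely something to keep in mind if you are targeting 32-bit systems (and on Windows, even if the OS is 64-bit, a 32-bit version of Python may have been installed by mistake, so the same problem applies).

Here is another example that works on 32-bit Python even for huge files, assuming the last line isn't more than about 100 MB long (closing and the imports are omitted for brevity):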
with open("large.XML", 'r+b') as myfile:
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # The mmap is closed at this point; fix up the file through the file object
    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
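For comparison, the hand-implemented newline search with explicit seeks and reads could look roughly like the sketch below. This is only one possible way to write it: the function name find_last_line_start and the chunk_size parameter are made up for illustration, and io.BytesIO stands in for a real file opened in binary mode. It returns the same offset as the rfind call in the mmap examples.

```python
import io
import os


def find_last_line_start(f, chunk_size=4096):
    """Return the offset where the last line begins in binary file f.

    Scans backwards in blocks of chunk_size bytes and ignores a single
    trailing newline. Hypothetical helper, not a library function.
    """
    # Exclude the final byte from the search, like len(mm) - 1 in the mmap code
    end = f.seek(0, os.SEEK_END) - 1
    while end > 0:
        begin = max(0, end - chunk_size)
        f.seek(begin)
        pos = f.read(end - begin).rfind(b'\n')
        if pos != -1:
            return begin + pos + 1  # First byte after the newline
        end = begin  # No newline in this block; continue with the previous one
    return 0  # No newline found: the whole file is one line


# In-memory stand-in for a file; a tiny chunk size forces several iterations
demo = io.BytesIO(b'a\nb\nlast\n')
print(find_last_line_start(demo, chunk_size=2))
```

Once the offset is known, the same seek, write, and truncate steps apply. This avoids any address-space concerns, at the cost of doing the search loop in Python instead of a single rfind call.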