Bra*_*roy 7 python file-handling seek python-3.x
我正在尝试使用Python读取和处理大块文件。我正在关注这个博客,该博客提出了一种非常快速的方式来读取和处理散布在多个进程中的大量数据。我只稍微更新了现有代码,即使用stat(fin).st_sizeover os.path.getsize。在该示例中,我也未实现多处理,因为该问题在单个过程中也很明显。这使得调试更加容易。
我在这段代码中遇到的问题是它返回断句。这是有道理的:指针不考虑行尾,而只返回给定的字节大小。实际上,人们会假定您可以通过在提取的行中省略最后一项来解决此问题,因为这很可能是折线。不幸的是,这也不可靠。
from os import stat
def chunkify(pfin, buf_size=1024):
file_end = stat(pfin).st_size
with open(pfin, 'rb') as f:
chunk_end = f.tell()
while True:
chunk_start = chunk_end
f.seek(buf_size, 1)
f.readline()
chunk_end = f.tell()
yield chunk_start, chunk_end - chunk_start
if chunk_end > file_end:
break
def process_batch(pfin, chunk_start, chunk_size):
with open(pfin, 'r', encoding='utf-8') as f:
f.seek(chunk_start)
batch = f.read(chunk_size).splitlines()
# changing this to batch[:-1] will result in 26 lines total
return batch
if __name__ == '__main__':
fin = r'data/tiny.txt'
lines_n = 0
for start, size in chunkify(fin):
lines = process_batch(fin, start, size)
# Uncomment to see broken lines
# for line in lines:
# print(line)
# print('\n')
lines_n += len(lines)
print(lines_n)
# 29
Run Code Online (Sandbox Code Playgroud)
上面的代码将打印29为已处理行的总数。当您不返回批次的最后一项时,天真地假设那是一条折线,您将得到26。实际的行数是27。测试数据可以在下面找到。
She returned bearing mixed lessons from a society where the tools of democracy still worked.
If you think you can sense a "but" approaching, you are right.
Elsewhere, Germany take on Brazil and Argentina face Spain, possibly without Lionel Messi.
What sort of things do YOU remember best?'
Less than three weeks after taking over from Lotz at Wolfsburg.
The buildings include the Dr. John Micallef Memorial Library.
For women who do not have the genes, the risk drops to just 2% for ovarian cancer and 12% for breast cancer.
In one interview he claimed it was from the name of the Cornish language ("Kernewek").
8 Goldschmidt was out of office between 16 and 19 July 1970.
Last year a new law allowed police to shut any bar based on security concerns.
But, Frum explains: "Glenn Beck takes it into his head that this guy is bad news."
Carrying on the Romantic tradition of landscape painting.
This area has miles of undeveloped beach adjacent to the headlands.
The EAC was created in 2002 to help avoid a repeat of the disputed 2000 presidential election.
In May 1945, remnants of the German Army continue fight on in the Harz mountains, nicknamed "The Void" by American troops.
Dietler also said Abu El Haj was being opposed because she is of Palestinian descent.
The auction highlights AstraZeneca's current focus on boosting returns to shareholders as it heads into a wave of patent expiries on some of its biggest selling medicines including Nexium, for heartburn and stomach ulcers, and Seroquel for schizophrenia and bipolar disorder.
GAAP operating profit was $13.2 million and $7.1 million in the second quarter of 2008 and 2007, respectively.
Doc, Ira, and Rene are sent home as part of the seventh bond tour.
only I am sick of always hearing him called the Just.
Also there is Meghna River in the west of Brahmanbaria.
The explosives were the equivalent of more than three kilograms of dynamite - equal to 30 grenades," explained security advisor Markiyan Lubkivsky to reporters gathered for a news conference in Kyiv.
Her mother first took her daughter swimming at the age of three to help her with her cerebal palsy.
A U.S. aircraft carrier, the USS "Ticonderoga", was also stationed nearby.
Louis shocked fans when he unexpectedly confirmed he was expecting a child in summer 2015.
99, pp.
Sep 19: Eibar (h) WON 6-1
Run Code Online (Sandbox Code Playgroud)
如果打印出创建的行,您会发现确实出现了断句。我觉得这很奇怪。不应该f.readline()确保在下一行之前读取文件吗?在下面的输出中,空行将两个批次分开。这意味着您不能检查批处理中的下一行,如果它是子字符串,则不能将其删除-断句属于整个批处理中的另一批。
...
This area has miles of undeveloped beach adjacent to the headlands.
The EAC was created in 2002 to help avoid a repeat of the disputed 2000 presidential election.
In May 1945, r
In May 1945, remnants of the German Army continue fight on in the Harz mountains, nicknamed "The Void" by American troops.
...
Run Code Online (Sandbox Code Playgroud)
有没有办法消除这些断句而又不删掉太多?
您可以在此处下载更大的测试文件(100,000行)。
经过大量挖掘后,似乎实际上某些不可访问的缓冲区导致了seek的不一致行为,如此处和此处所讨论。我尝试了建议的解决方案以用于iter(f.readline, ''),seek但仍然给我不一致的结果。我已经更新了我的代码,以在每批1500行之后返回文件指针,但实际上批返回将重叠。
She returned bearing mixed lessons from a society where the tools of democracy still worked.
If you think you can sense a "but" approaching, you are right.
Elsewhere, Germany take on Brazil and Argentina face Spain, possibly without Lionel Messi.
What sort of things do YOU remember best?'
Less than three weeks after taking over from Lotz at Wolfsburg.
The buildings include the Dr. John Micallef Memorial Library.
For women who do not have the genes, the risk drops to just 2% for ovarian cancer and 12% for breast cancer.
In one interview he claimed it was from the name of the Cornish language ("Kernewek").
8 Goldschmidt was out of office between 16 and 19 July 1970.
Last year a new law allowed police to shut any bar based on security concerns.
But, Frum explains: "Glenn Beck takes it into his head that this guy is bad news."
Carrying on the Romantic tradition of landscape painting.
This area has miles of undeveloped beach adjacent to the headlands.
The EAC was created in 2002 to help avoid a repeat of the disputed 2000 presidential election.
In May 1945, remnants of the German Army continue fight on in the Harz mountains, nicknamed "The Void" by American troops.
Dietler also said Abu El Haj was being opposed because she is of Palestinian descent.
The auction highlights AstraZeneca's current focus on boosting returns to shareholders as it heads into a wave of patent expiries on some of its biggest selling medicines including Nexium, for heartburn and stomach ulcers, and Seroquel for schizophrenia and bipolar disorder.
GAAP operating profit was $13.2 million and $7.1 million in the second quarter of 2008 and 2007, respectively.
Doc, Ira, and Rene are sent home as part of the seventh bond tour.
only I am sick of always hearing him called the Just.
Also there is Meghna River in the west of Brahmanbaria.
The explosives were the equivalent of more than three kilograms of dynamite - equal to 30 grenades," explained security advisor Markiyan Lubkivsky to reporters gathered for a news conference in Kyiv.
Her mother first took her daughter swimming at the age of three to help her with her cerebal palsy.
A U.S. aircraft carrier, the USS "Ticonderoga", was also stationed nearby.
Louis shocked fans when he unexpectedly confirmed he was expecting a child in summer 2015.
99, pp.
Sep 19: Eibar (h) WON 6-1
Run Code Online (Sandbox Code Playgroud)
重叠批次的示例如下。最后一批的前两个半句子与之前一批的最后一个句子重复。我不知道如何解释或解决这个问题。
...
The EC ordered the SFA to conduct probes by June 30 and to have them confirmed by a certifying authority or it would deduct a part of the funding or the entire sum from upcoming EU subsidy payments.
Dinner for two, with wine, 250 lari.
It lies a few kilometres north of the slightly higher Weissmies and also close to the slightly lower Fletschhorn on the north.
For the rest we reached agreement and it was never by chance.
Chicago Blackhawks defeat Columbus Blue Jackets for 50th win
The only drawback in a personality that large is that no one els
For the rest we reached agreement and it was never by chance.
Chicago Blackhawks defeat Columbus Blue Jackets for 50th win
The only drawback in a personality that large is that no one else, whatever their insights or artistic pedigree, is quite as interesting.
Sajid Nadiadwala's reboot version of his cult classic "Judwaa", once again directed by David Dhawan titled "Judwaa 2" broke the dry spell running at the box office in 2017.
They warned that there will be a breaking point, although it is not clear what that would be.
...
Run Code Online (Sandbox Code Playgroud)
除此之外,我还尝试readline从原始代码中删除,并跟踪剩余的不完整块。然后将不完整的块传递到下一个块,并将其添加到其前面。我现在遇到的问题是,因为文本是按字节块读取的,所以可能发生块结尾而没有完全完成字符字节的情况。这将导致解码错误。
...
This area has miles of undeveloped beach adjacent to the headlands.
The EAC was created in 2002 to help avoid a repeat of the disputed 2000 presidential election.
In May 1945, r
In May 1945, remnants of the German Army continue fight on in the Harz mountains, nicknamed "The Void" by American troops.
...
Run Code Online (Sandbox Code Playgroud)
运行上面的代码,将不可避免地导致UnicodeDecodeError。
Traceback (most recent call last):
File "chunk_tester.py", line 46, in <module>
lines, left = process_batch(fin, start, size, last, left)
File "chunk_tester.py", line 24, in process_batch
chunk = f.read(chunk_size)
File "lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte
Run Code Online (Sandbox Code Playgroud)
你们离得太近了!对最终代码进行相对简单的更改(读取数据 asbytes和 not str)即可使其全部(几乎)正常工作。
主要问题是因为从二进制文件读取会计算bytes,但从文本文件读取会计算text,并且您第一次以字节为单位进行计数,第二次以字符为单位进行计数,导致您对已读取的数据的假设是错误的。这与内部隐藏缓冲区无关。
其他变化:
b'\n'而不是使用,并且只删除相关检测代码后面的bytes.splitlines()空行。chunkify否则可以用功能相同的更简单、更快的循环替换,而无需保持文件打开。这给出了最终代码:
from os import stat
def chunkify(pfin, buf_size=1024**2):
file_end = stat(pfin).st_size
i = -buf_size
for i in range(0, file_end - buf_size, buf_size):
yield i, buf_size, False
leftover = file_end % buf_size
if leftover == 0: # if the last section is buf_size in size
leftover = buf_size
yield i + buf_size, leftover, True
def process_batch(pfin, chunk_start, chunk_size, is_last, leftover):
with open(pfin, 'rb') as f:
f.seek(chunk_start)
chunk = f.read(chunk_size)
# Add previous leftover to current chunk
chunk = leftover + chunk
batch = chunk.split(b'\n')
# If this chunk is not the last one,
# pop the last item as that will be an incomplete sentence
# We return this leftover to use in the next chunk
if not is_last:
leftover = batch.pop(-1)
return [s.decode('utf-8') for s in filter(None, batch)], leftover
if __name__ == '__main__':
fin = r'ep+gutenberg+news+wiki.txt'
lines_n = 0
left = b''
for start, size, last in chunkify(fin):
lines, left = process_batch(fin, start, size, last, left)
if not lines:
continue
for line in lines:
print(line)
print('\n')
numberlines = len(lines)
lines_n += numberlines
print(lines_n)
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
344 次 |
| 最近记录: |